Apache Spark: how to set the outputCommitterClass property in Hadoop 2
I have been researching this problem for the past few weeks, and I haven't found a clear answer.
Here is the problem:
For Hadoop 1.x (the mapred lib), you can use a customized output committer with:

spark.conf.set("spark.hadoop.mapred.output.committer.class", "some committer")

or by calling JobConf.setOutputCommitter.
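As a minimal sketch of the Hadoop 1.x approach above (com.example.MyOutputCommitter is a hypothetical committer class standing in for "some committer"):

```scala
import org.apache.spark.sql.SparkSession

// Spark forwards any "spark.hadoop.*" key into the underlying Hadoop
// Configuration, so the old mapred API picks the committer up from there.
val spark = SparkSession.builder()
  .appName("custom-committer-demo")
  .config("spark.hadoop.mapred.output.committer.class",
          "com.example.MyOutputCommitter") // hypothetical class name
  .getOrCreate()
```

This only affects jobs that go through the old mapred output path; it is ignored by the Hadoop 2.x mapreduce lib, which is the subject of the question.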
However, in Hadoop 2.x (the mapreduce lib), the committer is obtained from OutputFormat.getOutputCommitter, and there is no clear answer on how to set the output committer.
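Because the mapreduce lib asks the OutputFormat for its committer, one workaround is to subclass an OutputFormat and override that method. A sketch, where MyCommitter is a hypothetical subclass of org.apache.hadoop.mapreduce.OutputCommitter that you would supply:

```scala
import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

// A custom OutputFormat that returns your committer instead of the
// default FileOutputCommitter. Use this format class when writing.
class MyOutputFormat[K, V] extends TextOutputFormat[K, V] {
  override def getOutputCommitter(context: TaskAttemptContext): OutputCommitter = {
    // MyCommitter is hypothetical; replace with your committer implementation.
    new MyCommitter(FileOutputFormat.getOutputPath(context), context)
  }
}
```

The drawback is that this only helps for code paths where you control which OutputFormat is used, which is not the case for most of Spark SQL's built-in writers.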
I found that Databricks sets the output committer using the property spark.sql.sources.outputCommitterClass.
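Setting that property would look like this (a sketch; as the answer below notes, it only applies to SQL/DataFrame writes, and the committer class shown is Netflix's, from later in the question):

```scala
// Applies only to Spark SQL / DataFrame output paths.
spark.conf.set(
  "spark.sql.sources.outputCommitterClass",
  "com.netflix.bdp.s3.S3DirectoryOutputCommitter")
```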
I tried Netflix's S3 committer (com.netflix.bdp.s3.S3DirectoryOutputCommitter), but in the log, Spark still uses the default committer:

17/09/13 22:39:36 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
17/09/13 22:39:36 INFO DirectFileOutputCommitter: Nothing to clean up since no temporary files were written.
17/09/13 22:39:36 INFO CSEMultipartUploadOutputStream: close closed:false s3://xxxx/testtable3/.hive-staging_hive_2017-09-13_22-39-34_140_3769635956945982238-1/-ext-10000/_SUCCESS
I'm wondering: is it possible to overwrite the default FileOutputCommitter and use a customized committer with the mapreduce lib?
How do I do it?
Not no; it's that I'm trying to fix MAPREDUCE-6823, where you'll be able to set the committer per filesystem schema. That won't surface for a while (Hadoop 3.1?).
You should be able to get away with setting the SQL output committer, though I'd check the path; it only kicks in for SQL/DataFrame work. You can set the Parquet one separately, though the committer you declare there must be a subclass of ParquetOutputCommitter, which the Netflix one isn't.
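The Parquet-specific setting mentioned above can be sketched as follows (com.example.MyParquetCommitter is a hypothetical class; whatever you use must extend org.apache.parquet.hadoop.ParquetOutputCommitter, which rules out Netflix's S3 committer):

```scala
// Committer used only for Parquet output; must subclass ParquetOutputCommitter.
spark.conf.set(
  "spark.sql.parquet.output.committer.class",
  "com.example.MyParquetCommitter") // hypothetical class name
```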