gobblin
[GOBBLIN-412] Set compaction's compression params
mr.compact.sample.txt Make compaction's compression params configurable through job config
JIRA
- https://issues.apache.org/jira/browse/GOBBLIN-412
Description
The following Hadoop parameters control compression:
- mapreduce.output.fileoutputformat.compress
- mapreduce.output.fileoutputformat.compress.codec
- mapreduce.output.fileoutputformat.compress.type
They are not passed on to Hadoop from the compaction job configuration file. In effect, their values are always picked up from mapred-site.xml.
As part of the fix, the following three new parameters are introduced to control compression behavior in the compaction job:
- compaction.job.mapreduce.output.fileoutputformat.compress
- compaction.job.mapreduce.output.fileoutputformat.compress.codec
- compaction.job.mapreduce.output.fileoutputformat.compress.type
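The mapping between the new job-level names and the Hadoop names can be illustrated with a minimal sketch. This is not the code from this PR: the class `CompactionCompressionProps` and the method `toHadoopOverrides` are hypothetical, and a plain `Map` stands in for the Hadoop configuration object. It only assumes that each new parameter is the Hadoop parameter with a `compaction.job.` prefix, as the names above suggest.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class CompactionCompressionProps {
    // Prefix carried by the three new job-level parameters.
    static final String PREFIX = "compaction.job.";

    // The Hadoop compression parameters named in the description.
    static final List<String> HADOOP_KEYS = List.of(
        "mapreduce.output.fileoutputformat.compress",
        "mapreduce.output.fileoutputformat.compress.codec",
        "mapreduce.output.fileoutputformat.compress.type");

    // Hypothetical helper: collect the prefixed job properties that are set
    // and return them under their Hadoop names. Unset keys are skipped so
    // that existing defaults are left untouched.
    static Map<String, String> toHadoopOverrides(Properties jobProps) {
        Map<String, String> overrides = new TreeMap<>();
        for (String hadoopKey : HADOOP_KEYS) {
            String value = jobProps.getProperty(PREFIX + hadoopKey);
            if (value != null) {
                overrides.put(hadoopKey, value);
            }
        }
        return overrides;
    }

    public static void main(String[] args) {
        Properties jobProps = new Properties();
        jobProps.setProperty("compaction.job.mapreduce.output.fileoutputformat.compress", "true");
        jobProps.setProperty("compaction.job.mapreduce.output.fileoutputformat.compress.codec",
            "org.apache.hadoop.io.compress.SnappyCodec");
        // Print each override as it would be applied to the Hadoop configuration.
        toHadoopOverrides(jobProps).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In a real job the returned entries would be applied to the Hadoop job configuration before submission; how this PR wires that in is not shown here.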
Tests
- Set compaction.job.mapreduce.output.fileoutputformat.compress and verify the output of the compaction job. Output should be compressed with the default codec.
- Reset compaction.job.mapreduce.output.fileoutputformat.compress and verify the output of the compaction job. Output shouldn't be compressed.
- Set compaction.job.mapreduce.output.fileoutputformat.compress and set compaction.job.mapreduce.output.fileoutputformat.compress.codec to org.apache.hadoop.io.compress.SnappyCodec, then verify the output of the compaction job. Output should be compressed with the Snappy codec.
- Set compaction.job.mapreduce.output.fileoutputformat.compress and set compaction.job.mapreduce.output.fileoutputformat.compress.codec to org.apache.hadoop.io.compress.DefaultCodec, then verify the output of the compaction job. Output should be compressed with the Deflate codec.
- Do not set the introduced parameters, and remove the mapreduce.output.fileoutputformat.compress and mapred.output.compress parameters from mapred-site.xml. Output should be compressed with the Deflate codec.
- Run test case (4) with compaction.job.mapreduce.output.fileoutputformat.compress.type set to RECORD and verify the output of the compaction job. Output should be compressed at the record level with the codec configured in that test case.
- Run test case (4) with compaction.job.mapreduce.output.fileoutputformat.compress.type set to BLOCK and verify the output of the compaction job. Output should be compressed at the block level with the codec configured in that test case.
- Remove the compression property from the Hadoop config and set compaction.job.mapreduce.output.fileoutputformat.compress in the job config. Compaction output should be compressed.
- Remove the compression property from the Hadoop config and reset compaction.job.mapreduce.output.fileoutputformat.compress in the job config. Compaction output shouldn't be compressed.
- Remove the compression property from the Hadoop config and remove compaction.job.mapreduce.output.fileoutputformat.compress from the job config. Compaction output should be compressed (this preserves the default behavior in code).
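The precedence these test cases imply for the compress flag can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the class `CompressionPrecedence` and the method `resolveCompress` are hypothetical, "reset" is read as setting the flag to false, and the ordering (job config first, then the Hadoop config, then the in-code default of compressed output) is inferred from the expected results listed above.

```java
import java.util.Optional;

public class CompressionPrecedence {
    // Assumed in-code default: compaction output is compressed when nothing is configured.
    static final boolean CODE_DEFAULT_COMPRESS = true;

    // Hypothetical resolver: the job-level value wins, then the Hadoop config value,
    // and the in-code default applies only when both are absent.
    static boolean resolveCompress(Optional<Boolean> jobValue, Optional<Boolean> hadoopValue) {
        return jobValue.orElseGet(() -> hadoopValue.orElse(CODE_DEFAULT_COMPRESS));
    }

    public static void main(String[] args) {
        // Hadoop property removed, job property set: compressed.
        System.out.println(resolveCompress(Optional.of(true), Optional.empty()));   // true
        // Hadoop property removed, job property reset: not compressed.
        System.out.println(resolveCompress(Optional.of(false), Optional.empty()));  // false
        // Both removed: the in-code default applies, so compressed.
        System.out.println(resolveCompress(Optional.empty(), Optional.empty()));    // true
    }
}
```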
Please review.
@sushantpande @htran1 ping to get traction again on this :)
@sushantpande @htran1 ^^
@htran1 @abti @sushantpande Resubmitted the PR as per the suggestions. https://github.com/apache/incubator-gobblin/pull/2386
@htran1 ^^