
[GOBBLIN-412] Set compaction's compression params

Open sushantpande opened this issue 7 years ago • 5 comments

mr.compact.sample.txt Make compaction's compression params configurable through job config

JIRA

  • https://issues.apache.org/jira/browse/GOBBLIN-412

Description

The following parameters, which control compression, are not passed on to Hadoop from the compaction job configuration file:

  1. mapreduce.output.fileoutputformat.compress
  2. mapreduce.output.fileoutputformat.compress.codec
  3. mapreduce.output.fileoutputformat.compress.type

As a result, their values are always picked up from mapred-site.xml.

The fix introduces three new parameters to control compression behavior in the compaction job:

  1. compaction.job.mapreduce.output.fileoutputformat.compress
  2. compaction.job.mapreduce.output.fileoutputformat.compress.codec
  3. compaction.job.mapreduce.output.fileoutputformat.compress.type
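As a rough illustration of the mapping described above, the sketch below strips the `compaction.job.` prefix from the three new job-config keys to obtain the Hadoop keys they are meant to control. It uses only `java.util` so it can run without Hadoop on the classpath. The class name `CompactionCompressionProps` and the method `toHadoopOverrides` are hypothetical and are not taken from the Gobblin code base; the actual fix may wire the values into the Hadoop `Configuration` differently.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, for illustration only (not Gobblin's actual implementation).
public class CompactionCompressionProps {
    // Prefix used by the three new job-config parameters listed in this issue.
    static final String PREFIX = "compaction.job.";

    // Hadoop keys that the issue says were not being passed through.
    static final String[] HADOOP_KEYS = {
        "mapreduce.output.fileoutputformat.compress",
        "mapreduce.output.fileoutputformat.compress.codec",
        "mapreduce.output.fileoutputformat.compress.type"
    };

    // Returns Hadoop key -> value for every prefixed key present in the job config.
    // Keys that are absent are skipped, so the Hadoop-side value stays untouched.
    static Map<String, String> toHadoopOverrides(Properties jobProps) {
        Map<String, String> overrides = new TreeMap<>();
        for (String hadoopKey : HADOOP_KEYS) {
            String value = jobProps.getProperty(PREFIX + hadoopKey);
            if (value != null) {
                overrides.put(hadoopKey, value.trim());
            }
        }
        return overrides;
    }

    public static void main(String[] args) {
        Properties jobProps = new Properties();
        jobProps.setProperty(PREFIX + HADOOP_KEYS[0], "true");
        jobProps.setProperty(PREFIX + HADOOP_KEYS[1],
            "org.apache.hadoop.io.compress.SnappyCodec");
        // Each entry printed here would be set on the job's Hadoop configuration.
        toHadoopOverrides(jobProps).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In a real job, each returned entry would presumably be applied to the MapReduce job's configuration before submission, so that it takes effect instead of the value in mapred-site.xml.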

Tests

  1. Set compaction.job.mapreduce.output.fileoutputformat.compress and verify output of compaction job. Output should be compressed with default codec.
  2. Reset compaction.job.mapreduce.output.fileoutputformat.compress and verify output of compaction job. Output shouldn't be compressed.
  3. Set compaction.job.mapreduce.output.fileoutputformat.compress and set compaction.job.mapreduce.output.fileoutputformat.compress.codec to org.apache.hadoop.io.compress.SnappyCodec and verify output of compaction job. Output should be compressed with Snappy codec.
  4. Set compaction.job.mapreduce.output.fileoutputformat.compress and set compaction.job.mapreduce.output.fileoutputformat.compress.codec to org.apache.hadoop.io.compress.DefaultCodec and verify output of compaction job. Output should be compressed with Deflate codec.
  5. Do not set the introduced parameters, and remove the mapreduce.output.fileoutputformat.compress and mapred.output.compress parameters from mapred-site.xml. Output should be compressed with Deflate codec.
  6. Run test cases (3) and (4) with compaction.job.mapreduce.output.fileoutputformat.compress.type set to RECORD and verify output of compaction job. Output should be compressed at record level, with Snappy codec for case (3) and Deflate codec for case (4).
  7. Run test cases (3) and (4) with compaction.job.mapreduce.output.fileoutputformat.compress.type set to BLOCK and verify output of compaction job. Output should be compressed at block level, with Snappy codec for case (3) and Deflate codec for case (4).
  8. Remove compression property from hadoop config and set compaction.job.mapreduce.output.fileoutputformat.compress in job config. Compaction output should be compressed.
  9. Remove compression property from hadoop config and reset compaction.job.mapreduce.output.fileoutputformat.compress in job config. Compaction output shouldn't be compressed.
  10. Remove compression property from hadoop config and remove compaction.job.mapreduce.output.fileoutputformat.compress from job config. Compaction output should be compressed (this preserves the code's default behavior).
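The expectations in tests 8 to 10 can be read as a simple resolution order for the compress flag. The sketch below encodes one such reading: the job-config value wins if present, otherwise the Hadoop-config value is used, otherwise compression defaults to on. The precedence of the job config over the Hadoop config is an assumption inferred from the test list, and the class `EffectiveCompression` and method `isCompressed` are hypothetical names, not Gobblin APIs.

```java
import java.util.Properties;

// Hypothetical model of the expected behavior, for illustration only.
public class EffectiveCompression {
    static final String HADOOP_KEY = "mapreduce.output.fileoutputformat.compress";
    static final String JOB_KEY = "compaction.job." + HADOOP_KEY;

    // Resolves whether compaction output is expected to be compressed.
    static boolean isCompressed(Properties jobConfig, Properties hadoopConfig) {
        // Assumption: a value in the job config overrides the Hadoop config.
        String value = jobConfig.getProperty(JOB_KEY);
        if (value == null) {
            value = hadoopConfig.getProperty(HADOOP_KEY);
        }
        if (value == null) {
            // Test 10: with neither source set, the code default is to compress.
            return true;
        }
        return Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        Properties job = new Properties();
        Properties hadoop = new Properties();
        job.setProperty(JOB_KEY, "false");
        // Mirrors test 9: property reset in job config, absent from Hadoop config.
        System.out.println(isCompressed(job, hadoop));
    }
}
```

A table-driven test over the ten cases could call a helper like this to state the expected result next to each combination of settings.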

sushantpande • Feb 23 '18 13:02

Please review.

sushantpande • Feb 23 '18 15:02

@sushantpande @htran1 ping to get traction again on this :)

abti • Mar 23 '18 19:03

@sushantpande @htran1 ^^

abti • May 04 '18 18:05

@htran1 @abti @sushantpande Resubmitted the PR as per the suggestions. https://github.com/apache/incubator-gobblin/pull/2386

SivaAccionLabs • Jun 14 '18 06:06

@htran1 ^^

abti • Jun 14 '18 23:06