DelegateCombineFileInputFormat Doesn't Honor CombineFileInputFormat.maxSplitSize

Open gsteelman opened this issue 10 years ago • 3 comments

I'm trying to use DelegateCombineFileInputFormat + LzoTextInputFormat + LzoTextOutputFormat. I'm also trying to specify the maxSplitSize for combining files. I've found that DelegateCombineFileInputFormat doesn't honor maxSplitSize, minSplitSizeNode, or minSplitSizeRack if they are configured before the job is run.

Per @jcoveney: "If there is a maxInputSplitSize in Hadoop's CombineFileInputFormat, no, it is not honored." See https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/util/SplitUtil.java#L35

I can see a couple approaches for a fix:

  1. SplitUtil.getCombinedSplitSize(Configuration): change it to first try getLong on COMBINE_SPLIT_SIZE and, if that is unset, fall back to CombineFileInputFormat's "mapreduce.input.fileinputformat.split.maxsize", which apparently isn't exposed as a static constant but is a hard-coded string.

  2. DelegateCombineFileInputFormat could set SplitUtil.COMBINE_SPLIT_SIZE equal to CombineFileInputFormat max split size if it was set. This same approach could be used for minSplitSizeNode and minSplitSizeRack. Where in DelegateCombineFileInputFormat would this go?
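To make approach 1 concrete, here is a rough sketch of the fallback lookup. It is not elephant-bird's actual code: a plain `Map<String, String>` stands in for Hadoop's `Configuration` so the snippet runs on its own, the value of `COMBINE_SPLIT_SIZE` is my assumption about what `SplitUtil` defines, and `defaultSize` is a hypothetical parameter for whatever default the real method uses.

```java
import java.util.HashMap;
import java.util.Map;

final class SplitSizeFallback {
    // ASSUMPTION: the string behind elephant-bird's SplitUtil.COMBINE_SPLIT_SIZE;
    // verify against the SplitUtil source before relying on it.
    static final String COMBINE_SPLIT_SIZE = "elephantbird.combine.split.size";
    // Hadoop key quoted in this issue.
    static final String HADOOP_SPLIT_MAXSIZE = "mapreduce.input.fileinputformat.split.maxsize";

    // Stand-in for Configuration.getLong(key, default) on a plain map.
    static long getLong(Map<String, String> conf, String key, long defaultValue) {
        String raw = conf.get(key);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        return Long.parseLong(raw.trim());
    }

    // Prefer the elephant-bird key, then the Hadoop max split size, then the default.
    static long getCombinedSplitSize(Map<String, String> conf, long defaultSize) {
        long ebSize = getLong(conf, COMBINE_SPLIT_SIZE, -1L);
        if (ebSize > 0) {
            return ebSize;
        }
        long hadoopMax = getLong(conf, HADOOP_SPLIT_MAXSIZE, -1L);
        if (hadoopMax > 0) {
            return hadoopMax;
        }
        return defaultSize;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(HADOOP_SPLIT_MAXSIZE, "268435456");
        // Only the Hadoop key is set here, so it is the value that gets used.
        System.out.println(getCombinedSplitSize(conf, 134217728L));
    }
}
```

Treating non-positive values as "unset" is a choice I made for the sketch; the real method may handle 0 or negative sizes differently.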

gsteelman avatar Sep 29 '14 21:09 gsteelman

I think we should do a mix of what you propose.

  1. DelegateCombineFileInputFormat should check for and honor the CombineFileInputFormat settings, and pass them to getCombinedSplitSize.

  2. getCombinedSplitSize should also check for both.

jcoveney avatar Sep 29 '14 21:09 jcoveney

That does seem like the safe approach. What about minSplitSizeNode and minSplitSizeRack? I'd think we would extract them from the conf using the CombineFileInputFormat (CFIF) config keys, then set them under elephant-bird's SplitUtil conf keys too. However, it doesn't appear that DelegateCombineFileInputFormat has any notion of a min size.
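Roughly what I have in mind for the min sizes, as a sketch only. The two Hadoop keys are the ones CombineFileInputFormat reads for the per-node and per-rack minimums. The two elephant-bird keys are hypothetical, since as far as I can tell elephant-bird defines no such keys today, and a plain `Map<String, String>` again stands in for `Configuration`.

```java
import java.util.HashMap;
import java.util.Map;

final class SplitSizePropagation {
    // Hadoop keys for CombineFileInputFormat's per-node and per-rack minimums.
    static final String HADOOP_MIN_PER_NODE = "mapreduce.input.fileinputformat.split.minsize.per.node";
    static final String HADOOP_MIN_PER_RACK = "mapreduce.input.fileinputformat.split.minsize.per.rack";
    // HYPOTHETICAL elephant-bird keys, invented for this sketch only.
    static final String EB_MIN_PER_NODE = "elephantbird.combine.split.minsize.per.node";
    static final String EB_MIN_PER_RACK = "elephantbird.combine.split.minsize.per.rack";

    // Copy fromKey's value to toKey unless fromKey is unset or toKey already
    // has a value; returns true only when a copy actually happened.
    static boolean copyIfUnset(Map<String, String> conf, String fromKey, String toKey) {
        String value = conf.get(fromKey);
        if (value == null || conf.containsKey(toKey)) {
            return false;
        }
        conf.put(toKey, value);
        return true;
    }

    // Mirror the Hadoop minimums into the (hypothetical) elephant-bird keys
    // without overriding anything the user set explicitly.
    static void propagateMinSizes(Map<String, String> conf) {
        copyIfUnset(conf, HADOOP_MIN_PER_NODE, EB_MIN_PER_NODE);
        copyIfUnset(conf, HADOOP_MIN_PER_RACK, EB_MIN_PER_RACK);
    }
}
```

Copying only when the target key is unset keeps an explicit elephant-bird setting authoritative, which seems like the least surprising behavior. The copied values would still do nothing until the split-combining code actually reads them.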

gsteelman avatar Sep 29 '14 21:09 gsteelman

I've created a pull request: https://github.com/kevinweil/elephant-bird/pull/420

gsteelman avatar Sep 30 '14 01:09 gsteelman