elephant-bird
DelegateCombineFileInputFormat Doesn't Honor CombineFileInputFormat.maxSplitSize
I'm trying to use DelegateCombineFileInputFormat + LzoTextInputFormat + LzoTextOutputFormat. I'm also trying to specify the maxSplitSize for combining files. I've found that DelegateCombineFileInputFormat doesn't honor maxSplitSize, minSplitSizeNode, or minSplitSizeRack if they are configured before the job is run.
Per @jcoveney: "If there is a maxInputSplitSize in Hadoop's CombineFileInputFormat no, it is not honored." See https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/util/SplitUtil.java#L35
I can see a couple approaches for a fix:
- SplitUtil.getCombinedSplitSize(Configuration): change it to first try getLong on COMBINE_SPLIT_SIZE and, if that is not set, fall back to CombineFileInputFormat's "mapreduce.input.fileinputformat.split.maxsize", which apparently is a hard-coded string rather than a static constant.
- DelegateCombineFileInputFormat could set SplitUtil.COMBINE_SPLIT_SIZE to CombineFileInputFormat's max split size if that was set. The same approach could be used for minSplitSizeNode and minSplitSizeRack. Where in DelegateCombineFileInputFormat would this go?
I think we should do a mix of what you propose.
- DelegateCombineFileInputFormat should check for and honor the CombineFileInputFormat settings, and pass them to getCombinedSplitSize.
- getCombinedSplitSize should also check for both.
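For illustration, here is a rough sketch of the "check for both" lookup. It is not the actual SplitUtil code: a plain Map stands in for Hadoop's Configuration so the snippet runs on its own, the class name SplitSizeSketch is made up, the value given for COMBINE_SPLIT_SIZE is an assumption about elephant-bird's key, and the default is passed in by the caller instead of being derived from the block size.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the lookup discussed above; not elephant-bird's real class.
final class SplitSizeSketch {
    // Assumed value of SplitUtil.COMBINE_SPLIT_SIZE (check it against the real constant).
    static final String COMBINE_SPLIT_SIZE = "elephantbird.combine.split.size";
    // Hadoop key quoted earlier in this thread.
    static final String HADOOP_MAX_SPLIT_SIZE = "mapreduce.input.fileinputformat.split.maxsize";

    // Prefer the elephant-bird key, then the Hadoop key, then the caller's default.
    static long getCombinedSplitSize(Map<String, String> conf, long defaultSize) {
        String value = conf.get(COMBINE_SPLIT_SIZE);
        if (value == null) {
            value = conf.get(HADOOP_MAX_SPLIT_SIZE);
        }
        return value == null ? defaultSize : Long.parseLong(value.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Only the Hadoop key is set, so the fallback path is taken.
        conf.put(HADOOP_MAX_SPLIT_SIZE, String.valueOf(256L * 1024 * 1024));
        System.out.println(getCombinedSplitSize(conf, 64L * 1024 * 1024));
    }
}
```

With a real Configuration the same order would presumably be two nested getLong calls, with the Hadoop lookup serving as the default for the elephant-bird one.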
That does seem like the safe approach. What about minSplitSizeNode and minSplitSizeRack? I'd think we would extract them from the conf using the CFIF config keys, then set them under EB's SplitUtil conf keys too. However, it doesn't appear that DelegateCFIF has any notion of a min size.
I've created a pull request: https://github.com/kevinweil/elephant-bird/pull/420