scalding
scalding copied to clipboard
InputSizeEstimator should use right paths to estimate size.
to discuss the right implementation of #1652.
Original problem:
- original implementation was based on
tap.getPath, but some sources return the wrong path of source files/dirs because internal implementation of this sources needsJobConfto compute the right path (for instance VKVS).
Basically, it should be implemented inside tap.getSize(JobConf), but unfortunately, cascading implementation of getSize in Hfs tap do calculations without respect of paths with glob patterns (it will raise java.io.FileNotFoundException for path like /logs/*).
A Safer approach to address this:
- Calculate input size only for
Hfstaps - Ask
tap.getSize(JobConf)as the first step, if Tap understands how to return the right size of the source, we should use it. - If
tap.getSize(JobConf)return 0 or raise any exception, we should try to calculate size by ourselves.
the question is:
- If we can not calculate the size of
Hfstaps, should we fail? or we should disable InputSizeEstimator and fallback to defaults? - If we can not calculate the size only for one of
Hfstap?
cc: @isnotinvain / @johnynek / @piyushnarang
The worries with fallbacks in case of issues calculating the size of a given Hfs is that it's less likely for us to catch issues (not too many users will be paying attention to their logs and seeing 1 or n taps failed to be size estimated) and I think like we've discussed in different contexts, it might result in different number of reducers in different runs. I think the other risk with swallowing 1 of n tap's failing to estimate is that it might have been the tap with the largest amount of data.
Would it make sense to allow users to make the fallback to defaults explicit if they really want it? If they don't say they want a defaults fallback, we fail in case of issues?
How about we fix Hfs tap's .getSize method? Or create a sub-class of Hfs like GlobHfs whose .getSizes method works correctly?
I don't think we should ignore exceptions. Best guesses usually lead to more headaches for us than failing fast.
@isnotinvain
- Hfs tap is cacading code, I'm not sure that we can fix it.
- Create GlobHfs is possible, but then we should change everywhere usage of Hfs to GlobHfs, that hard to test and more dangerous.
@isnotinvain Try to adopt your suggestion about sub-class of Hfs. please check again #1652