scalding icon indicating copy to clipboard operation
scalding copied to clipboard

InputSizeEstimator should use right paths to estimate size.

Open dieu opened this issue 8 years ago • 4 comments

to discuss the right implementation of #1652.

Original problem:

  1. original implementation was based on tap.getPath, but some sources return the wrong path of source files/dirs because internal implementation of this sources needs JobConf to compute the right path (for instance VKVS).

Basically, it should be implemented inside tap.getSize(JobConf), but unfortunately, cascading implementation of getSize in Hfs tap do calculations without respect of paths with glob patterns (it will raise java.io.FileNotFoundException for path like /logs/*).

A Safer approach to address this:

  1. Calculate input size only for Hfs taps
  2. Ask tap.getSize(JobConf) as the first step, if Tap understands how to return the right size of the source, we should use it.
  3. If tap.getSize(JobConf) return 0 or raise any exception, we should try to calculate size by ourselves.

the question is:

  1. If we can not calculate the size of Hfs taps, should we fail? or we should disable InputSizeEstimator and fallback to defaults?
  2. If we can not calculate the size only for one of Hfs tap?

cc: @isnotinvain / @johnynek / @piyushnarang

dieu avatar Mar 10 '17 19:03 dieu

The worries with fallbacks in case of issues calculating the size of a given Hfs is that it's less likely for us to catch issues (not too many users will be paying attention to their logs and seeing 1 or n taps failed to be size estimated) and I think like we've discussed in different contexts, it might result in different number of reducers in different runs. I think the other risk with swallowing 1 of n tap's failing to estimate is that it might have been the tap with the largest amount of data.

Would it make sense to allow users to make the fallback to defaults explicit if they really want it? If they don't say they want a defaults fallback, we fail in case of issues?

piyushnarang avatar Mar 20 '17 17:03 piyushnarang

How about we fix Hfs tap's .getSize method? Or create a sub-class of Hfs like GlobHfs whose .getSizes method works correctly?

I don't think we should ignore exceptions. Best guesses usually lead to more headaches for us than failing fast.

isnotinvain avatar Mar 20 '17 19:03 isnotinvain

@isnotinvain

  1. Hfs tap is cacading code, I'm not sure that we can fix it.
  2. Create GlobHfs is possible, but then we should change everywhere usage of Hfs to GlobHfs, that hard to test and more dangerous.

dieu avatar Mar 22 '17 00:03 dieu

@isnotinvain Try to adopt your suggestion about sub-class of Hfs. please check again #1652

dieu avatar Mar 22 '17 19:03 dieu