
Known issues with FileSource path validation

Open isnotinvain opened this issue 8 years ago • 0 comments

Some discussion can be seen in #1474 about the "best effort" nature of FileSource's validations, but I'm opening this ticket to track the currently known issues.

Empty directories w/ Success Files:

Currently, SuccessFileSource does not accept an empty directory that contains a SUCCESS file. #1591 adds support for this case, so it is addressed there.

hdfsPaths using compact globs vs expanded globs

Using glob patterns to find files and directories has some issues that make correctness difficult. For example, if hdfsPaths returns Seq("a/b/c/{x,y,z}/*"), this appears (in my opinion) to be a request from the user for directories x, y, and z to all be valid according to the pathIsGood rules. However, because it's a single glob, we can only validate whatever files come back as matches to that glob. If directory x simply doesn't exist, we can't treat that as an error, because nothing about directory x shows up in the list of files that match the pattern. The only way around this would be to parse the glob pattern and recognize that a {} clause was used.
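To make the blind spot concrete, here is a minimal sketch that uses the JDK's local-filesystem glob matcher as a stand-in for HDFS globbing. That substitution is an assumption, and HDFS glob semantics may differ in details. The names GlobBlindSpot, matches, and demo are hypothetical and not part of scalding. Directory z is never created, yet the result just contains fewer matches and carries no hint that z was requested.

```scala
import java.nio.file.{FileSystems, Files, Path}

object GlobBlindSpot {
  // Returns every path under `root` (relative to it) that matches `glob`.
  // Local-filesystem stand-in for an HDFS glob lookup (assumption).
  def matches(root: Path, glob: String): Seq[String] = {
    val matcher = FileSystems.getDefault.getPathMatcher("glob:" + glob)
    val stream = Files.walk(root)
    try {
      val it = stream.iterator()
      val found = Seq.newBuilder[String]
      while (it.hasNext) {
        val rel = root.relativize(it.next())
        if (matcher.matches(rel)) found += rel.toString.replace('\\', '/')
      }
      found.result().sorted
    } finally stream.close()
  }

  // Creates a/b/c/x and a/b/c/y (one file each) but NOT a/b/c/z,
  // then evaluates the compact glob from the example above.
  def demo(): Seq[String] = {
    val root = Files.createTempDirectory("glob-demo")
    for (dir <- Seq("x", "y")) {
      val d = Files.createDirectories(root.resolve("a/b/c").resolve(dir))
      Files.createFile(d.resolve("part-00000"))
    }
    matches(root, "a/b/c/{x,y,z}/*")
  }
}
```

The caller only sees files under x and y. Telling "z is missing" apart from "z was never asked for" would require inspecting the pattern itself.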

The current workaround, which TimePathedSource* uses, is that instead of returning a single glob like Seq("a/b/c/{x,y,z}/*") from hdfsPaths, it returns Seq("a/b/c/x/*", "a/b/c/y/*", "a/b/c/z/*"). This way each item in hdfsPaths can be validated separately, and the case where x is missing entirely is handled. This is a somewhat confusing API, though not entirely incorrect.
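The expansion from a compact glob to one pattern per alternative can be sketched as below. This is only an illustration of the idea, not how TimePathedSource builds its paths. GlobExpansion and expandBraces are hypothetical names, and the sketch assumes braces are neither nested nor escaped.

```scala
object GlobExpansion {
  // Expands each {a,b,c} group in `pattern` into separate patterns,
  // left to right. Nested or escaped braces are not handled (assumption).
  def expandBraces(pattern: String): Seq[String] = {
    val open = pattern.indexOf('{')
    val close = pattern.indexOf('}', open + 1)
    if (open < 0 || close < 0) Seq(pattern) // nothing to expand
    else {
      val prefix = pattern.substring(0, open)
      val suffix = pattern.substring(close + 1)
      val alternatives = pattern.substring(open + 1, close).split(",", -1).toSeq
      for {
        alt  <- alternatives
        rest <- expandBraces(suffix) // expand any later groups too
      } yield prefix + alt + rest
    }
  }
}
```

With this, expandBraces("a/b/c/{x,y,z}/*") yields one pattern per directory, so each can be checked on its own and a pattern with zero matches can be reported as missing.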

Race conditions

I don't have all the details on this, but validation needs to happen at several points: in createTaps, in validateTaps, and then when the job is actually launched. Time can pass between these steps, and the filesystem may have changed in the meantime. I think some PRs have been merged to address this, but we can add more details here as they come up.

Others

I'm sure there are more subtle issues w/ FileSource's use of glob patterns (one that I think was fixed was requiring a trailing /* in some cases and not allowing one in others). We can use this ticket to track other issues relating to FileSource and validating hdfs paths.

Potential Refactoring for the future

One idea would be to make a non-glob-based FileSource that, instead of accepting glob patterns, simply accepts a list of directories or files that the user wants to load. These directories and files can be validated directly, and all the glob-related issues could be bypassed. I think one reason we have been using globs is name node performance (a single globStatus call is one name node RPC). We could revisit whether the cost of querying per directory is too high, or look into a way to do a batch query to the name node.
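A rough sketch of what direct validation of an explicit list could look like follows. ExplicitPathValidation and missingPaths are hypothetical names, and the existence check is passed in as a function so the sketch stays independent of any filesystem. With Hadoop one could presumably pass something built on FileSystem.exists, at the cost of one call per path.

```scala
object ExplicitPathValidation {
  // Returns the requested paths for which `exists` is false, in input order.
  // An empty result means every requested input is present.
  def missingPaths[P](paths: Seq[P])(exists: P => Boolean): Seq[P] =
    paths.filterNot(exists)
}
```

Because every requested path is checked individually, a directory that is absent shows up in the result instead of silently dropping out of a glob match.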

isnotinvain, Sep 28 '16 21:09