
Missing glob support when reading files

Open mtsargent opened this issue 6 years ago • 7 comments

When reading multiple files at once with Spark, I would expect to use wildcards/other general glob patterns (similar to the answer https://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd/24036343). Example repeated here for simplicity:

sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

When using Stocator, attempting to read files in this way fails:

val junkcsv = spark.sqlContext.read.option("header", "true").load("cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*")

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*;

This failure happens even when there are files I would expect to match that pattern like:

cos://some-bucket.myCos/somefile.csv/part-00000.csv
cos://some-bucket.myCos/somefile.csv/part-00001.csv

The lack of glob support seems to be coming from the ObjectStoreFlatGlobFilter class: https://github.com/CODAIT/stocator/blob/c18f37b6dfc119e5ebfd2bf12c57de989e4a5ad5/src/main/java/com/ibm/stocator/fs/common/ObjectStoreFlatGlobFilter.java#L128-L134

The only type of matching attempted is a simple wildcard match, rather than an actual attempt at globbing.

The java.nio package may be able to support this type of matching. I have not yet built a custom version of Stocator, but the following matching code seems promising:

PathMatcher pm = FileSystems.getDefault().getPathMatcher("glob:" + pathPattern.replaceAll("//", "/"));
Path newPath = FileSystems.getDefault().getPath(pathStr);

match = pm.matches(newPath);

I am not familiar enough with the rest of the Stocator codebase to know if adding in this type of matching breaks other parts of the code drastically.
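To make the idea above concrete, here is a minimal, self-contained sketch of the java.nio matching applied to a list of object keys. It is not Stocator code: the class name GlobMatchSketch, the filter helper, and the choice to match on keys without the cos:// scheme and bucket are all assumptions made for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of Stocator.
public class GlobMatchSketch {

    // Returns the keys that match the glob under java.nio glob semantics.
    // Keys are assumed to be relative object names (no scheme, no bucket).
    static List<String> filter(String glob, List<String> keys) {
        PathMatcher pm = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String key : keys) {
            Path p = FileSystems.getDefault().getPath(key);
            if (pm.matches(p)) {
                matched.add(key);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "somefile.csv/part-00000.csv",
            "somefile.csv/part-00001.csv",
            "somefile.csv/part-00002.csv");
        // The bracket expression [0-1] matches exactly one character.
        System.out.println(filter("somefile.csv/part-0000[0-1]*", keys));
    }
}
```

Note that in java.nio globs a single * does not cross a directory separator, so a real integration would have to decide how object-name "directories" map onto that rule.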

mtsargent avatar Oct 25 '19 19:10 mtsargent

@mtsargent you are not supposed to access the parts of a file directly. This is general Hadoop ecosystem usage: parts are internal files created by distributed tasks. You should never access parts directly; instead, use ("cos://some-bucket.myCos/somefile.csv"), and then the globber is supported, of course.

gilv avatar Oct 26 '19 06:10 gilv

Fair point about part files, but would you anticipate the stocator globber to work with non-part files?

Suppose I try to use this to read in multiple files:

"cos://some-bucket.myCos/file-00[0-2]*"

Would you expect this to read in all of the following from my COS bucket?

file-000.txt file-001.txt file-002.txt

While also ignoring other files. Example:

file-003.txt file-004.txt

I suppose I can just set up this scenario and test it out.

mtsargent avatar Oct 26 '19 14:10 mtsargent

@mtsargent I expect exactly what you wrote. If this doesn't work, then it's a bug in Stocator and needs to be fixed, of course.

gilv avatar Oct 26 '19 14:10 gilv

@mtsargent However, it's not clear how ranges in [x-y] should work; it is important to know whether they are numeric or literal. For example, with [aaxy-xyba], what would you expect to get? There might be thousands of objects, so how would they be identified? Or do you need only numeric ranges, so that [1-100] would be 1, 2, ..., 99, 100?

gilv avatar Oct 26 '19 14:10 gilv

I think each expression in brackets only corresponds to a single character. The syntax I am familiar with is described here: http://man7.org/linux/man-pages/man7/glob.7.html. [aaxy-xyba] would be the same as a single character match out of [abxy], and [1-100] would be a single character match the same as writing [01] or [0-1].
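As a quick check of the single-character reading, the sketch below uses java.nio's glob matcher (one implementation of this syntax, not Stocator's globber) to compare [1-100] with [01]. The class and method names are hypothetical. The [aaxy-xyba] case is left out because it contains a descending range (y-x), which implementations may treat differently.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical sketch: shows that a bracket expression matches one character.
public class BracketSemanticsSketch {

    // True if the single name matches the glob under java.nio semantics.
    static boolean matchesName(String glob, String name) {
        PathMatcher pm = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return pm.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        // [1-100] is the range 1-1 plus the characters 0 and 0, i.e. the set {0, 1}.
        for (String name : new String[] {"0", "1", "5", "100"}) {
            System.out.println(name
                + ": [1-100]=" + matchesName("[1-100]", name)
                + " [01]=" + matchesName("[01]", name));
        }
    }
}
```

Under this reading, "100" does not match [1-100], because the whole name would have to be a single character from the set.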

At the very least, I can set up this test next time I am around my work computer. I can update this issue one way or the other (and can close the issue if matching works as expected).

mtsargent avatar Oct 26 '19 14:10 mtsargent

@mtsargent Thanks. I think we support {} right now and [] is not supported, but I need to double-check. At least I don't see unit tests for [], only for {}: https://github.com/CODAIT/stocator/blob/master/src/test/java/com/ibm/stocator/fs/cos/systemtests/TestCOSGlobberBracketStocator.java

Would you be able to extend the code to also support []? It would be great if you could work on it.
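One possible starting point, sketched here under assumptions rather than taken from the Stocator codebase, is to rewrite simple bracket expressions into the brace alternation that is already covered by tests. The helper below (hypothetical name BracketToBraceSketch.expand) turns [0-2] into {0,1,2}. It deliberately leaves negated sets such as [!abc] untouched and does not handle escapes or commas inside the set. Whether the existing {} handling would accept its output has not been verified.

```java
import java.util.TreeSet;

// Hypothetical helper, not part of Stocator.
public class BracketToBraceSketch {

    // Rewrites simple bracket expressions ([0-2], [abx]) as brace alternation.
    static String expand(String glob) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < glob.length()) {
            char c = glob.charAt(i);
            int close = (c == '[') ? glob.indexOf(']', i + 1) : -1;
            String body = (close > i + 1) ? glob.substring(i + 1, close) : "";
            // Copy verbatim: ordinary characters, empty sets, negated sets.
            if (body.isEmpty() || body.charAt(0) == '!' || body.charAt(0) == '^') {
                out.append(c);
                i++;
                continue;
            }
            // Collect the distinct characters named by the set, in sorted order.
            TreeSet<Character> chars = new TreeSet<>();
            for (int j = 0; j < body.length(); j++) {
                char lo = body.charAt(j);
                if (j + 2 < body.length() && body.charAt(j + 1) == '-') {
                    char hi = body.charAt(j + 2);
                    for (int x = lo; x <= hi; x++) {
                        chars.add((char) x);
                    }
                    j += 2; // skip the '-' and the range end
                } else {
                    chars.add(lo);
                }
            }
            out.append('{');
            boolean first = true;
            for (char ch : chars) {
                if (!first) {
                    out.append(',');
                }
                out.append(ch);
                first = false;
            }
            out.append('}');
            i = close + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("file-00[0-2]*"));
    }
}
```

A rewrite like this keeps the change local to pattern preprocessing, at the cost of producing long alternations for wide ranges such as [a-z].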

gilv avatar Oct 27 '19 07:10 gilv

This may be something I can try to take on. It likely wouldn't be for a few weeks at the earliest.

mtsargent avatar Oct 28 '19 02:10 mtsargent