druid Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Fixes https://github.com/apache/druid/pull/12659.

Description

Removed:

import org.apache.commons.io.FilenameUtils;

Add:

import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

org.apache.commons.io.FilenameUtils failed to stop at the correct folder structure.

Assert.assertTrue(FilenameUtils.wildcardMatch("db/date=2022-08-01/001.parquet", "db/date=2022-08-01/*.parquet"));

// This proved that FilenameUtils did the wrong thing. It should have been false.
Assert.assertTrue(FilenameUtils.wildcardMatch("db/date=2022-08-01/_junk/0/001.parquet", "db/date=2022-08-01/*.parquet"));

java.nio.file.FileSystems does the right thing.

PathMatcher m2 = FileSystems.getDefault().getPathMatcher("glob:db/date=2022-08-01/*.parquet");
Assert.assertTrue(m2.matches(Paths.get("db/date=2022-08-01/001.parquet")));
Assert.assertFalse(m2.matches(Paths.get("db/date=2022-08-01/_junk/0/001.parquet")));

Without being able to stop at arbitrary subfolders, the utility of glob (*) syntax is severely limited.

Key changed/added classes in this PR

S3InputSource
OssInputSource
GoogleCloudStorageInputSource
AzureInputSource
CloudObjectInputSource

This PR has:

[x] been self-reviewed.
- [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
[ ] added documentation for new or modified features or behaviors.
[ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
[ ] added or updated version, license, or notice information in licenses.yaml
[ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
[x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
[ ] added integration tests.
[x] been tested in a test Druid cluster.

Sep 04 '22 15:09 didip

This pull request introduces 2 alerts when merging 5ee673ab6d02c210efed99b6a1e1c1ecd92a1d2f into 6805a7f9c243445b95090e9aa777512f6411b506 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Sep 05 '22 19:09 lgtm-com[bot]

This pull request introduces 2 alerts when merging 172521a9ecc3d01519f53a2158ac9e2643611916 into 6805a7f9c243445b95090e9aa777512f6411b506 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Sep 05 '22 21:09 lgtm-com[bot]

The last error I am seeing seems unrelated to my work.

Sep 06 '22 03:09 didip

Hmm. IMO, we should definitely change something, since the behavior of applying FilenameUtils.wildcardMatch to a path is just really weird. For example, this returns true:

FilenameUtils.wildcardMatch("a/b/c.txt", "a*.txt")

Which is weird since no real shell works this way. Generally it is expected that * does not match /.

This patch swaps in a globbing implementation where * doesn't match /, and changes the examples to use **.suffix (which does match / in this implementation) instead of *.suffix.

However, there is another way to fix it that IMO is better. We could change the filter glob to match file names rather than paths. Then, *.suffix would still work fine. It's closer to what the local input source does. It's also close to what the find Unix command does when you do find [directory] -name [glob]. (It searches in some directory, and its subdirectories, for files whose names match the provided glob.)

I like this way better because it avoids the awkward ** construction in the examples, and avoids the need for people to think about entire paths in their minds: they can simply think about the file names. (One reason to avoid working with entire paths is that gets weird with cloud files. Like, in s3://a/b, will the path-glob be applied to s3://a/b, or /a/b/, or /b, or b? Better to dodge the question entirely by using names, i.e. apply it to b alone.)

Sep 07 '22 08:09 gianm

I do like the find like approach as well. That's easier to understand. It might not be as powerful but I will happily trade that off with the ease of use and understanding.

Sep 07 '22 09:09 abhishekagarwal87

@gianm question, if it only matched just the filename, we lose the ability to filter out junk subfolders per example below.

("db/date=2022-08-01/001.parquet", "db/date=2022-08-01/*.parquet"));
("db/date=2022-08-01/_junk/0/001.parquet", "db/date=2022-08-01/*.parquet"));

The parquet filename in _junk can be too similar to the parquet filenames that we actually want to consume.

Sometimes, we don't have control over the input folder, this globing technique will allow us to do so.

Sep 10 '22 01:09 didip

That's true, it's an ease-of-use vs. power tradeoff. I am legitimately worried that people will find it confusing to reason about whether, for example, prefix s3://a/b plus filter b/*txt applies to s3://a/b.txt or s3://a/b/c.txt or both or neither.

Is this case (where there is a _junk subdirectory that we want to ignore) common? If it doesn't happen, we don't need to worry about it. If it happens, but is uncommon, maybe we can deal with it some other way?

Sep 11 '22 17:09 gianm

@gianm I would say it is quite common when people are massaging data using Spark into Iceberg.

Sep 11 '22 22:09 didip

FWIW, I think it's best if the feature lets us read data folders that standard big data pipelines generate. Someone will likely ask for it soon enough if it's a common pattern. so just like that, I am going to change my vote again and say that we go ahead with this glob notation :)

Sep 12 '22 05:09 abhishekagarwal87

@gianm I would say it is quite common when people are massaging data using Spark into Iceberg.

@didip would you mind giving an example of how people would use the filter glob to read Iceberg data? I'm not familiar with how Iceberg stores data, so this would help me understand how the feature is likely to be used.

I continue to be concerned about the confusingness of whole-path globs, so, I do think if we ship the feature then we should be really clear about how it works. Docs should explain what string is used as the path for the match. For example, if your prefix is s3://mybucket/myprefix, and there is an object s3://mybucket/myprefix/foo/bar.txt then is the path /foo/bar.txt (the part after the prefix) or is it foo/bar.txt (the part after the prefix, with leading / stripped), or is it myprefix/foo/bar.txt (the entire S3 object key)?

We could also offer both filter (name glob) and pathFilter (whole-path glob) options. That way, consistency with local input source is preserved (where its filter applies to filenames only), and also, users with simple use cases that don't involve Iceberg integration can have a more-intuitive name-based matching.

Sep 13 '22 00:09 gianm

@gianm Sure thing.

All of our customers (data scientists & data engineers, hundreds of them) use Spark to massage data for Druid to consume. They typically do a number of transformations and at the end the push raw parquet files to a specific S3 bucket+path for Druid.

Now, these people have varying proficiency in Spark. Some are extremely good, some are barely passable. Spark does a lot of things behind the scene and they usually involved in creating _temporary folders, e.g. during merge or shuffling. These are the junk folders I talked about.

It's incredibly tedious to chase these down and remind my customers to clean up (because these folders are not errors and they are upstream, outside of my Druid workflow). Some are capable to clean up, some completely ignored my requests (but later complained when their count data didn't match because the parquet files inside _temporary folder is accidentally ingested).

As for the confusion with s3://mybucket/path/to/parquet confusion, I can provide a convenience method to strip them out, that should help reducing confusion.

As for filename filter, I am not sure if it's that useful if you have the full path glob feature. It might introduce another confusion by having 2 different filter JSON attributes.

Sep 13 '22 01:09 didip

@gianm ok, I made an improvement, if users put blob store scheme and bucket, the filtering logic will ignore them.

Sep 13 '22 02:09 didip

This pull request introduces 2 alerts when merging dd925886305618531977d7ef5c39b54e23dddeb8 into 08d6aca52865ab7bcbf75e16c4ca153bd3ca06f3 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Sep 13 '22 04:09 lgtm-com[bot]

This pull request introduces 2 alerts when merging 56426aa0d3c3b200896047395727bf4c2dc27531 into 08d6aca52865ab7bcbf75e16c4ca153bd3ca06f3 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Sep 13 '22 07:09 lgtm-com[bot]

@gianm @abhishekagarwal87 Any further thought?

Oct 01 '22 15:10 didip

I really think using **.parquet is a small price to pay. Especially considering that this is a standard Java syntax.

Oct 01 '22 15:10 didip

I'm considering a few things about the design:

We want the power of doing a whole-path glob. @didip describes scenarios where this additional power makes it more useful than a filename-only glob: https://github.com/apache/druid/pull/13027#issuecomment-1244769844
We use filter for filename globs in the local input source. So, for consistency I think it's best for filter to mean filename glob. @abhishekagarwal87 pointed this out in a comment: https://github.com/apache/druid/pull/13027#discussion_r985988603
I am concerned that there is some ambiguity with how whole-path globs should work. They need to be "rooted" somewhere, and there's various choices for where to root them.

There is another problem with the connection between the last two. In this patch, filter on cloud files input sources applies to both prefixes and uris. But filter in local input sources only applies to baseDir (similar to prefixes), not to files (similar to uris). IMO, best to avoid the term filter altogether for cloud files for now. Otherwise these inconsistencies will become a point of confusion.

I think we can solve all of the above problems by calling this new feature objectGlob, and having it apply to the object part of the uri regardless of whether prefixes or uris is specified. By "object part", I mean that for s3://foo/bar/baz then the glob applies to bar/baz.

Thoughts?

Oct 04 '22 05:10 gianm

@gianm I definitely like calling this feature as a new property because blob store path is apparently different enough than regular file system path. I am down calling it objectGlob.

Agree with this portion as well:

and having it apply to the object part of the uri regardless of whether prefixes or uris is specified. By "object part", I mean that for s3://foo/bar/baz then the glob applies to bar/baz.

Oct 04 '22 23:10 didip

This pull request introduces 2 alerts when merging 9eedc53ffbbb3ea0b2b0aacc8e394efb4a8e7f86 into 41e51b21c3396a78c7a6b736bcb87903876991b2 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Oct 05 '22 07:10 lgtm-com[bot]

ok, @gianm what do you think now?

Oct 05 '22 16:10 didip

@abhishekagarwal87 @gianm thought?

Oct 11 '22 03:10 didip

@gianm @abhishekagarwal87 ok, I addressed the 3 comments. We'll see what the CI results look like.

Oct 17 '22 04:10 didip

This pull request introduces 2 alerts when merging 207e6d60980139ab6ab9afad45882e195230d138 into 3bbb76f17bab32b3d3e12a472e8403affeb09108 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Oct 17 '22 06:10 lgtm-com[bot]

This pull request introduces 2 alerts when merging d101b2864684a4899351d76b9b3af0158ebba095 into d98c808d3f087645d8dcfd516e77b0058bdfd389 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Oct 25 '22 00:10 lgtm-com[bot]

@gianm @abhishekagarwal87 ok, the rest of the failures are due to unrelated Python errors. What do you think now?

Oct 25 '22 02:10 didip

This pull request introduces 2 alerts when merging 98633fb91014ff553f60cfec1183453eabc78f4a into d98c808d3f087645d8dcfd516e77b0058bdfd389 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Oct 25 '22 03:10 lgtm-com[bot]

This pull request introduces 2 alerts when merging 37a026cfcb8ad860d444a7370b63048fdbe6c1f2 into d851985cf5b0faf8e6630c3011dc80288e959f28 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Oct 29 '22 04:10 lgtm-com[bot]

An unrelated typescript check failed:

 FAIL  e2e-tests/multi-stage-query.spec.ts (302.447 s)
  ● Multi-stage query › runs a query that reads external data
    thrown: "Exceeded timeout of 300000 ms for a test.
    Use jest.setTimeout(newTimeout) to increase the timeout value, if this is a long-running test."

@abhishekagarwal87 @gianm Any thought on the PR so far?

Oct 29 '22 06:10 didip

This pull request introduces 2 alerts when merging 4c89a1bfbb010c1ac3d76a4fe0d71a13d5813d2d into 675fd982fb5ca274b057495a90563ecc248ad823 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Oct 31 '22 18:10 lgtm-com[bot]

This pull request introduces 2 alerts when merging 14069f1d563c02260add3da56cf521fc22b930ef into 176934e849d50a2360f50db7771c1775110308f3 - view on LGTM.com

new alerts:

2 for Uncontrolled data used in path expression

Nov 01 '22 19:11 lgtm-com[bot]

druid druid copied to clipboard

Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Description

Key changed/added classes in this PR

druid
druid copied to clipboard