[draft] HADOOP-18679. Add API for bulk/paged object deletion: Iceberg PoC
This is a PoC for using the HADOOP-18679 "Add API for bulk/paged object deletion" API through iceberg. It relies on all FileSystem implementations supporting it, at least for a page size of 1.
https://github.com/apache/hadoop/pull/6726
This PR is in sync with https://github.com/apache/hadoop/pull/6686 ; the dynamic binding classes in here are from that PR (which also copies in the parquet DynMethods classes for consistency everywhere).
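For orientation, here is a minimal sketch of what a direct (non-reflective) call through the new API looks like, assuming the HADOOP-18679 shape (FileSystem.createBulkDelete(Path) returning a closeable BulkDelete with pageSize() and bulkDelete(Collection<Path>)); the class and method names below are illustrative, not the actual HadoopFileIO wiring:

```java
import java.io.IOException;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.BulkDelete;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative helper: delete a batch of paths under a common base path. */
class BulkDeleteSketch {

  static void deleteBatch(FileSystem fs, Path base, List<Path> paths) throws IOException {
    // createBulkDelete() has a default implementation in FileSystem, so every
    // filesystem supports it, with a page size of at least 1.
    try (BulkDelete bulkDelete = fs.createBulkDelete(base)) {
      int pageSize = bulkDelete.pageSize();
      for (int start = 0; start < paths.size(); start += pageSize) {
        Collection<Path> page = paths.subList(start, Math.min(start + pageSize, paths.size()));
        // the returned list holds the paths which could not be deleted, with an error string each
        List<Map.Entry<Path, String>> failures = bulkDelete.bulkDelete(page);
        if (!failures.isEmpty()) {
          throw new IOException("failed to delete " + failures.size() + " objects under " + base);
        }
      }
    }
  }
}
```

The paging loop matters because only the object stores report a page size greater than 1; local FS and HDFS just delete one path per call.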
The latest update runs the tests against the local FS, parameterized on using/not using bulk delete; the library settings have been modified to use hadoop 3.4.1-SNAPSHOT for this.
It works, and execution time on large file deletion (added to the listPrefix test) is comparable, but there are now two codepaths to test and maintain.
I plan to coalesce them so test coverage is better, even on older builds
- going to create a separate test repository to run an iceberg build against s3/abfs etc. with a different hadoop build; this will do end-to-end testing.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
For anyone watching this, there's a full integration test suite in the hadoop test code: https://github.com/apache/hadoop/pull/7285
All is good, though as it's the first Java 17 code and depends on an iceberg jar with this patch in, it's actually dependent on this PR going in first. It does show that yes, bulk deletes through the S3A code do work, without me having to add a significant piece of test code to iceberg to wire up HadoopFileIO to the minio docker container that S3FileIO uses. It also makes for a good regression test: anything downstream of both would otherwise get neglected.
Anyway, with those tests happy, I just have a few more tests to write (mixing some local files too) and then I'll be confident this is ready for review
Closing as is because now that Hadoop 3.4.1 is the hadoop version, this can be done without reflection.(*)
Doing that in a new PR; this one exists to show the reflection-based solution.
(*) update: it can't, as the classpaths are inconsistent; core is built with hadoop 3.3.x. Some of the other modules are on 3.4.1 so could be used for testing.
The failures here are due to mocking; if a mocked Hadoop FS lacks the new bulk delete API, hadoop common's attempt to invoke it will fail to link even though there's a default implementation in the base FileSystem class.
Any UnsupportedOperationException raised is now caught and switches to the non-bulk-IO operations.
It does not disable bulk IO, though that is always an option here, since this failure is not going to get better on retry. However, it should be impossible to encounter in production.
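For reference, a minimal sketch of that catch-and-fallback path, under the same API assumptions as the earlier sketch (deleteSingly() here is a hypothetical stand-in for the existing one-by-one delete code):

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.BulkDelete;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class FallbackSketch {

  /** Try the bulk delete API; if it is unsupported, fall back to classic single deletes. */
  static void deleteFiles(FileSystem fs, Path base, List<Path> paths) throws IOException {
    try (BulkDelete bulkDelete = fs.createBulkDelete(base)) {
      // page-size batching and failure handling elided; see the earlier sketch
      bulkDelete.bulkDelete(paths);
    } catch (UnsupportedOperationException e) {
      // raised by mocked/stub filesystems which don't implement the new API;
      // switch to the pre-existing one-by-one delete path for this call only
      deleteSingly(fs, paths);
    }
  }

  private static void deleteSingly(FileSystem fs, List<Path> paths) throws IOException {
    for (Path p : paths) {
      fs.delete(p, false /* non-recursive */);
    }
  }
}
```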
My overall response is that we have too much configuration and abstraction for this.
I feel like in HadoopFileIO we can just use reflection to check if the bulk methods exist and use bulk delete if supported. I'm not sure we need the additional configuration or the additional reflection utilities to support this.
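A minimal sketch of that check using plain java.lang.reflect, with no extra utility classes (names illustrative):

```java
import java.lang.reflect.Method;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class BulkDeleteProbe {

  // resolved once; null means the running hadoop-common predates HADOOP-18679
  private static final Method CREATE_BULK_DELETE = lookup();

  private static Method lookup() {
    try {
      return FileSystem.class.getMethod("createBulkDelete", Path.class);
    } catch (NoSuchMethodException e) {
      return null; // older hadoop-common on the classpath: use single deletes
    }
  }

  static boolean bulkDeleteAvailable() {
    return CREATE_BULK_DELETE != null;
  }
}
```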
@danielcweeks
I feel like in HadoopFileIO we can just use reflection to check if the bulk methods exist and use bulk delete if supported. I'm not sure we need the additional configuration or the additional reflection utilities to support this.
Happy to do that, and now that the disaster of the aws sdk updates is out of the way I can look at this. I may have to make broader changes to some other modules to get their mocking tests out of the way.
yeah, classpaths are a mess in places, primarily due to downstream things:
- spark pulling in 3.3.6. These need to be excluded and then the new versions pulled in.
- hive-metastore pulling in some hadoop-2.7.6 server artifacts (!!). Those can just be excluded, as iceberg-kafka already does.
It does highlight the problem though: there are versions of spark where the installations may be 3.3.6 even though iceberg is building against 3.4.1. That bulk delete API is about the only thing which has changed filesystem-wise.
What to do?
- I'm doing the bulk delete without reflection as requested
- catch all failures invoking it, downgrade to classic invocation, and disable future attempts to use it (see the sketch after this list)
- Leave those spark imports of 3.3.6 alone, on the basis they reflect real-world deployments
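To make the second option concrete, a sketch of the one-way downgrade switch (illustrative only, not the PR's actual code; same API assumptions as the sketches above):

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.hadoop.fs.BulkDelete;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class GuardedBulkDelete {

  // one-way switch: flips to false on the first failure and stays false
  private final AtomicBoolean bulkDeleteEnabled = new AtomicBoolean(true);

  void deleteFiles(FileSystem fs, Path base, List<Path> paths) throws IOException {
    if (bulkDeleteEnabled.get()) {
      try (BulkDelete bulkDelete = fs.createBulkDelete(base)) {
        bulkDelete.bulkDelete(paths); // page-size batching elided for brevity
        return;
      } catch (UnsupportedOperationException | NoSuchMethodError e) {
        bulkDeleteEnabled.set(false); // downgrade all future calls to classic deletes
      }
    }
    for (Path p : paths) {
      fs.delete(p, false); // classic one-by-one invocation
    }
  }
}
```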
The fallback logic is unnecessary complexity; we should just rely on the bulk delete if available and log/throw if individual deletes fail.
There are a lot of comments in the code that are unnecessary or just commentary and don't help with the readability.
Happy to cut them if that's the project style.
The majority of the docs section is unnecessary. If there are specific values that should be set for S3A we should link to the docs for S3A, but we don't need to go into detail about features/classpath deconfliction/etc. Adding a no-config feature really shouldn't require the docs changes at all.
I was trying to include some hints for performance, but can put them in the hadoop docs instead
@danielcweeks there's too much patch history in there for the chained patch application to apply cleanly... it's complaining about a file which was later deleted.
Proposed:
- squash all but the changes I've added after your last review
- rebase onto main.
Keeping the changes since the last review separate makes it clear what has changed, while the squash avoids the patch-apply chain issues.
Does this seem OK?
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
Not sure what went wrong with flink; rebasing and testing locally to make sure that it is unrelated to my PR.
The failure is because the spark 3.4 runtimes are using older hadoop releases.
@danielcweeks so here we are
- no reflection
- forced update of hadoop version on spark 3.4 and 3.5.
That forced update isn't great, but if iceberg compiles with hadoop 3.4.1 then it's already possibly broken somewhere else when run there. We put a lot of effort into backward compatibility of APIs, but it's really hard to do forward compatibility (openFile(), createFile() and other builders are the best we have here).
A real risk is some overloaded method compiling down differently, different exception signatures, etc.
That's independent of this PR, which simply finds the problem faster.
Ignoring that fundamental issue, I could tweak this one to look for the presence of the new API classes via a loadResource call, and not attempt bulk delete if they aren't found. Provided all uses of bulk delete are isolated to a guarded method, this wouldn't exacerbate the existing issue.
How to test whether that worked? Remove the forced dependency updates from the spark versions.
Thoughts? It's easily done, and isn't the full reflection game, just a probe and a guard
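To make the probe-and-guard idea concrete, a sketch of a classpath probe which only enables the bulk path when the new classes are actually present at runtime (illustrative; the resource name assumes the API class lives at org.apache.hadoop.fs.BulkDelete):

```java
import org.apache.hadoop.fs.FileSystem;

class BulkDeleteCapability {

  // true only if the runtime hadoop-common carries the HADOOP-18679 classes,
  // even when iceberg itself was compiled against 3.4.1
  private static final boolean BULK_DELETE_ON_CLASSPATH = probe();

  private static boolean probe() {
    return FileSystem.class.getClassLoader()
        .getResource("org/apache/hadoop/fs/BulkDelete.class") != null;
  }

  static boolean available() {
    return BULK_DELETE_ON_CLASSPATH;
  }
}
```

Because class references are only resolved when the method using them is first invoked, keeping every use of BulkDelete inside a method that is only called when available() returns true should avoid the linkage failure on older hadoop releases.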
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
aah, this got closed while I was off on a european-length vacation. PITA.
@steveloughran PRs get auto-closed after a certain period but you can always revive it
@nastra thanks. I think I'll split the "move everything to hadoop 3.4.1 libs" build changes from the code changes, so that it is the build changes which highlight the spark 3.4 issues.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.