[FLINK-33584][Filesystems] Update Hadoop Filesystem dependencies to 3.3.6
What is the purpose of the change
This PR supersedes https://github.com/apache/flink/pull/23740 and additionally adopts the changed Hadoop API usage.
Brief change log
- cherry-pick from https://github.com/apache/flink/pull/23740
- flink-filesystems/flink-s3-fs-hadoop/src/main/java/org/apache/flink/fs/s3hadoop/HadoopS3AccessHelper.java
Verifying this change
This change is already covered by existing tests
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (yes)
- The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
- The S3 file system connector: (no)
Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
CI report:
- 3e03c79a112fe53095b9e798d0c2ba1ebd549c1f Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:
- @flinkbot run azure: re-run the last Azure build
I'll run the S3 tests locally before merging it!
Summary of the situation per @snuyanzin as discussed offline:
It seems HADOOP-18382 broke the previous upgrade attempts by making the S3 client completely private.
this is the commit to 3.3.5 where they merged it https://github.com/apache/hadoop/commit/c2bbff9ea13b730f95c825d355bf4f5153651efc#diff-4050f95b7e3912145415b6e2f9c[…]2ce96bf0980c6c30a6edad2d0fbR1831
The problem is that they introduced WriteOperationHelperCallbacks together with a private implementation, WriteOperationHelperCallbacksImpl, which must be non-null in the S3 case and has to be passed to the WriteOperationHelper constructor. We cannot instantiate
WriteOperationHelperCallbacksImpl since its constructor is private. Since it deals with AmazonS3, in theory we could provide our own implementation, but then we still need the AmazonS3 object itself to work with inside our
WriteOperationHelperCallbacks implementation, and all the setters are private. In the earlier PR, null was passed for this value, which is the reason for the NPE.
Currently the only way I see is Calcite-like hacks: we copy their class into the Flink repo, change the visibility of WriteOperationHelperCallbacksImpl, and reuse it in another way. They have getAmazonS3ClientForTesting, marked as @VisibleForTesting and, I guess, used only in tests, but it allows access to the AmazonS3 object. Once we have access to that object, we can implement our own version of WriteOperationHelperCallbacks and use it. The drawback is that they are going to remove exactly this method soon as well.
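For illustration, the workaround described above would look roughly like the following pseudocode sketch. Only WriteOperationHelperCallbacks, WriteOperationHelper, and getAmazonS3ClientForTesting are names from the discussion; everything else (including the exact callback method set and constructor parameters) is hypothetical and must be checked against the Hadoop 3.3.6 sources:

```
// Pseudocode sketch -- not validated against Hadoop 3.3.6.
// Step 1: copy Hadoop's WriteOperationHelperCallbacksImpl into the Flink
//         repo and widen its visibility (the upstream constructor is private).
// Step 2: obtain the raw client via the @VisibleForTesting accessor
//         (scheduled for removal upstream, so this is fragile).
AmazonS3 s3 = s3aFileSystem.getAmazonS3ClientForTesting("flink-s3-fs-hadoop");

// Step 3: back the callbacks with that client instead of Hadoop's
//         private implementation (FlinkWriteOperationHelperCallbacks is
//         a hypothetical Flink-side class delegating to the AmazonS3 object).
WriteOperationHelperCallbacks callbacks = new FlinkWriteOperationHelperCallbacks(s3);

// Step 4: pass the non-null callbacks to the WriteOperationHelper
//         constructor, avoiding the NPE seen with the earlier PR.
```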
I noticed that Spark also has a similar problem for the same reason, and it is still unresolved: https://issues.apache.org/jira/browse/SPARK-38958
Hi, thanks for the PR! Just a kind ping: could we resolve this with the solution mentioned by @MartijnVisser? We tried to use positioned reads with ByteBuffer on Hadoop to improve performance, but found they are only supported from Hadoop 3.3.0 onwards.
Could we resolve this by the solution
See the comment above; how would you resolve the situation?
Most likely we'll have to rewrite/reimplement the entire Hadoop-based filesystem implementations, since they've moved from the V1 API to the V2 API for the AWS SDK. So either upgrade to 3.4.0 (that's where the upgrade was done), or the 3.3.7-aws version.
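As an aside on the positioned ByteBuffer reads mentioned above: Hadoop 3.3.0 added ByteBufferPositionedReadable with read(long position, ByteBuffer buf). On older Hadoop versions the same semantics can be emulated by saving the stream position, reading at the requested offset, and restoring the position. The sketch below shows that fallback pattern against a plain java.nio channel so it is self-contained; the helper name readAt is ours, not a Hadoop API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionedReadDemo {

    // Emulates the shape of Hadoop's ByteBufferPositionedReadable#read
    // (available from Hadoop 3.3.0) on a plain NIO channel: remember the
    // current position, read at the requested offset, then restore it so
    // the sequential read position is unaffected.
    static int readAt(SeekableByteChannel ch, long position, ByteBuffer buf) throws IOException {
        long saved = ch.position();
        try {
            ch.position(position);
            return ch.read(buf);
        } finally {
            ch.position(saved);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pread", ".bin");
        Files.write(tmp, "hello, positioned read".getBytes(StandardCharsets.US_ASCII));
        try (SeekableByteChannel ch = Files.newByteChannel(tmp, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(10);
            int n = readAt(ch, 7, buf); // read 10 bytes starting at offset 7
            buf.flip();
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            System.out.println(n + ":" + new String(out, StandardCharsets.US_ASCII));
            // prints "10:positioned"
        }
        Files.deleteIfExists(tmp);
    }
}
```

With Hadoop 3.3.0+, the equivalent call would go through FSDataInputStream after checking the stream capability for ByteBuffer positioned reads, since not every underlying filesystem supports it.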