
[FLINK-33584][Filesystems] Update Hadoop Filesystem dependencies to 3.3.6

[Open] snuyanzin opened this issue 1 year ago • 6 comments

What is the purpose of the change

This is the PR superseding https://github.com/apache/flink/pull/23740, plus adapting to the changed Hadoop API usage.

Brief change log

  • cherry-pick from https://github.com/apache/flink/pull/23740
  • flink-filesystems/flink-s3-fs-hadoop/src/main/java/org/apache/flink/fs/s3hadoop/HadoopS3AccessHelper.java

Verifying this change

This change is already covered by existing tests

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

snuyanzin avatar Nov 30 '23 13:11 snuyanzin

CI report:

  • 3e03c79a112fe53095b9e798d0c2ba1ebd549c1f Azure: SUCCESS
Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

flinkbot avatar Nov 30 '23 13:11 flinkbot

I'll run the S3 tests locally before merging it!

MartijnVisser avatar Dec 02 '23 10:12 MartijnVisser

Summary of the situation per @snuyanzin as discussed offline:

It seems HADOOP-18382 broke what previous upgrades relied on, by making the S3 client completely private.

This is the commit in 3.3.5 where they merged it: https://github.com/apache/hadoop/commit/c2bbff9ea13b730f95c825d355bf4f5153651efc#diff-4050f95b7e3912145415b6e2f9c[…]2ce96bf0980c6c30a6edad2d0fbR1831

The problem is that they introduced WriteOperationHelperCallbacks along with a private implementation, WriteOperationHelperCallbacksImpl, which must be non-null in the S3 case and has to be passed to the WriteOperationHelper constructor. We cannot create a WriteOperationHelperCallbacksImpl ourselves, since its constructor is private. Because it deals with AmazonS3, we could in theory provide our own implementation, but we would still need the AmazonS3 object itself to work with inside a WriteOperationHelperCallbacks implementation, and all the setters are private. The PR currently passed null as the value for this, which is the reason for the NPE.

Currently the only way I see is Calcite-like hacks: we copy their class into the Flink repo, change the visibility of WriteOperationHelperCallbacksImpl, and reuse it another way. They have getAmazonS3ClientForTesting, which is marked @VisibleForTesting and, I guess, only used in tests, but it allows access to the AmazonS3 object. Once we have access to that object, we could implement our own version of WriteOperationHelperCallbacks and use it. The drawback is that they are going to remove exactly this method soon as well.
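A rough sketch of that idea follows. It assumes the nested callbacks interface and the two methods it declares in the 3.3.5/3.3.6 sources (selectObjectContent, completeMultipartUpload) plus the @VisibleForTesting client accessor; the class name FlinkWriteOperationHelperCallbacks is hypothetical, and this is an illustration, not merged code:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.CompleteMultipartUploadResult;
import com.amazonaws.services.s3.model.SelectObjectContentRequest;
import com.amazonaws.services.s3.model.SelectObjectContentResult;
import org.apache.hadoop.fs.s3a.S3AFileSystem;
import org.apache.hadoop.fs.s3a.WriteOperationHelper;

/** Hypothetical Flink-side callbacks built on the test-only client accessor. */
final class FlinkWriteOperationHelperCallbacks
        implements WriteOperationHelper.WriteOperationHelperCallbacks {

    private final AmazonS3 s3;

    FlinkWriteOperationHelperCallbacks(S3AFileSystem fs) {
        // @VisibleForTesting accessor; the drawback noted above is that
        // Hadoop plans to remove it, so this hack has a shelf life.
        this.s3 = fs.getAmazonS3ClientForTesting("flink-s3-fs-hadoop");
    }

    @Override
    public SelectObjectContentResult selectObjectContent(
            SelectObjectContentRequest request) {
        return s3.selectObjectContent(request);
    }

    @Override
    public CompleteMultipartUploadResult completeMultipartUpload(
            CompleteMultipartUploadRequest request) {
        return s3.completeMultipartUpload(request);
    }
}
```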

I noticed that Spark has a similar problem for the same reason, and it is still unresolved: https://issues.apache.org/jira/browse/SPARK-38958

MartijnVisser avatar Dec 28 '23 14:12 MartijnVisser

Hi, thanks for the PR! Just a kind ping: could we resolve this with the solution mentioned by @MartijnVisser? We tried to use positioned reads with ByteBuffer on Hadoop to improve performance, but found they are only supported from 3.3.0 on.
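For reference, a minimal sketch of the positioned ByteBuffer read we are after, available from Hadoop 3.3.0 via ByteBufferPositionedReadable on FSDataInputStream; the path, offset, and buffer size here are illustrative:

```java
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.StreamCapabilities;

public final class PositionedByteBufferRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/tmp/data.bin"))) {
            ByteBuffer buf = ByteBuffer.allocate(4096);
            // Even on Hadoop 3.3.x, not every stream implements
            // ByteBufferPositionedReadable, so probe the capability first.
            if (in.hasCapability(StreamCapabilities.PREADBYTEBUFFER)) {
                in.readFully(1024L, buf); // fill buf starting at file offset 1024
            } else {
                // Fallback: classic positioned read into a byte[], then copy.
                byte[] tmp = new byte[buf.remaining()];
                in.readFully(1024L, tmp);
                buf.put(tmp);
            }
            buf.flip(); // buffer now ready for consumption
        }
    }
}
```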

masteryhx avatar Apr 22 '24 02:04 masteryhx

Could we resolve this by the solution

See the comment above; how would you resolve the situation?

MartijnVisser avatar Apr 22 '24 07:04 MartijnVisser

Most likely we'll have to rewrite/reimplement the Hadoop-based filesystem implementations entirely, since Hadoop has moved from the V1 API to the V2 API of the AWS SDK. So either upgrade to 3.4.0 (that's where the migration was done), or to the 3.3.7-aws version.
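To illustrate the scale of that rewrite, the two SDK generations expose completely different client types and request styles; a minimal comparison, with placeholder bucket and key:

```java
// AWS SDK for Java V1 — what the s3a code in Hadoop <= 3.3.x is written against:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
// AWS SDK for Java V2 — what Hadoop moved to for 3.4.0:
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public final class SdkV1VsV2 {
    public static void main(String[] args) {
        // V1: mutable clients and requests under com.amazonaws.*.
        AmazonS3 v1 = AmazonS3ClientBuilder.defaultClient();
        v1.getObject("bucket", "key"); // placeholder bucket/key

        // V2: different packages and immutable builder-based requests,
        // so V1-based integrations need real rework, not just a version bump.
        try (S3Client v2 = S3Client.create()) {
            v2.getObject(GetObjectRequest.builder()
                    .bucket("bucket").key("key").build());
        }
    }
}
```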

MartijnVisser avatar Apr 22 '24 07:04 MartijnVisser