[FLINK-30975] Upgrade hadoop to 3.4.X
What is the purpose of the change
Hadoop 3.4.X introduced 2500+ bug fixes and support for AWS SDK V2. Note that AWS SDK V1 reaches end of life on 12/31/2025.
One challenge is maintaining support for Presto, which still uses AWS SDK V1 and has not been updated yet. While Hadoop 3.4.X has support for AWS SDK V1, a couple of wrapper classes are needed to absorb the SDK changes, so that Presto can stay on AWS SDK V1 while the Hadoop filesystem moves to AWS SDK V2.
Brief change log
- update hadoop to 3.4.2
- provide wrapper classes to support AWS SDK V1/V2
Verifying this change
This change is already covered by existing tests and adds new test coverage:
Existing tests:
- All existing unit tests for flink-s3-fs-hadoop (HadoopS3FileSystemTest, HadoopS3FileSystemsSchemesTest) pass with Hadoop 3.4.2 and AWS SDK V2
- All existing unit tests for flink-s3-fs-presto continue to pass with AWS SDK V1
- Integration tests (HAJobRunOnHadoopS3FileSystemITCase, S5CmdOnHadoopS3FileSystemITCase) verify S3 functionality with the new SDK
New/updated tests:
- Converted S3FileSystemMinioTest and PrestoS3FileSystemMinioTest E2E tests to JUnit framework
- Both tests verify write, read, and delete operations against MinIO (S3-compatible storage)
- Tests confirm that both Hadoop (SDK V2) and Presto (SDK V1) filesystems work correctly
Manual verification:
- Verified AWS SDK V1 is completely removed from flink-s3-fs-hadoop JAR (0 classes from com.amazonaws.*)
- Verified AWS SDK V1 remains in flink-s3-fs-presto JAR as expected
- Confirmed multipart upload operations work correctly with new HadoopS3AccessHelper
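The JAR check in the first two manual-verification items could be reproduced with something like the sketch below. It is not part of this PR: `JarPackageCounter` and its method name are hypothetical, and the path to the built JAR is left as a command-line argument because it depends on the build. It only counts `.class` entries by name, so classes relocated by shading would appear under a different prefix and would not be counted.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical helper (not part of this PR): counts .class entries under a package in a JAR. */
public final class JarPackageCounter {

    private JarPackageCounter() {}

    /** Returns the number of .class entries whose path lies under the given package. */
    public static long countClasses(Path jar, String packagePrefix) throws IOException {
        // JAR entries use '/' separators, e.g. com/amazonaws/services/s3/AmazonS3.class.
        // The trailing '/' prevents a match on sibling packages such as com/amazonawsx.
        String pathPrefix = packagePrefix.replace('.', '/') + "/";
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.stream()
                    .map(JarEntry::getName)
                    .filter(name -> name.startsWith(pathPrefix) && name.endsWith(".class"))
                    .count();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: JarPackageCounter <path-to-jar>");
            return;
        }
        Path jar = Paths.get(args[0]);
        // SDK V1 classes live under com.amazonaws, SDK V2 classes under software.amazon.awssdk.
        System.out.println("com.amazonaws classes: " + countClasses(jar, "com.amazonaws"));
        System.out.println("software.amazon.awssdk classes: "
                + countClasses(jar, "software.amazon.awssdk"));
    }
}
```

Run against the flink-s3-fs-hadoop JAR, the first count is expected to be 0; run against the flink-s3-fs-presto JAR, it is expected to be non-zero.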
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): yes (Hadoop 3.3.6 → 3.4.2, adds AWS SDK V2 to Hadoop module)
- The public API, i.e., is any changed class annotated with @Public(Evolving): no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: yes (Hadoop S3 filesystem now uses AWS SDK V2)
Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable (this is a dependency upgrade and internal refactoring)
CI report:
- d3d9fae147163eeb4fb6e655f435c650cb094311 Azure: FAILURE
Bot commands
The @flinkbot bot supports the following commands:
- @flinkbot run azure: re-run the last Azure build
I am not in a position to approve this as I do not know the area. The title says upgrade hadoop 3.4.x - I am not sure what backports you could do - I assume you would want to deprecate the existing hadoop version and add the new one.
This upgrade won't address the concerns from https://github.com/apache/flink/pull/23844#issuecomment-1871212810 - I don't think we can move forward with this one right now.
@MartijnVisser I was thinking about working through this by creating a FlinkS3AFileSystem that provides access to the s3Client. I'm not a huge fan of this (it's just a workaround). What do you think?