Flink 1.19: Run without Hadoop
Allow Flink to run without Hadoop
This PR aims to remove Hadoop's Configuration class from the main code path, so that Flink can also run without the Hadoop JARs on the Java classpath.
Python 3.9.19 (main, Mar 19 2024, 16:08:27)
[Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyflink.datastream import StreamExecutionEnvironment
>>> env = StreamExecutionEnvironment.get_execution_environment()
>>>
>>> env.add_jars("file:///Users/fokkodriesprong/Desktop/iceberg/flink/v1.19/flink-runtime/build/libs/iceberg-flink-runtime-1.19-1.6.0-SNAPSHOT.jar")
>>> env.add_jars("file:///Users/fokkodriesprong/Desktop/iceberg/aws-bundle/build/libs/iceberg-aws-bundle-1.6.0-SNAPSHOT.jar")
>>>
>>> from pyflink.table import StreamTableEnvironment
>>>
>>> table_env = StreamTableEnvironment.create(env)
>>>
>>> table_env.execute_sql("""
... CREATE CATALOG tabular WITH (
... 'type'='iceberg',
... 'catalog-type'='rest',
... 'uri'='https://api.tabular.io/ws/',
... 'credential'='abc',
... 'warehouse'='Fokko',
... 'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
... 'auth.default-refresh-enabled'='true'
... )
... """).print()
OK
>>>
>>>
>>> table_env.execute_sql("USE CATALOG tabular").print()
OK
>>> table_env.execute_sql("SELECT * FROM examples.nyc_taxi_locations").print()
+----+-------------+--------------------------------+--------------------------------+
| op | location_id | borough | zone_name |
+----+-------------+--------------------------------+--------------------------------+
| +I | 1 | EWR | Newark Airport |
| +I | 2 | Queens | Jamaica Bay |
| +I | 3 | Bronx | Allerton/Pelham Gardens |
...
| +I | 265 | Unknown | NA |
+----+-------------+--------------------------------+--------------------------------+
265 rows in set
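The catalog properties from the session above can also be kept as a plain Python mapping and rendered into the DDL string, which makes it easier to swap credentials or endpoints. This is just a sketch using the same property names as the example; the key line for running without Hadoop is `io-impl`, which points Iceberg at `S3FileIO` instead of the Hadoop-based file IO:

```python
# Catalog properties as used in the REPL session above (credential/warehouse
# values are the placeholders from the example, not real settings).
catalog_props = {
    "type": "iceberg",
    "catalog-type": "rest",
    "uri": "https://api.tabular.io/ws/",
    "credential": "abc",
    "warehouse": "Fokko",
    # S3FileIO talks to S3 directly, so the read path does not need a
    # Hadoop Configuration or the Hadoop JARs on the classpath.
    "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "auth.default-refresh-enabled": "true",
}

# Render the mapping into the WITH (...) clause of a CREATE CATALOG statement,
# equivalent to the SQL executed via table_env.execute_sql above.
with_clause = ",\n".join(f"  '{k}'='{v}'" for k, v in catalog_props.items())
ddl = f"CREATE CATALOG tabular WITH (\n{with_clause}\n)"
print(ddl)
```

Passing the rendered `ddl` to `table_env.execute_sql(ddl)` should behave the same as the inline string in the session above.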
Testing
Testing is still pending. This PR focuses on read operations. For write operations, upstream changes are needed in Parquet-MR, mainly around the ParquetWriter class: https://github.com/apache/parquet-mr/blob/4bf606905924896403d25cd6287399cfe7050ce9/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L25
Resolves #7332
Resolves #3117
@stevenzwu @pvary, do you have time to get some eyes on this one?
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
@Fokko I wonder why this was closed. I'm facing this issue when submitting jobs to AWS managed Flink.
Same here: for some reason Flink requires Hadoop, yet does not ship with it? We're using the Kubernetes Operator and would like to avoid customizing the image to work around this bug.
What was the last status of this PR, i.e. how much work is left before it can be reviewed?