Flink 1.19: Run without Hadoop
Allow Flink to run without Hadoop
This PR aims to remove Hadoop's Configuration class from the main code path, so that Flink can also run without the Hadoop JARs on the Java classpath.
Python 3.9.19 (main, Mar 19 2024, 16:08:27)
[Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyflink.datastream import StreamExecutionEnvironment
>>> env = StreamExecutionEnvironment.get_execution_environment()
>>>
>>> env.add_jars("file:///Users/fokkodriesprong/Desktop/iceberg/flink/v1.19/flink-runtime/build/libs/iceberg-flink-runtime-1.19-1.6.0-SNAPSHOT.jar")
>>> env.add_jars("file:///Users/fokkodriesprong/Desktop/iceberg/aws-bundle/build/libs/iceberg-aws-bundle-1.6.0-SNAPSHOT.jar")
>>>
>>> from pyflink.table import StreamTableEnvironment
>>>
>>> table_env = StreamTableEnvironment.create(env)
>>>
>>> table_env.execute_sql("""
... CREATE CATALOG tabular WITH (
... 'type'='iceberg',
... 'catalog-type'='rest',
... 'uri'='https://api.tabular.io/ws/',
... 'credential'='abc',
... 'warehouse'='Fokko',
... 'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
... 'auth.default-refresh-enabled'='true'
... )
... """).print()
OK
>>>
>>>
>>> table_env.execute_sql("USE CATALOG tabular").print()
OK
>>> table_env.execute_sql("SELECT * FROM examples.nyc_taxi_locations").print()
+----+-------------+--------------------------------+--------------------------------+
| op | location_id | borough | zone_name |
+----+-------------+--------------------------------+--------------------------------+
| +I | 1 | EWR | Newark Airport |
| +I | 2 | Queens | Jamaica Bay |
| +I | 3 | Bronx | Allerton/Pelham Gardens |
...
| +I | 265 | Unknown | NA |
+----+-------------+--------------------------------+--------------------------------+
265 rows in set
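The catalog properties from the session above can also be kept as a plain Python mapping and rendered into the DDL string, which makes it easier to swap credentials or endpoints. This is just a sketch using the same property names as the example; the key line for running without Hadoop is `io-impl`, which points Iceberg at `S3FileIO` instead of the Hadoop-based file IO:

```python
# Catalog properties as used in the REPL session above (credential/warehouse
# values are the placeholders from the example, not real settings).
catalog_props = {
    "type": "iceberg",
    "catalog-type": "rest",
    "uri": "https://api.tabular.io/ws/",
    "credential": "abc",
    "warehouse": "Fokko",
    # S3FileIO talks to S3 directly, so the read path does not need a
    # Hadoop Configuration or the Hadoop JARs on the classpath.
    "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "auth.default-refresh-enabled": "true",
}

# Render the mapping into the WITH (...) clause of a CREATE CATALOG statement,
# equivalent to the SQL executed via table_env.execute_sql above.
with_clause = ",\n".join(f"  '{k}'='{v}'" for k, v in catalog_props.items())
ddl = f"CREATE CATALOG tabular WITH (\n{with_clause}\n)"
print(ddl)
```

Passing the rendered `ddl` to `table_env.execute_sql(ddl)` should behave the same as the inline string in the session above.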
Testing
Testing is still pending. This PR focuses on read operations. For write operations, upstream changes are needed in Parquet-MR, mainly around the ParquetWriter class: https://github.com/apache/parquet-mr/blob/4bf606905924896403d25cd6287399cfe7050ce9/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L25
Resolves #7332
Resolves #3117
@stevenzwu @pvary, do you have time to get some eyes on this one?
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
@Fokko I wonder why this was closed. I'm facing this issue when submitting jobs to AWS managed Flink.
Same here: for some reason Flink requires Hadoop, yet does not ship with it? We're using the Kubernetes Operator and would like to avoid customizing the image to work around this bug.
What was the last status of this PR, i.e. how much work is left before it can be reviewed?