add a new IO named DataLakeIO (#23074)
We developed a new IO named DataLakeIO, which lets Beam read data from data lakes (Delta, Iceberg, Hudi) and write data to them.
Because Delta, Iceberg, and Hudi do not provide sufficient Java APIs for reading and writing, DataLakeIO uses the Spark DataSource API for both. As a result, the Spark dependencies are required.
BeamDeltaTest, BeamIcebergTest and BeamHudiTest show how to use the above features.
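For readers unfamiliar with the Spark DataSource API, the sketch below illustrates the general idea of selecting a table format by its Spark short name. It is not code from this PR: the class `DataLakeFormats` and the enum `TableFormat` are hypothetical names, and the commented Spark calls assume a `SparkSession` named `spark` plus the matching Delta, Iceberg, or Hudi connector on the classpath. How DataLakeIO wires this up internally may differ.

```java
import java.util.Locale;

// Hypothetical helper, not part of the PR: maps a table format to the
// short name passed to format(...) in Spark's DataSource API.
public class DataLakeFormats {

    public enum TableFormat {
        DELTA("delta"),
        ICEBERG("iceberg"),
        HUDI("hudi");

        private final String sparkFormat;

        TableFormat(String sparkFormat) {
            this.sparkFormat = sparkFormat;
        }

        public String sparkFormat() {
            return sparkFormat;
        }

        // Case-insensitive lookup; throws IllegalArgumentException for unknown names.
        public static TableFormat parse(String name) {
            return valueOf(name.trim().toUpperCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        for (TableFormat f : TableFormat.values()) {
            // With Spark and the connector available, a read would look roughly like:
            //   Dataset<Row> rows = spark.read().format(f.sparkFormat()).load(path);
            // and a write roughly like:
            //   rows.write().format(f.sparkFormat()).mode("append").save(path);
            System.out.println(f + " -> " + f.sparkFormat());
        }
    }
}
```

Keeping the format name as the only varying piece is what makes a single IO for all three table formats plausible, although each connector usually needs its own extra options.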
Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:
R: @kileys for label java. R: @Abacn for label build. R: @johnjcasey for label io.
Available commands:
- `stop reviewer notifications` - opt out of the automated review tooling
- `remind me after tests pass` - tag the comment author after tests pass
- `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
The PR bot will only process comments in the main thread (not review comments).
Codecov Report
Merging #23075 (ac21df5) into master (e3ba8d8) will increase coverage by 0.00%. The diff coverage is n/a.
@@ Coverage Diff @@
## master #23075 +/- ##
=======================================
Coverage 73.58% 73.58%
=======================================
Files 716 716
Lines 95301 95301
=======================================
+ Hits 70124 70125 +1
+ Misses 23881 23880 -1
Partials 1296 1296
| Flag | Coverage Δ | |
|---|---|---|
| python | 83.40% <ø> (+<0.01%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| sdks/python/apache_beam/utils/interactive_utils.py | 95.12% <0.00%> (-2.44%) | :arrow_down: |
| ...hon/apache_beam/runners/worker/bundle_processor.py | 93.30% <0.00%> (-0.25%) | :arrow_down: |
| sdks/go/pkg/beam/util/gcsx/gcs.go | 27.41% <0.00%> (ø) | |
| sdks/go/pkg/beam/artifact/stage.go | 61.87% <0.00%> (ø) | |
| sdks/go/pkg/beam/io/filesystem/util.go | 96.29% <0.00%> (ø) | |
| sdks/go/pkg/beam/io/filesystem/memfs/memory.go | 96.15% <0.00%> (ø) | |
| ...ks/python/apache_beam/runners/worker/sdk_worker.py | 89.09% <0.00%> (+0.15%) | :arrow_up: |
| sdks/python/apache_beam/runners/direct/executor.py | 97.01% <0.00%> (+0.54%) | :arrow_up: |
| .../python/apache_beam/transforms/periodicsequence.py | 100.00% <0.00%> (+1.61%) | :arrow_up: |
@kileys @Abacn @johnjcasey
Reminder, please take a look at this pr: @kileys @Abacn @johnjcasey
waiting on author
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:
R: @robertwb for label java. R: @damccorm for label build. R: @pabloem for label io.
First of all - thanks for your contribution!
Before proceeding to review from my side, I'd like to know if there is a design doc or similar for this IO connector? It would be very helpful to understand the goals and the implementation of this connector in advance.
Also, several notes worth mentioning:
- Please create a new GitHub issue for this feature.
- Please avoid merging the `master` branch into your feature branch. Use `git rebase` instead.
- Run `./gradlew :sdks:java:io:datalake:check` locally before pushing your changes to origin.

You can find a Beam contribution guide here: https://beam.apache.org/contribute/get-started-contributing/
Thank you for your reply! I will make my changes, and create a new github issue later.
Was there any progress on getting this IO into Beam?