add a new IO named DataLakeIO (#23074)

Open zhangt-nhlab opened this issue 3 years ago • 6 comments

We developed a new IO named DataLakeIO, which lets Beam read data from and write data to data lakes (Delta, Iceberg, Hudi).

Because Delta, Iceberg, and Hudi do not provide sufficient Java APIs for reading and writing, DataLakeIO uses the Spark DataSource API to read and write data. Therefore, Spark dependencies are required.

BeamDeltaTest, BeamIcebergTest and BeamHudiTest show how to use the above features.

zhangt-nhlab avatar Sep 08 '22 02:09 zhangt-nhlab

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kileys for label java. R: @Abacn for label build. R: @johnjcasey for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions[bot] avatar Sep 08 '22 02:09 github-actions[bot]

Codecov Report

Merging #23075 (ac21df5) into master (e3ba8d8) will increase coverage by 0.00%. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #23075   +/-   ##
=======================================
  Coverage   73.58%   73.58%           
=======================================
  Files         716      716           
  Lines       95301    95301           
=======================================
+ Hits        70124    70125    +1     
+ Misses      23881    23880    -1     
  Partials     1296     1296           
Flag Coverage Δ
python 83.40% <ø> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
sdks/python/apache_beam/utils/interactive_utils.py 95.12% <0.00%> (-2.44%) :arrow_down:
...hon/apache_beam/runners/worker/bundle_processor.py 93.30% <0.00%> (-0.25%) :arrow_down:
sdks/go/pkg/beam/util/gcsx/gcs.go 27.41% <0.00%> (ø)
sdks/go/pkg/beam/artifact/stage.go 61.87% <0.00%> (ø)
sdks/go/pkg/beam/io/filesystem/util.go 96.29% <0.00%> (ø)
sdks/go/pkg/beam/io/filesystem/memfs/memory.go 96.15% <0.00%> (ø)
...ks/python/apache_beam/runners/worker/sdk_worker.py 89.09% <0.00%> (+0.15%) :arrow_up:
sdks/python/apache_beam/runners/direct/executor.py 97.01% <0.00%> (+0.54%) :arrow_up:
.../python/apache_beam/transforms/periodicsequence.py 100.00% <0.00%> (+1.61%) :arrow_up:

codecov[bot] avatar Sep 08 '22 03:09 codecov[bot]

@kileys @Abacn @johnjcasey

zhangt-nhlab avatar Sep 08 '22 06:09 zhangt-nhlab

Reminder, please take a look at this PR: @kileys @Abacn @johnjcasey

github-actions[bot] avatar Sep 16 '22 12:09 github-actions[bot]

waiting on author

Abacn avatar Sep 16 '22 14:09 Abacn

Assigning a new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java. R: @damccorm for label build. R: @pabloem for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Sep 21 '22 12:09 github-actions[bot]

First of all - thanks for your contribution!

Before proceeding to review from my side, I'd like to know if there is a design doc or similar for this IO connector? It would be very helpful to understand the goals and the implementation of this connector in advance.

Also, a few notes worth mentioning:

  1. Please create a new GitHub issue for this feature.
  2. Please avoid merging the master branch into your feature branch. Use git rebase instead.
  3. Run ./gradlew :sdks:java:io:datalake:check locally before pushing your changes to origin.

You can find a Beam contribution guide here: https://beam.apache.org/contribute/get-started-contributing/
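
The rebase advice in note 2 can be illustrated with a throwaway repository. This is a minimal sketch and not part of the PR: the branch name `datalake-io`, the file names, and the commit messages are hypothetical, and a local `master` branch stands in for the upstream Beam repository.

```shell
#!/bin/sh
set -eu

# Throwaway repository standing in for a fork (hypothetical layout).
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.email "dev@example.com"
git config user.name "Example Dev"

echo base > base.txt
git add base.txt
git commit -q -m "base commit on master"

# Feature branch with a single commit (hypothetical name).
git checkout -q -b datalake-io
echo io > io.txt
git add io.txt
git commit -q -m "add DataLakeIO"

# Meanwhile master moves forward with an unrelated change.
git checkout -q master
echo more > more.txt
git add more.txt
git commit -q -m "unrelated change on master"

# Rebase the feature branch onto master instead of merging master into it.
git checkout -q datalake-io
git rebase -q master

# The branch now holds only its own commit, replayed on top of master,
# and contains no merge commits.
merge_commits=$(git rev-list --merges --count master..datalake-io)
feature_commits=$(git rev-list --count master..datalake-io)
echo "merge commits: $merge_commits, feature commits: $feature_commits"
```

In a real fork the equivalent would typically be `git fetch upstream` followed by `git rebase upstream/master`, where `upstream` is an assumed name for the remote pointing at the main Beam repository, and then the Gradle check from note 3 before pushing.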

zhangt-nhlab avatar Sep 26 '22 00:09 zhangt-nhlab

> First of all - thanks for your contribution! Before proceeding to review from my side, I'd like to know if there is a design doc or similar for this IO connector? It would be very helpful to understand the goals and the implementation of this connector in advance. Also, a few notes worth mentioning:
>
>   1. Please create a new GitHub issue for this feature.
>   2. Please avoid merging the master branch into your feature branch. Use git rebase instead.
>   3. Run ./gradlew :sdks:java:io:datalake:check locally before pushing your changes to origin.

You can find a Beam contribution guide here: https://beam.apache.org/contribute/get-started-contributing/

Thank you for your reply! I will make the changes and create a new GitHub issue later.

zhangt-nhlab avatar Sep 26 '22 01:09 zhangt-nhlab

Was there any progress on getting this IO into Beam?

aaltay avatar Apr 25 '24 21:04 aaltay