beam icon indicating copy to clipboard operation
beam copied to clipboard

BigQueryIO uniformize direct and export reads

Open RustedBones opened this issue 1 year ago • 13 comments
trafficstars

Refers to #26329, also fix #20100, #21076

When using readWithDatumReader and DIRECT_READ method, the transform would fail because the parseFn is expected. Refactor the IO so the avro datumReader can be use in both cases.

In some case, it is required to get the data with the desired schema. Currently, BQ io always uses the writer schema (or table schema). Create new APIs to set the reader schema.

This refactoring contains some breaking changes:

withFormat is not exposed anymore. Indeed, it is not possible to configure a TypedRead with a DatumReaderFactory and change the format later. Data format MUST be chosen when creating the transform.

In the TypedRead.Builder, replace the DatumReaderFactory with the BigQueryReaderFactory allowing to handle both avro and arrow in uniform fashion. This alters the BigQueryIOTranslation. I need some help on that point to handle that in a better way.

RustedBones avatar Aug 29 '24 13:08 RustedBones

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

github-actions[bot] avatar Aug 29 '24 14:08 github-actions[bot]

assign set of reviewers

RustedBones avatar Aug 30 '24 11:08 RustedBones

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java. R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions[bot] avatar Aug 30 '24 11:08 github-actions[bot]

Some BQ integration tests are failing.

I don't know schema & data of the following big_query_import_export.parallel_read_table_row_xxx tables so I can recreated the setup in a personal GCP project. Can someone give me a hand ?

RustedBones avatar Aug 30 '24 11:08 RustedBones

Reminder, please take a look at this pr: @robertwb @ahmedabu98

github-actions[bot] avatar Sep 11 '24 12:09 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damondouglas for label java. R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Sep 16 '24 12:09 github-actions[bot]

Thanks for the change! I agree with the direct and export reads should be similar, ideally make the migration seemlessly from one to another. Going forward, I would suggest we open a few smaller PRs incrementally, and I can take care of it for those does not involve breaking change. For those does cc @ahmedabu98 (who also assigned as reviewer) what is your thoughts? Probably could also open a thread in Beam devlist.

we can keep this PR open for reference.

Abacn avatar Sep 16 '24 13:09 Abacn

I understand. I've been pulling many strings to achieve the original goal. I can split this PR in several chunks

RustedBones avatar Sep 17 '24 11:09 RustedBones

Reminder, please take a look at this pr: @damondouglas @Abacn

github-actions[bot] avatar Sep 28 '24 12:09 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java. R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Oct 02 '24 12:10 github-actions[bot]

Reminder, please take a look at this pr: @kennknowles @ahmedabu98

github-actions[bot] avatar Oct 09 '24 12:10 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java. R: @johnjcasey for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Oct 14 '24 12:10 github-actions[bot]

waiting on author

ahmedabu98 avatar Oct 14 '24 12:10 ahmedabu98

Codecov Report

Attention: Patch coverage is 62.67606% with 106 lines in your changes missing coverage. Please review.

Project coverage is 44.23%. Comparing base (1ba33b8) to head (2b5b356). Report is 11 commits behind head on master.

Files with missing lines Patch % Lines
...rg/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java 52.17% 59 Missing and 7 partials :warning:
...eam/sdk/io/gcp/bigquery/BigQueryIOTranslation.java 20.58% 24 Missing and 3 partials :warning:
...eam/sdk/io/gcp/bigquery/BigQueryReaderFactory.java 91.11% 3 Missing and 1 partial :warning:
...dk/io/gcp/bigquery/BigQueryStorageArrowReader.java 69.23% 4 Missing :warning:
...sdk/io/gcp/bigquery/BigQueryStorageAvroReader.java 86.66% 1 Missing and 1 partial :warning:
...k/io/gcp/bigquery/BigQueryStorageStreamSource.java 85.71% 1 Missing and 1 partial :warning:
...ry/StorageApiDynamicDestinationsGenericRecord.java 0.00% 1 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #32360      +/-   ##
============================================
+ Coverage     34.67%   44.23%   +9.56%     
- Complexity        0     3121    +3121     
============================================
  Files           372      730     +358     
  Lines         76997   105748   +28751     
  Branches          0     3360    +3360     
============================================
+ Hits          26696    46775   +20079     
- Misses        48484    55627    +7143     
- Partials       1817     3346    +1529     
Flag Coverage Δ
java 69.83% <62.67%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 22 '24 08:10 codecov[bot]

@ahmedabu98 this has been cleaned + updated. It's ready for review

RustedBones avatar Oct 24 '24 07:10 RustedBones

Hey @RustedBones, I'll be away for the next week or so but I can take a look when I come back

CC @Abacn if you have bandwidth to take a look

ahmedabu98 avatar Oct 25 '24 20:10 ahmedabu98

Reminder, please take a look at this pr: @Abacn @johnjcasey

github-actions[bot] avatar Nov 02 '24 12:11 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java. R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Nov 06 '24 12:11 github-actions[bot]

Reminder, please take a look at this pr: @kennknowles @ahmedabu98

github-actions[bot] avatar Dec 04 '24 12:12 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java. R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Dec 09 '24 12:12 github-actions[bot]

I did some refactoring, reducing the breaking changes and allowing an easier transform upgrade

RustedBones avatar Dec 10 '24 13:12 RustedBones

Reminder, please take a look at this pr: @robertwb @chamikaramj

github-actions[bot] avatar Dec 24 '24 12:12 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java. R: @damondouglas for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Dec 27 '24 12:12 github-actions[bot]

Reminder, please take a look at this pr: @kennknowles @damondouglas

github-actions[bot] avatar Jan 04 '25 12:01 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java. R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Jan 08 '25 12:01 github-actions[bot]

waiting on author

Abacn avatar Jan 09 '25 18:01 Abacn

Looks like this is a evolving PR, please let me know when it's ready. Also, good to split independent fix out so it expedites review. Thank you!

Abacn avatar Jan 09 '25 18:01 Abacn

This is not evolving anymore. I rebased many times the PR to re-trigger the CI, in the hope to get all green checks in CI, without success. Seems most of the time the error is unrelated to the changes.

Will merge master in the future. It will be visible on your end that no changes are added.

RustedBones avatar Jan 10 '25 08:01 RustedBones

Reminder, please take a look at this pr: @robertwb @Abacn

github-actions[bot] avatar Jan 17 '25 12:01 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java. R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Jan 21 '25 12:01 github-actions[bot]