BigQueryIO: control StorageWrite parallelism in batch by reshuffling before the write, using the number of streams set on BigQueryIO.write() with .withNumStorageWriteApiStreams(numStorageWriteApiStreams)
- BigQueryIO.java - add documentation on how withNumStorageWriteApiStreams is supported in batch
- StorageApiLoads.java - implement a redistribute step in batch with numStorageWriteApiStreams shards
- BigQueryIOLT.java:
  - expose experiments, numStorageWriteApiStreams, storageWriteApiTriggeringFrequencySec in the test configuration
  - add gradle triggering example
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.
See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`
Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:
R: @Abacn for label java. R: @ahmedabu98 for label io.
Available commands:
- `stop reviewer notifications` - opt out of the automated review tooling
- `remind me after tests pass` - tag the comment author after tests pass
- `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
The PR bot will only process comments in the main thread (not review comments).
Why the change:
- BigQueryIO.write() using StorageWrite can control the number of streams in streaming.
- In batch, the number of connections is proportional to the parallelism of the job (which can vary based on the source). Users can hit the CreateWriteStream quota (10,000 streams every hour, per project per region); this shows up in monitoring as 4xx responses on google.cloud.bigquery.storage.v1.BigQueryWrite.CreateWriteStream. The quota depletion might not affect the job that consumed most of it, but it can affect the jobs that follow during the one-hour window.
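To make the quota arithmetic concrete, here is a rough sketch. It assumes, for illustration only, that each parallel writer creates exactly one write stream per run, so a run's stream count equals its write parallelism. The class name, method name, and the two example parallelism values are hypothetical; only the 10,000 per hour figure comes from the description above.

```java
public class StreamQuotaEstimate {
    // Quota figure quoted above: 10,000 stream creations per hour, per project per region.
    public static final long CREATE_WRITE_STREAM_QUOTA_PER_HOUR = 10_000L;

    /**
     * Number of whole job runs that fit into one quota window, assuming
     * (hypothetically) that a run creates streamsPerRun write streams.
     */
    public static long runsPerWindow(long streamsPerRun) {
        if (streamsPerRun <= 0) {
            throw new IllegalArgumentException("streamsPerRun must be positive");
        }
        return CREATE_WRITE_STREAM_QUOTA_PER_HOUR / streamsPerRun;
    }

    public static void main(String[] args) {
        long uncappedParallelism = 6_000; // hypothetical: parallelism driven by the source
        long cappedStreams = 512;         // hypothetical: value passed to withNumStorageWriteApiStreams

        System.out.println("runs per hour, uncapped: " + runsPerWindow(uncappedParallelism));
        System.out.println("runs per hour, capped:   " + runsPerWindow(cappedStreams));
    }
}
```

Under these assumptions a single uncapped run can use most of the hourly budget, which is why later jobs in the same window may be the ones that see the 4xx errors.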
The modified BigQueryIO injects a redistribute step in batch if withNumStorageWriteApiStreams > 0, which limits the number of CreateWriteStream calls made by the StorageApiWriteUnsharded step. This makes the same pipeline behave similarly in both streaming and batch.
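The effect of redistributing onto a fixed number of shards can be illustrated without Beam. The sketch below is a plain-Java simulation, not the code in StorageApiLoads.java: it assigns rows to buckets round-robin (chosen here for determinism; the real step may pick keys differently), and shows that the number of groups a writer would see is bounded by the bucket count regardless of input size. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RedistributeSketch {
    /**
     * Assigns each element to one of numBuckets shards, round-robin.
     * The result never has more than numBuckets groups, so a writer that
     * opens one stream per group opens at most numBuckets streams.
     */
    public static <T> Map<Integer, List<T>> intoBuckets(List<T> elements, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        Map<Integer, List<T>> buckets = new TreeMap<>();
        for (int i = 0; i < elements.size(); i++) {
            buckets.computeIfAbsent(i % numBuckets, k -> new ArrayList<>()).add(elements.get(i));
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        Map<Integer, List<Integer>> buckets = intoBuckets(rows, 4);
        System.out.println("groups seen by the writer: " + buckets.size());
    }
}
```

The trade-off is the one described in this PR: the extra regrouping costs a shuffle, in exchange for an upper bound on stream creation.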
PS: using StorageApiWriteSharded (made for streaming) in batch is an unsupported workaround that, in my tests, has even lower performance (and higher cost) because of the multiple shuffles it performs.
BigQueryIOLT has been modified to expose the parameters needed to test quota depletion. Setting withNumStorageWriteApiStreams very high will deplete the quota quickly on a large test.
The extra redistribute step comes at a cost in speed, but gives control over quota consumption.
- Line 7 is a test where the pipeline is modified to have a redistribute step by the user.
- Line 8 is a test where the pipeline is modified by the BQIO.write using withNumStorageWriteApiStreams = 4096
Reviewers are already assigned to this PR: @Abacn @ahmedabu98
FYI - this could also be done outside of the transform, correct? The user could simply call Redistribute before calling BigQueryIO.write
Run Java_GCP_IO_Direct PreCommit
Reminder, please take a look at this pr: @Abacn @ahmedabu98
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment `assign to next reviewer`:
R: @damondouglas for label java. R: @damondouglas for label io.
Reminder, please take a look at this pr: @damondouglas @damondouglas
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment `assign to next reviewer`:
R: @kennknowles for label java. R: @ahmedabu98 for label io.
Reminder, please take a look at this pr: @kennknowles @ahmedabu98
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment `assign to next reviewer`:
R: @Abacn for label java. R: @johnjcasey for label io.
Reminder, please take a look at this pr: @Abacn @johnjcasey
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment `assign to next reviewer`:
R: @damondouglas for label java. R: @damondouglas for label io.
Reminder, please take a look at this pr: @damondouglas @damondouglas
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment `assign to next reviewer`:
R: @Abacn for label java. R: @Abacn for label io.