BigQueryIO: control StorageWrite parallelism in batch by reshuffling before the write into the number of streams set with BigQueryIO.write().withNumStorageWriteApiStreams(numStorageWriteApiStreams)

Open razvanculea opened this issue 1 year ago • 3 comments

  • BigQueryIO.java - document how withNumStorageWriteApiStreams is supported in batch
  • StorageApiLoads.java - add a redistribute step in batch with numStorageWriteApiStreams shards
  • BigQueryIOLT.java:
    • expose experiments, numStorageWriteApiStreams, and storageWriteApiTriggeringFrequencySec in the test configuration
    • add an example Gradle command for triggering the test
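
For illustration, a minimal sketch of how a batch pipeline might opt in, assuming the behavior proposed in this PR. The class name, method name, and the table spec passed by the caller are hypothetical; writeTableRows(), withMethod(), and withNumStorageWriteApiStreams() are existing BigQueryIO.Write methods. Whether the stream count is honored in batch depends on this change being applied.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

public class BatchStorageWriteSketch {
  /**
   * Builds a Storage Write API sink that asks for a bounded number of streams.
   * tableSpec and numStreams are caller-supplied, hypothetical values.
   */
  static BigQueryIO.Write<TableRow> boundedStreamsWrite(String tableSpec, int numStreams) {
    return BigQueryIO.writeTableRows()
        .to(tableSpec)
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        // Under this PR's proposal, a value > 0 also triggers a redistribute step in batch.
        .withNumStorageWriteApiStreams(numStreams)
        // Assumes the destination table already exists, so no schema is needed here.
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);
  }
}
```

The returned transform is applied to a PCollection of TableRow like any other BigQueryIO write.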

razvanculea avatar Oct 16 '24 13:10 razvanculea

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

github-actions[bot] avatar Oct 16 '24 14:10 github-actions[bot]

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java. R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions[bot] avatar Oct 16 '24 21:10 github-actions[bot]

Why the change:

  • BigQueryIO.write with the Storage Write API can already control the number of streams in streaming.
  • In batch, the number of connections is proportional to the parallelism of the job, which can vary with the source. Users can hit the CreateWriteStream quota (10,000 streams per hour, per project per region); the failures are visible in monitoring as 4xx responses for google.cloud.bigquery.storage.v1.BigQueryWrite.CreateWriteStream. The job that consumed most of the quota may not be affected itself, but jobs that run later in the same 1-hour window can be.
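
To make the quota pressure concrete, here is a rough arithmetic sketch. It assumes, as a simplification, that each job run creates a fixed number of write streams; the class and method names and the sample stream counts are made up for illustration, and only the 10,000 per hour figure comes from the description above.

```java
public class WriteStreamQuotaSketch {
  // Quota figure quoted above: 10,000 CreateWriteStream calls per hour, per project per region.
  static final long STREAMS_PER_HOUR_QUOTA = 10_000L;

  /**
   * Whole job runs that fit into one hourly quota window, under the simplifying
   * assumption that each run creates exactly streamsPerRun write streams.
   */
  static long runsPerQuotaWindow(long streamsPerRun) {
    if (streamsPerRun <= 0) {
      throw new IllegalArgumentException("streamsPerRun must be positive");
    }
    return STREAMS_PER_HOUR_QUOTA / streamsPerRun;
  }

  public static void main(String[] args) {
    // Hypothetical stream counts per run, chosen only for illustration.
    for (long streams : new long[] {100, 4_096, 12_000}) {
      System.out.printf(
          "%d streams per run -> %d runs per hour%n", streams, runsPerQuotaWindow(streams));
    }
  }
}
```

Under this model, a run that creates more than 10,000 streams exhausts the window on its own, which matches the observation that later jobs are the ones that fail.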

The modified BigQueryIO injects a redistribute step in batch if withNumStorageWriteApiStreams > 0, which limits the number of CreateWriteStream calls made by the StorageApiWriteUnsharded step. This makes the same pipeline behave similarly in both streaming and batch.
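
The effect of the redistribute step can be modeled in plain Java. This is only a toy model, not the Beam implementation: round-robin assignment is an assumption standing in for however Redistribute picks a bucket, each bucket stands for one writer that would open a stream, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class RedistributeModel {
  /** Splits elements round-robin into at most numBuckets non-empty buckets. */
  static <T> List<List<T>> intoBuckets(List<T> elements, int numBuckets) {
    if (numBuckets <= 0) {
      throw new IllegalArgumentException("numBuckets must be positive");
    }
    // Never create more buckets than there are elements, so no bucket is empty.
    int bucketCount = Math.min(numBuckets, elements.size());
    List<List<T>> buckets = new ArrayList<>();
    for (int i = 0; i < bucketCount; i++) {
      buckets.add(new ArrayList<>());
    }
    for (int i = 0; i < elements.size(); i++) {
      buckets.get(i % numBuckets).add(elements.get(i));
    }
    return buckets;
  }

  public static void main(String[] args) {
    List<Integer> rows = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
      rows.add(i);
    }
    // However parallel the upstream source was, at most 4 writers remain after this step.
    System.out.println(intoBuckets(rows, 4).size());
  }
}
```

The point of the model is that the number of writers after the step is bounded by the configured bucket count rather than by the parallelism of the source.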

PS: using StorageApiWriteSharded (built for streaming) in batch is an unsupported workaround; in my tests it had even lower performance (and higher cost) because of the multiple shuffles it performs.

BigQueryIOLT has been modified to expose the parameters needed to test quota depletion. Setting withNumStorageWriteApiStreams very high will deplete the quota quickly on a large test.

The extra redistribute step costs some speed but gives control over quota consumption.

  • Line 7 is a test where the user adds a redistribute step to the pipeline.
  • Line 8 is a test where BigQueryIO.write adds the redistribute step, using withNumStorageWriteApiStreams = 4096.

[Screenshot of the test results, 2024-10-17 09:47:10]

razvanculea avatar Oct 17 '24 08:10 razvanculea

Reviewers are already assigned to this PR: @Abacn @ahmedabu98

github-actions[bot] avatar Oct 21 '24 13:10 github-actions[bot]

FYI - this could also be done outside of the transform, correct? The user could simply call Redistribute before calling BigQueryIO.write
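
A minimal sketch of that user-side alternative, assuming the Beam Java SDK's Redistribute transform with withNumBuckets. The class name, method names, and parameters are hypothetical, and the destination table is assumed to exist already.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Redistribute;
import org.apache.beam.sdk.values.PCollection;

public class RedistributeBeforeWrite {
  /** Redistribute step that spreads rows over at most numBuckets buckets. */
  static PTransform<PCollection<TableRow>, PCollection<TableRow>> limiter(int numBuckets) {
    return Redistribute.<TableRow>arbitrarily().withNumBuckets(numBuckets);
  }

  /** Applies the redistribute step in user code, then writes with the Storage Write API. */
  static WriteResult write(PCollection<TableRow> rows, String tableSpec, int numBuckets) {
    return rows
        .apply("LimitWriteParallelism", limiter(numBuckets))
        .apply(
            "WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to(tableSpec)
                .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}
```

This keeps BigQueryIO unchanged, at the cost of every pipeline author having to add the step themselves.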

reuvenlax avatar Oct 28 '24 17:10 reuvenlax

Run Java_GCP_IO_Direct PreCommit

stankiewicz avatar Oct 28 '24 19:10 stankiewicz

Run Java_GCP_IO_Direct PreCommit

razvanculea avatar Oct 29 '24 08:10 razvanculea

Reminder, please take a look at this PR: @Abacn @ahmedabu98

github-actions[bot] avatar Nov 05 '24 12:11 github-actions[bot]

Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damondouglas for label java. R: @damondouglas for label io.

github-actions[bot] avatar Nov 07 '24 12:11 github-actions[bot]

Reminder, please take a look at this PR: @damondouglas @damondouglas

github-actions[bot] avatar Nov 19 '24 12:11 github-actions[bot]

Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java. R: @ahmedabu98 for label io.

github-actions[bot] avatar Nov 22 '24 12:11 github-actions[bot]

Reminder, please take a look at this PR: @kennknowles @ahmedabu98

github-actions[bot] avatar Nov 30 '24 12:11 github-actions[bot]

Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java. R: @johnjcasey for label io.

github-actions[bot] avatar Dec 04 '24 12:12 github-actions[bot]

Reminder, please take a look at this PR: @Abacn @johnjcasey

github-actions[bot] avatar Dec 12 '24 12:12 github-actions[bot]

Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damondouglas for label java. R: @damondouglas for label io.

github-actions[bot] avatar Dec 16 '24 12:12 github-actions[bot]

Reminder, please take a look at this PR: @damondouglas @damondouglas

github-actions[bot] avatar Dec 24 '24 12:12 github-actions[bot]

Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java. R: @Abacn for label io.

github-actions[bot] avatar Dec 27 '24 12:12 github-actions[bot]