Estimate req size in BQ streaming when auto sharding is true

Open quentin-sommer opened this issue 6 months ago • 2 comments

  • fixes #27363
  • fixes #34270

I came from #34270 and found this Google docs page:

SSLEOFError (EOF occurred in violation of protocol) Python This error returns instead of a 413 (ENTITY_TOO_LARGE) HTTP error. Reduce the size of the request.

The error is counted as a ConnectionError by the gcp client and automatically retried forever.

Checking how Beam handles BigQuery streaming inserts, I realized there is a mechanism that estimates request size, but it doesn't run when auto sharding is on. This PR adds the same behaviour to both contexts, so a batch may be split into smaller batches when its estimated size is too big.
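As a rough illustration of the idea (not the code in this PR), splitting a batch by estimated size could look like the sketch below. The names `estimate_row_size`, `split_by_estimated_size` and `MAX_BATCH_BYTES`, the JSON-length estimate, and the limit value are all assumptions made for the example and are not taken from Beam.

```python
import json

# Hypothetical limit chosen for illustration; not the value Beam uses.
MAX_BATCH_BYTES = 1_000


def estimate_row_size(row):
    """Approximate a row's payload size as the byte length of its JSON encoding."""
    return len(json.dumps(row).encode("utf-8"))


def split_by_estimated_size(rows, max_bytes=MAX_BATCH_BYTES):
    """Yield sub-batches whose estimated total size does not exceed max_bytes.

    A single row that is larger than max_bytes on its own is emitted as a
    one-row batch rather than being dropped.
    """
    batch, batch_bytes = [], 0
    for row in rows:
        row_bytes = estimate_row_size(row)
        # Flush the current batch before it would grow past the limit.
        if batch and batch_bytes + row_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(row)
        batch_bytes += row_bytes
    if batch:
        yield batch
```

Running the same splitting step whether or not auto sharding is enabled is what keeps an oversized request from reaching the client and hitting the retry loop described above.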


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • [x] Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • [ ] Update CHANGES.md with noteworthy changes.
  • [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

quentin-sommer avatar Jun 09 '25 15:06 quentin-sommer

@ahmedabu98 @liferoad pinging you both as you were active on the two source issues. Let me know if you know someone better suited to review this.

quentin-sommer avatar Jun 11 '25 01:06 quentin-sommer

Assigning reviewers:

R: @damccorm for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions[bot] avatar Jun 11 '25 02:06 github-actions[bot]

Reminder, please take a look at this pr: @damccorm

github-actions[bot] avatar Jun 19 '25 12:06 github-actions[bot]

R: @ahmedabu98

damccorm avatar Jun 20 '25 12:06 damccorm

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

github-actions[bot] avatar Jun 20 '25 12:06 github-actions[bot]

Please fix the lint error:

************* Module apache_beam.io.gcp.bigquery_test
apache_beam/io/gcp/bigquery_test.py:2045:0: C0301: Line too long (94/80) (line-too-long)
apache_beam/io/gcp/bigquery_test.py:2070:0: C0301: Line too long (85/80) (line-too-long)

liferoad avatar Jun 21 '25 19:06 liferoad

When I run

pytest apache_beam/io/gcp/bigquery_test.py

I can't reproduce the error shown at line 9576 of the action log. Do you have a suggestion?

_ PipelineBasedStreamingInsertTest.test_failure_in_some_rows_does_not_duplicate_1 _

The test is apache_beam/io/gcp/bigquery_test.py::PipelineBasedStreamingInsertTest::test_failure_in_some_rows_does_not_duplicate uses

quentin-sommer avatar Jun 23 '25 02:06 quentin-sommer

This could be flaky or could depend on the specific Python version. I retriggered the failed one.

liferoad avatar Jun 23 '25 14:06 liferoad