Estimate req size in BQ streaming when auto sharding is true
- fixes #27363
- fixes #34270
I came from #34270 and found this Google documentation page, which says the following about `SSLEOFError (EOF occurred in violation of protocol)` in Python:
> This error returns instead of a 413 (ENTITY_TOO_LARGE) HTTP error. Reduce the size of the request.

The GCP client treats this error as a ConnectionError and retries it automatically, forever.
While checking how Beam handles BigQuery streaming inserts, I realized there is already a mechanism that estimates request size, but it doesn't run when auto sharding is on. This PR adds the same behaviour to both contexts, so a batch may be split into smaller batches when its estimated size is too large.
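The splitting idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `estimate_row_size`, `split_by_estimated_size`, and the `MAX_BATCH_BYTES` value are hypothetical names and numbers chosen for the example, and the JSON-length estimate is an assumption about how a row's size might be approximated.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical limit, for illustration only; not Beam's actual setting.
MAX_BATCH_BYTES = 9 * 1024 * 1024


def estimate_row_size(row: Dict[str, Any]) -> int:
    """Approximate a row's payload size as the length of its JSON encoding."""
    return len(json.dumps(row, default=str).encode("utf-8"))


def split_by_estimated_size(
    rows: Iterable[Dict[str, Any]],
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield sub-batches whose estimated total size stays within max_bytes.

    A single row larger than max_bytes is still emitted, alone in its own
    batch, so no row is silently dropped.
    """
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for row in rows:
        row_bytes = estimate_row_size(row)
        # Flush the current batch before it would exceed the limit.
        if batch and batch_bytes + row_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(row)
        batch_bytes += row_bytes
    if batch:
        yield batch
```

Applying the same function both with and without auto sharding would keep each insert request below the size at which the service is reported to answer with the SSL error instead of a 413.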
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- [x] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.
See the Contributor Guide for more tips on how to make review process smoother.
@ahmedabu98 @liferoad pinging you both as you were active on the two source issues. Let me know if you know someone better suited to review this.
Assigning reviewers:
R: @damccorm for label python.
Note: If you would like to opt out of this review, comment assign to next reviewer.
Available commands:
- `stop reviewer notifications` - opt out of the automated review tooling
- `remind me after tests pass` - tag the comment author after tests pass
- `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
The PR bot will only process comments in the main thread (not review comments).
Reminder, please take a look at this pr: @damccorm
R: @ahmedabu98
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers
Please fix the lint error:
************* Module apache_beam.io.gcp.bigquery_test
apache_beam/io/gcp/bigquery_test.py:2045:0: C0301: Line too long (94/80) (line-too-long)
apache_beam/io/gcp/bigquery_test.py:2070:0: C0301: Line too long (85/80) (line-too-long)
When I run
pytest apache_beam/io/gcp/bigquery_test.py
I can't reproduce the error shown at line 9576 of the action log. Do you have a suggestion?
_ PipelineBasedStreamingInsertTest.test_failure_in_some_rows_does_not_duplicate_1 _
The test is apache_beam/io/gcp/bigquery_test.py::PipelineBasedStreamingInsertTest::test_failure_in_some_rows_does_not_duplicate uses
It could be flaky, or it could depend on the specific Python version. I retriggered the failed one.