
[Task]: Client-side throttling for BigQueryIO DIRECT_READ mode

Abacn opened this issue on Mar 15 '24

What needs to happen?

There are reports of BigQueryIO DIRECT_READ running out of quota in large Dataflow batch pipelines. Basically, there are both per-project and per-region quotas on the number of active read streams. When this quota is exhausted, the BigQuery backend starts to revoke older streams that may still be active, causing reads to fail.

On the other hand, Dataflow is not aware that the quota is exhausted; it keeps splitting streams and possibly adds workers because progress is slow, putting even more stress on the quota.

The task is to design and implement a mechanism to mitigate this issue in the short term, as a pilot implementation of the generic client-side throttling approach suggested in #24743.
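A minimal sketch of the general idea (not the actual implementation; the helper class, metric name, and backoff parameters are placeholders): retry the read-stream call when the quota error surfaces as RESOURCE_EXHAUSTED, and report the time spent waiting as a metric so the runner can tell "throttled" apart from "slow" when deciding whether to upscale.

```java
import java.util.concurrent.Callable;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.util.BackOff;
import org.apache.beam.sdk.util.BackOffUtils;
import org.apache.beam.sdk.util.FluentBackoff;
import org.apache.beam.sdk.util.Sleeper;
import org.joda.time.Duration;

/** Hypothetical helper: retry read-stream calls on quota errors and report throttled time. */
class ThrottlingReadHelper {

  // Placeholder metric name; the point is that the runner can observe throttled time
  // instead of interpreting a quota-starved pipeline as merely making slow progress.
  private static final Counter THROTTLING_MSECS =
      Metrics.counter(ThrottlingReadHelper.class, "throttling-msecs");

  /** Runs {@code call}, backing off on RESOURCE_EXHAUSTED and counting time spent waiting. */
  static <T> T callWithThrottling(Callable<T> call) throws Exception {
    BackOff backoff =
        FluentBackoff.DEFAULT
            .withInitialBackoff(Duration.standardSeconds(1))
            .withMaxBackoff(Duration.standardMinutes(1))
            .withMaxRetries(10)
            .backoff();
    while (true) {
      try {
        return call.call();
      } catch (StatusRuntimeException e) {
        if (e.getStatus().getCode() != Status.Code.RESOURCE_EXHAUSTED) {
          throw e; // not a quota error: propagate
        }
        long start = System.currentTimeMillis();
        if (!BackOffUtils.next(Sleeper.DEFAULT, backoff)) {
          throw e; // retries exhausted
        }
        THROTTLING_MSECS.inc(System.currentTimeMillis() - start);
      }
    }
  }
}
```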

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

  • [ ] Component: Python SDK
  • [ ] Component: Java SDK
  • [ ] Component: Go SDK
  • [ ] Component: Typescript SDK
  • [ ] Component: IO connector
  • [ ] Component: Beam YAML
  • [ ] Component: Beam examples
  • [ ] Component: Beam playground
  • [ ] Component: Beam katas
  • [ ] Component: Website
  • [ ] Component: Spark Runner
  • [ ] Component: Flink Runner
  • [ ] Component: Samza Runner
  • [ ] Component: Twister2 Runner
  • [ ] Component: Hazelcast Jet Runner
  • [ ] Component: Google Cloud Dataflow Runner

Abacn · Mar 15 '24

This is a known issue, and there was another approach that tried to address it before, namely #24260. The current status of that approach is a pipeline option --enableStorageReadApiV2 to enroll in BigQuery Storage Read API v2, where read streams are the units of work instead of units of parallelism, and streams, once created, are not split further. That approach still needs stress testing to see whether it mitigates the issue and to what extent.

This task tries to resolve the issue with client-side throttling, as an alternative to the aforementioned approach.
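For reference, a hedged sketch of how a pipeline would opt into that flag; the table, transform chain, and runner flags are placeholders, and the --enableStorageReadApiV2 option is only recognized when the GCP IO module that registers it is on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadApiV2FlagExample {
  public static void main(String[] args) {
    // Run with e.g.:
    //   --runner=DataflowRunner --project=... --region=... --enableStorageReadApiV2=true
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    // Placeholder DIRECT_READ source; the flag above switches it to Read API v2 streams.
    pipeline.apply(
        BigQueryIO.readTableRows()
            .from("my-project:my_dataset.my_table")
            .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ));

    pipeline.run().waitUntilFinish();
  }
}
```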

Abacn · Mar 15 '24

After #31096, client-side throttling now works with Storage Read API v2 streams (#28778) combined with the Dataflow legacy runner. There are still many caveats.

For the default Read API v1 stream, it appears that an API call waiting on retry does not temporarily release the concurrent-stream quota, so a hasNext call can block for a very long time until the metrics get reported back to the work item thread. The pipeline does not upscale, but it is stuck indefinitely (probably until retries are exhausted).

Update: the API v1 stream issue is due to an (effective) deadlock between two synchronized blocks at

https://github.com/apache/beam/blob/673da546c1465c931fdbbc5769e7d566ff55b4d8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryStorageStreamSource.java#L238

https://github.com/apache/beam/blob/673da546c1465c931fdbbc5769e7d566ff55b4d8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryStorageStreamSource.java#L376

Both call hasNext. If splitAtFraction enters its synchronized block first, the work item thread has to wait until splitAtFraction exits the synchronized block; the symptom is then

Operation ongoing in step BigQueryIO.TypedRead/Read(BigQueryStorageTableSource) for at least 05m00s without outputting or completing in state process in thread pool-3-thread-2 with id 28
  at org.apache.beam.sdk.io.gcp.bigquery.BigQueryStorageStreamSource$BigQueryStorageStreamReader.advance(BigQueryStorageStreamSource.java:232)

If readNextRecord gets called first, splitAtFraction will take a very long time (probably causing issues then as well?).
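A simplified model of the contention pattern (not the actual Beam code; class, method, and lock names are illustrative): both code paths synchronize on the same reader, and whichever thread enters first can hold the lock across a long-blocking hasNext().

```java
class StreamReaderModel {
  private final Object lock = new Object(); // stands in for the reader's shared monitor

  // Called from the work item thread (maps to advance()/readNextRecord()).
  boolean readNextRecord() throws InterruptedException {
    synchronized (lock) {
      return blockingHasNext(); // may block on quota retries while holding the lock
    }
  }

  // Called from the worker status update thread (maps to splitAtFraction()).
  boolean splitAtFraction(double fraction) throws InterruptedException {
    synchronized (lock) {
      // If this thread wins the race, the work item thread above is stuck waiting,
      // producing the "Operation ongoing ... without outputting" symptom shown earlier.
      return blockingHasNext();
    }
  }

  private boolean blockingHasNext() throws InterruptedException {
    Thread.sleep(60_000); // placeholder for a hasNext() call waiting on stream quota
    return true;
  }
}
```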

Finally, metrics are not supported in the thread calling splitAtFraction (the worker status update thread), so the reportingPendingMetrics call there should be removed.

=====

It also does not seem to be effective on Dataflow Runner v2.

Abacn · Apr 26 '24

A Dataflow runner-side issue was identified and resolved. Closing this as done.

Abacn · Jun 11 '24