spark-bigquery-connector
BigQuery changes not reflected in the streaming pipeline
I have a Spark Structured Streaming pipeline that triggers from a Kafka queue and enriches these events with a lookup against an underlying BigQuery table (bqt). This table in BigQuery is also being updated, via a Dataflow job.
Let's say that when I deploy the above pipeline at time t, there are 100 rows in bqt. At time t', the number of rows in bqt has increased to 120, but my streaming pipeline still sees 100 rows. Once I redeploy it after time t', it starts seeing 120 rows.
Is this behavior expected?
(The whole point of using BigQuery over files was reading the data dynamically.)
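For context, a minimal sketch of the pattern described above (all table, topic, and column names here are illustrative, not taken from the actual pipeline): the BigQuery lookup table is read once as a static DataFrame and then joined to the Kafka stream.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EnrichmentPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-bq-enrichment")
      .getOrCreate()

    // Static read: this DataFrame is bound to a single BigQuery Storage
    // ReadSession, i.e. a snapshot of the table at the time it is created.
    val lookup = spark.read
      .format("bigquery")
      .option("table", "my_project.my_dataset.bqt") // hypothetical table id
      .load()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .load()

    // Rows added to bqt after `lookup` was created are not visible here.
    val enriched = events
      .selectExpr("CAST(value AS STRING) AS event_key")
      .join(lookup, col("event_key") === lookup("key"), "left")

    enriched.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```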
Yes, the behavior is expected. When a DataFrame is created, it is tied to a BigQuery Storage ReadSession, which reads from a snapshot of the table taken when the session is created.
How often do you need the data to be refreshed? After each period, do you care for the entire table or just for the added rows?
> When a DataFrame is created, it is tied to a BigQuery Storage ReadSession, which reads from a snapshot of the table taken when the session is created.

How can I reload this snapshot of the DataFrame at runtime?
> How often do you need the data to be refreshed? After each period, do you care for the entire table or just for the added rows?
At every trigger, I expect to read the full table (including the newly added delta).
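One way to approach that (a sketch, not a built-in connector feature; names are illustrative): re-create the BigQuery DataFrame inside foreachBatch, so each micro-batch opens a new ReadSession and therefore sees a fresh snapshot of the table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

def startEnrichment(spark: SparkSession, events: DataFrame): Unit = {
  events.writeStream
    .foreachBatch { (batch: DataFrame, batchId: Long) =>
      // Re-creating the DataFrame here opens a new ReadSession per trigger,
      // so this lookup sees rows added to bqt since the previous micro-batch.
      val lookup = spark.read
        .format("bigquery")
        .option("table", "my_project.my_dataset.bqt") // hypothetical table id
        .load()

      // "key" is a hypothetical join column; replace show() with your sink.
      batch.join(lookup, Seq("key"), "left").show()
    }
    .start()
    .awaitTermination()
}
```

Note that this re-reads the whole table on every trigger, which can be slow and costly for a large table; if that becomes a problem, common alternatives are a cached lookup refreshed on a timer, or a query that pulls only the rows added since the last batch.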