spark-bigquery-connector
BigQuery changes not reflected in the streaming pipeline
I have a Spark Structured Streaming pipeline that triggers from a Kafka queue and enriches these events with a lookup against an underlying BigQuery table (bqt). This table in BigQuery is also being updated, via a Dataflow job.
Let's say that when I deploy the above pipeline at time t, there are 100 rows in bqt. At time t', the number of rows in bqt has increased to 120, but my streaming pipeline still sees 100 rows. Once I redeploy it after time t', it starts seeing 120 rows.
Is this behavior expected?
(The whole point of using BigQuery over files was reading the data dynamically.)
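For context, a minimal sketch of the pattern described above (all table, topic, and column names here are illustrative, not taken from the actual pipeline): the BigQuery lookup table is read once as a static DataFrame and then joined to the Kafka stream.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EnrichmentPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-bq-enrichment")
      .getOrCreate()

    // Static read: this DataFrame is bound to a single BigQuery Storage
    // ReadSession, i.e. a snapshot of the table at the time it is created.
    val lookup = spark.read
      .format("bigquery")
      .option("table", "my_project.my_dataset.bqt") // hypothetical table id
      .load()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .load()

    // Rows added to bqt after `lookup` was created are not visible here.
    val enriched = events
      .selectExpr("CAST(value AS STRING) AS event_key")
      .join(lookup, col("event_key") === lookup("key"), "left")

    enriched.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```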
Yes, the behavior is expected. When a DataFrame is created, it is tied to a BigQuery Storage ReadSession, which reads from a snapshot of the table taken when the session is created.
How often do you need the data to be refreshed? After each period, do you care for the entire table or just for the added rows?
> When a DataFrame is created, it is tied to a BigQuery Storage ReadSession, which reads from a snapshot of the table taken when the session is created.

How can I reload this snapshot of the DataFrame at runtime?
> How often do you need the data to be refreshed? After each period, do you care for the entire table or just for the added rows?
At every trigger, I expect to read the full table (including the newly added delta).
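One way to approach that (a sketch, not a built-in connector feature; names are illustrative): re-create the BigQuery DataFrame inside foreachBatch, so each micro-batch opens a new ReadSession and therefore sees a fresh snapshot of the table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

def startEnrichment(spark: SparkSession, events: DataFrame): Unit = {
  events.writeStream
    .foreachBatch { (batch: DataFrame, batchId: Long) =>
      // Re-creating the DataFrame here opens a new ReadSession per trigger,
      // so this lookup sees rows added to bqt since the previous micro-batch.
      val lookup = spark.read
        .format("bigquery")
        .option("table", "my_project.my_dataset.bqt") // hypothetical table id
        .load()

      // "key" is a hypothetical join column; replace show() with your sink.
      batch.join(lookup, Seq("key"), "left").show()
    }
    .start()
    .awaitTermination()
}
```

Note that this re-reads the whole table on every trigger, which can be slow and costly for a large table; if that becomes a problem, common alternatives are a cached lookup refreshed on a timer, or a query that pulls only the rows added since the last batch.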