
[Bug]: BigQueryBatchFileLoads in python loses data when using WRITE_TRUNCATE

Open steveniemitz opened this issue 3 years ago • 0 comments

What happened?

When using WRITE_TRUNCATE, if the data being loaded is split into multiple temp tables, each of the copy jobs that merge the results into the destination table can run with WRITE_TRUNCATE, so only the last copy job "wins" and overwrites the output of all the others.
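The failure mode can be illustrated without Beam or BigQuery: if every copy job runs with WRITE_TRUNCATE, each one replaces the destination's contents, so only the last job's rows survive. A minimal Python sketch of the semantics (the function and its names are illustrative, not part of the Beam or BigQuery APIs):

```python
def run_copy_jobs(temp_tables, truncate_all):
    """Simulate BigQuery copy jobs merging temp tables into one destination.

    truncate_all=True models the bug: every copy job uses WRITE_TRUNCATE,
    so each copy wipes the rows written by the previous one. The intended
    behaviour is to truncate only on the first copy and append afterwards.
    """
    destination = ["stale row"]  # pre-existing data WRITE_TRUNCATE should clear
    for i, rows in enumerate(temp_tables):
        if truncate_all or i == 0:
            destination = []  # WRITE_TRUNCATE: replace destination contents
        destination.extend(rows)  # copy this temp table's rows in
    return destination

temp_tables = [["a", "b"], ["c"], ["d", "e"]]

# Buggy behaviour: only the last temp table's rows survive.
print(run_copy_jobs(temp_tables, truncate_all=True))   # ['d', 'e']

# Intended behaviour: truncate once, then append the rest.
print(run_copy_jobs(temp_tables, truncate_all=False))  # ['a', 'b', 'c', 'd', 'e']
```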

It looks like there was an attempt to handle this here https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L545, but that code assumes all inputs to TriggerCopyJobs arrive in the same bundle. From what I observed, this is not the case: the log lines from that step show different "work" fields for copy jobs 1 and 2.

There probably needs to be a GroupByKey before this step to ensure that all copy jobs are actually executed in the same bundle?

Issue Priority

Priority: 1

Issue Component

Component: io-py-gcp

steveniemitz · Sep 20 '22 14:09