[Bug]: BigQueryBatchFileLoads in python loses data when using WRITE_TRUNCATE
What happened?
When using WRITE_TRUNCATE, if the data being loaded is split across multiple temp tables, the copy jobs that merge the results into the destination table can each run with WRITE_TRUNCATE. As a result only the last copy job "wins" and overwrites the output of all the other copy jobs, so only part of the data ends up in the destination table.
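Rough repro sketch (the table spec, schema, and row count are placeholders; the small `max_file_size` / `max_files_per_bundle` values are only there to force the sink to split the load across multiple temp tables and hit the multi-copy-job path):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    _ = (
        p
        | 'CreateRows' >> beam.Create(
            [{'id': i, 'payload': 'x' * 1024} for i in range(100_000)])
        | 'WriteToBQ' >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.my_table',  # placeholder destination
            schema='id:INTEGER,payload:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            # Keep these small so more than one temp table / copy job is used.
            max_file_size=1 << 20,
            max_files_per_bundle=1))
```

After the pipeline finishes, the destination table only contains the rows copied by the last copy job instead of the full input.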
It looks like there was an attempt to handle this here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L545, but that code assumes all inputs to TriggerCopyJobs arrive in the same bundle. From what I observed this is not the case: the log lines from that step show different "work" fields for copy job 1 and copy job 2, i.e. the copy jobs were processed in separate bundles.
There probably needs to be a GBK (keyed by destination) before this step in order to make sure that all copy jobs for a destination are actually executed in the same unit of work? Rough sketch of the idea below.
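A minimal sketch of what I mean, with hypothetical names rather than the actual bigquery_file_loads code: key the temp-table results by destination, GroupByKey, and then have one DoFn see every temp table for a destination, so only the first copy truncates and the rest append:

```python
import apache_beam as beam


class _TriggerCopyJobsPerDestination(beam.DoFn):
    """Hypothetical DoFn: handles all temp tables for one destination together."""

    def process(self, element):
        destination, temp_tables = element
        for i, temp_table in enumerate(temp_tables):
            # Only the first copy for a destination truncates; later copies
            # append, so they no longer overwrite each other.
            disposition = 'WRITE_TRUNCATE' if i == 0 else 'WRITE_APPEND'
            yield (destination, temp_table, disposition)


def assign_copy_dispositions(temp_table_results):
    # temp_table_results: PCollection of (destination, temp_table) pairs.
    return (
        temp_table_results
        | 'GroupTempTablesByDestination' >> beam.GroupByKey()
        | 'TriggerCopyJobs' >> beam.ParDo(_TriggerCopyJobsPerDestination()))
```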
Issue Priority
Priority: 1
Issue Component
Component: io-py-gcp