[Improvement]: (Python BQ batch loads) Perform one copy job for a set of temp tables instead of one copy job per temp table
What would you like to happen?
For large Java BQ batch loads that require copy jobs, the temp tables created for a given destination are grouped into a list of table references, and that list is later used as the source for a copy job. Ultimately, only one copy job is performed per destination table.
Contrast this with the Python implementation: one copy job is performed for each temp table, even if they all go to the same destination. The `_insert_copy_job()` function in `bigquery_tools` only allows a single source. Supporting multiple sources should be possible, though, because the `JobConfigurationTableCopy` from the internal client has a `sourceTables` field. I have not tried this, but I assume it can accept a list of table references as sources.
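To make the idea concrete, here is a rough sketch of the grouping step. It is not Beam code: `table_ref` and `build_copy_job_configs` are hypothetical helpers, and the output is a plain dict shaped like the `configuration.copy` object of the BigQuery REST API (which has `sourceTables`, `destinationTable`, `createDisposition` and `writeDisposition` fields). The actual change would need to build the equivalent `JobConfigurationTableCopy` message inside `bigquery_tools`, which I have not tried.

```python
from collections import defaultdict


def table_ref(project, dataset, table):
    """Hypothetical helper: a dict shaped like a BigQuery TableReference."""
    return {"projectId": project, "datasetId": dataset, "tableId": table}


def build_copy_job_configs(temp_to_destination, write_disposition="WRITE_APPEND"):
    """Group temp tables by destination; build one copy config per destination.

    temp_to_destination: iterable of (temp_table_ref, destination_table_ref).
    Returns a list of dicts shaped like the REST API's job configuration,
    each with a `copy` entry that lists every temp table in `sourceTables`.
    """
    grouped = defaultdict(list)
    destinations = {}
    for temp, dest in temp_to_destination:
        # Dicts are not hashable, so key on the fully qualified table name.
        key = (dest["projectId"], dest["datasetId"], dest["tableId"])
        grouped[key].append(temp)
        destinations[key] = dest

    return [
        {
            "copy": {
                "sourceTables": sources,
                "destinationTable": destinations[key],
                "createDisposition": "CREATE_IF_NEEDED",
                "writeDisposition": write_disposition,
            }
        }
        for key, sources in grouped.items()
    ]


# Three temp tables, two of which target the same destination.
dest_a = table_ref("my-project", "my_dataset", "events")
dest_b = table_ref("my-project", "my_dataset", "users")
pairs = [
    (table_ref("my-project", "my_dataset", "beam_temp_0"), dest_a),
    (table_ref("my-project", "my_dataset", "beam_temp_1"), dest_a),
    (table_ref("my-project", "my_dataset", "beam_temp_2"), dest_b),
]
configs = build_copy_job_configs(pairs)
```

With this grouping, the three temp tables above would need two copy jobs instead of three: one per destination.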
Reducing multiple copy jobs down to one should improve the speed of large writes (less time wasted spinning up multiple jobs and waiting for them). This may also prevent partial writes in the event one copy job fails.
Issue Priority
Priority: 3
Issue Component
Component: io-py-gcp