[Improvement]: (Python BQ batch loads) Perform one copy job for a set of temp tables instead of one copy job per temp table

Open ahmedabu98 opened this issue 2 years ago • 3 comments

What would you like to happen?

For large Java BQ batch loads that require copy jobs, the temp tables created for a given destination are grouped into a list of table references, and that list is later used as the source of a copy job. As a result, only one copy job is performed per destination table.
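To make the grouping concrete, here is a minimal sketch in plain Python. It is not the Java code and not Beam's implementation; the function name `group_temp_tables` and the table name strings are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def group_temp_tables(pairs):
    """Group temp tables by the destination they should be copied into.

    `pairs` is an iterable of (destination, temp_table) tuples.
    Returns a dict mapping each destination to the list of its temp tables,
    so that one copy job can later be issued per destination.
    """
    grouped = defaultdict(list)
    for destination, temp_table in pairs:
        grouped[destination].append(temp_table)
    return dict(grouped)


# Hypothetical table names, for illustration only.
pairs = [
    ("proj:ds.events", "proj:ds.tmp_0"),
    ("proj:ds.events", "proj:ds.tmp_1"),
    ("proj:ds.users", "proj:ds.tmp_2"),
]
grouped = group_temp_tables(pairs)
```

With this grouping, three temp tables result in two copy jobs (one per destination) rather than three.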

Contrast this with the Python implementation: one copy job is performed for each temp table, even when they all go to the same destination. The _insert_copy_job() function in bigquery_tools accepts only a single source table. Copying from several sources in one job should be possible, though, because the JobConfigurationTableCopy from the internal client has a sourceTables field. I have not tried this, but I assume it can accept a list of table references as sources.
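As a rough, untested sketch of what a multi-source copy request might look like, the snippet below builds a request body as a plain dict using the field names from the BigQuery REST API's copy configuration (`sourceTables`, `destinationTable`, `writeDisposition`, `createDisposition`). The helpers `table_ref` and `build_copy_job_body`, the default dispositions, and all project, dataset, table, and job names are assumptions for illustration; this is not the signature of `_insert_copy_job()`.

```python
def table_ref(project, dataset, table):
    """Build a table reference dict in the REST API's TableReference shape."""
    return {"projectId": project, "datasetId": dataset, "tableId": table}


def build_copy_job_body(project, job_id, sources, destination,
                        write_disposition="WRITE_APPEND",
                        create_disposition="CREATE_IF_NEEDED"):
    """Build a copy-job request body with several source tables.

    Hypothetical helper: `sources` is a list of table reference dicts and
    `destination` is a single table reference dict.
    """
    if not sources:
        raise ValueError("at least one source table is required")
    return {
        "jobReference": {"projectId": project, "jobId": job_id},
        "configuration": {
            "copy": {
                # A list of sources instead of the single `sourceTable`.
                "sourceTables": list(sources),
                "destinationTable": destination,
                "writeDisposition": write_disposition,
                "createDisposition": create_disposition,
            }
        },
    }


# Hypothetical names, for illustration only.
sources = [table_ref("proj", "ds", "tmp_%d" % i) for i in range(3)]
body = build_copy_job_body(
    "proj", "copy_job_0", sources, table_ref("proj", "ds", "events"))
```

Here three temp tables end up in one request body, so a single job would cover all of them, assuming the service accepts the list as described.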

Reducing multiple copy jobs down to one should improve the speed of large writes (less time wasted spinning up multiple jobs and waiting for them). This may also prevent partial writes in the event one copy job fails.

Issue Priority

Priority: 3

Issue Component

Component: io-py-gcp

ahmedabu98, Sep 14 '22 19:09