Optimize BigQuery Partitioned Table Incremental Queries
Feature description
I'm using the delete+insert strategy for a BigQuery table, and it's pretty expensive since the DELETE scans the whole destination table. The problem is that BigQuery can't use partition pruning when the partition filter is computed dynamically, as in the use case below.
Are you a dlt user?
Yes, I run dlt in production.
Use case
delete+insert keyed on a column called partition_col, which is a timestamp/date/datetime column:
-- DLT DELETE INSERT
-- scans 33 GB
DELETE FROM `project.dataset.table` AS d
WHERE EXISTS (SELECT 1 FROM `project.dataset_staging.table` AS s WHERE d.`partition_col` = s.`partition_col`);
Proposed solution
A solution to that is assigning the distinct partition values to a scripting variable and then deleting the rows whose partition_col is in that array. Because the values are materialized before the DELETE runs, BigQuery can prune partitions, which is why the scanned bytes drop from 33 GB to 180 MB.
-- ALTERNATIVE SOLUTION
-- scans 180 MB
DECLARE partition_values ARRAY<TIMESTAMP> DEFAULT (SELECT ARRAY_AGG(DISTINCT partition_col) FROM `project.dataset_staging.table`);
DELETE FROM `project.dataset.table` AS d
WHERE partition_col IN UNNEST(partition_values);
The potential problems of this solution, neither of which is a big deal, are:
- We need to know the data type of the incremental key (TIMESTAMP in the example above), because the variable must be declared with that type (see the sketch after this list).
- BigQuery has a 100 MB limit for a single cell, so the array can't hold more than 100 MB of distinct values, which is very unlikely to happen with incremental keys.
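For illustration, a minimal sketch of the same statement for a DATE partition column; the only change is the declared type of the variable (table and column names are the same as above):
-- hypothetical variant for a DATE partition column
DECLARE partition_values ARRAY<DATE> DEFAULT (SELECT ARRAY_AGG(DISTINCT partition_col) FROM `project.dataset_staging.table`);
DELETE FROM `project.dataset.table`
WHERE partition_col IN UNNEST(partition_values);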
Related issues
No response
@sabrikaragonen how does that work in the case of a MERGE statement? You can opt for the upsert merge strategy. In the case of a single table (no nested tables) we do not do any DELETE.
Anyway, what you point out here is valid, and we could try to optimize when a partition hint is found. Is there no way to put your solution in a single clause? That would make the fix WAY easier.
Sorry, I missed the question. It's still the same for the MERGE statement: the MERGE scans the whole destination table.
Unfortunately, there is no other way for now, as far as I know.
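For context, the obvious single-statement form would move the staging lookup into a subquery, but that is exactly the kind of dynamic filter BigQuery does not prune on, so it would still scan the whole table (a sketch, using the same table names as above):
-- single-statement form: the subquery is a dynamic filter, so partitions are not pruned
DELETE FROM `project.dataset.table`
WHERE partition_col IN (SELECT DISTINCT partition_col FROM `project.dataset_staging.table`);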
Another solution would be time interval materialization:
Since dlt already knows the min/max values of the incremental column, we can use those values to delete from the table:
DELETE FROM `project.dataset.table` WHERE partition_col BETWEEN '{{ start_date }}' AND '{{ end_date }}';
INSERT INTO `project.dataset.table` SELECT * FROM `project.dataset_staging.table`;
Here the only problem is that if partition_col is a DATE, you have to pass start_date and end_date as dates, which might be OK. However, if partition_col is a TIMESTAMP column, date-level bounds might delete extra data. An alternative for that case would be:
DELETE FROM `project.dataset.table`
WHERE DATE(partition_col) BETWEEN '{{ start_date }}' AND '{{ end_date }}'
  AND partition_col BETWEEN '{{ start_timestamp }}' AND '{{ end_timestamp }}';
INSERT INTO `project.dataset.table` SELECT * FROM `project.dataset_staging.table`;
@rudolfix hey, did you have a chance to think about this? It is causing significant costs in larger tables. Happy to contribute to an improvement if you can point us in the right direction.
AFAIK dlt already keeps track of distinct values in incremental columns, which means it could, in theory, already generate a much more efficient query.
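For illustration only, a sketch of what such a generated statement could look like if the tracked distinct values were inlined as literals; the timestamps are hypothetical and this is not what dlt generates today:
-- hypothetical generated statement: constant literals let BigQuery prune partitions
DELETE FROM `project.dataset.table`
WHERE partition_col IN (TIMESTAMP '2024-01-01 00:00:00', TIMESTAMP '2024-01-02 00:00:00');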
@rudolfix Bumping this up; happy to contribute the implementation ourselves if you have any guidance on the best way to go about this.
Has there been any progress here, or is there any other workaround aside from doing manual deletes and using the "append" write disposition?
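For reference, a minimal sketch of that manual workaround, assuming the table and partition column names from the earlier examples: run a partition-bounded delete with constant dates yourself, then load with the "append" write disposition.
-- hypothetical manual pre-delete before an append load; the date range is an assumption
DELETE FROM `project.dataset.table`
WHERE DATE(partition_col) BETWEEN '2024-01-01' AND '2024-01-07';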