Optimize BigQuery Partitioned Table Incremental Queries

sabrikaragonen opened this issue 1 year ago • 5 comments

Feature description

I'm using delete+insert for a BigQuery pipeline, and it's pretty expensive since the delete scans the whole table. The problem is that BigQuery can't prune partitions when the partition filter is computed dynamically, as in the use case below.

Are you a dlt user?

Yes, I run dlt in production.

Use case

delete+insert on a column called partition_col, which is a timestamp/date/datetime column:

-- DLT DELETE INSERT
-- 33 GB
DELETE FROM `project.dataset.table` AS d 
WHERE EXISTS (SELECT 1 FROM `project.dataset_staging.table` AS s WHERE d.`partition_col` = s.`partition_col`);
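For context, a resource configured roughly like the following triggers this delete-insert pattern. This is a minimal sketch with illustrative names, assuming dlt's merge_key hint and the default "delete-insert" merge strategy:

# Sketch: with write_disposition="merge" and a merge_key, dlt's default
# "delete-insert" strategy emits a DELETE like the one above before inserting
# the staging rows. Resource name, column, and data are illustrative.
import dlt

@dlt.resource(write_disposition="merge", merge_key="partition_col")
def events():
    yield {"partition_col": "2024-11-08T00:00:00Z", "value": 1}

pipeline = dlt.pipeline(destination="bigquery", dataset_name="dataset")
pipeline.run(events())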

Proposed solution

A solution to this is to collect the distinct partition values into a scripting variable and then delete the rows whose values are in that array. Because the variable is materialized before the DELETE runs, BigQuery can prune partitions:

-- ALTERNATIVE SOLUTION
-- 180 MB
declare partition_values array<timestamp> default (SELECT array_agg(distinct partition_col) FROM `project.dataset_staging.table` AS s);

DELETE FROM `project.dataset.table` AS d 
WHERE partition_col IN unnest(partition_values);

The potential problems with this solution, neither of which is serious, are:

  • We need to know the data type of the incremental key (timestamp in this example); see the sketch after this list.
  • BigQuery limits a single value to 100 MB, so the array cannot hold more than 100 MB of partition values, which is very unlikely for incremental keys.
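A minimal sketch of how the typed DECLARE could be derived from the staging table's schema and the two-statement script executed with google-cloud-bigquery. Table and column names are illustrative, and the type mapping is simplified (e.g. an INTEGER key would need to map to INT64):

# Sketch: derive the array element type from the staging table schema, then
# run the two-statement script as a single BigQuery job.
from google.cloud import bigquery

client = bigquery.Client()
staging = client.get_table("project.dataset_staging.table")

# SchemaField.field_type is "TIMESTAMP", "DATE", "DATETIME", ... for the key column.
col_type = next(f.field_type for f in staging.schema if f.name == "partition_col")

script = f"""
DECLARE partition_values ARRAY<{col_type}> DEFAULT (
  SELECT ARRAY_AGG(DISTINCT partition_col) FROM `project.dataset_staging.table`);
DELETE FROM `project.dataset.table`
WHERE partition_col IN UNNEST(partition_values);
"""
client.query(script).result()  # a multi-statement script executes as one job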

Related issues

No response

sabrikaragonen avatar Nov 08 '24 10:11 sabrikaragonen

@sabrikaragonen how does that work in the case of a MERGE statement? You can opt for the upsert merge strategy. In the case of a single table (no nested tables) we do not do any DELETE.

Anyway, what you point out here is valid, and we could try to optimize when a partition hint is found. Is there no way to put your solution into a single clause? That would make the fix WAY easier.
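For reference, opting into the upsert strategy mentioned here is a resource-level setting. A minimal sketch, assuming a primary_key and illustrative names; note that, as the reply below points out, the generated MERGE still scans the whole destination table:

# Sketch of the upsert merge strategy; it requires a primary_key and, on
# BigQuery, generates a MERGE instead of DELETE+INSERT.
import dlt

@dlt.resource(
    primary_key="id",
    write_disposition={"disposition": "merge", "strategy": "upsert"},
)
def events():
    yield {"id": 1, "partition_col": "2024-11-08", "value": 42}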

rudolfix avatar Nov 25 '24 11:11 rudolfix

Sorry, I missed the question. It's still the same for a MERGE statement: the MERGE scans the entire destination table.

Unfortunately, there is no other way for now, as far as I know.

Another solution would be time interval materialization:

Since dlt already knows the min/max values of the incremental column, we can use these values to delete from the table.

delete from `project.dataset.table` WHERE partition_col between '{{ start_date }}' and '{{ end_date }}';
insert into `project.dataset.table` select * from `project.dataset_staging.table`;

Here, the only problem is that if partition_col is a date, you have to pass start_date and end_date as dates, which might be OK. However, if partition_col is a timestamp column, filtering on date bounds alone might delete extra data. An alternative that keeps partition pruning on the date while restricting to the exact timestamps would be:

delete from `project.dataset.table` 
WHERE date(partition_col) between '{{ start_date }}' and '{{ end_date }}'
  and partition_col between '{{ start_timestamp }}' and '{{ end_timestamp }}';
insert into `project.dataset.table` select * from `project.dataset_staging.table`;
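One way to check whether such a filter actually prunes partitions is a dry run, which reports the bytes a statement would scan without executing it. A minimal sketch with google-cloud-bigquery; the bound values and table names are illustrative:

# Sketch: estimate the bytes the templated DELETE would scan, using a dry run.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
DELETE FROM `project.dataset.table`
WHERE DATE(partition_col) BETWEEN @start_date AND @end_date
  AND partition_col BETWEEN @start_ts AND @end_ts
"""
cfg = bigquery.QueryJobConfig(
    dry_run=True,  # estimate only; nothing is deleted
    query_parameters=[
        bigquery.ScalarQueryParameter("start_date", "DATE", "2024-11-01"),
        bigquery.ScalarQueryParameter("end_date", "DATE", "2024-11-08"),
        bigquery.ScalarQueryParameter("start_ts", "TIMESTAMP", "2024-11-01 00:00:00+00"),
        bigquery.ScalarQueryParameter("end_ts", "TIMESTAMP", "2024-11-08 12:00:00+00"),
    ],
)
job = client.query(sql, job_config=cfg)
print(f"bytes that would be scanned: {job.total_bytes_processed}")

Comparing this estimate against the unfiltered DELETE (33 GB vs 180 MB in the numbers above) is a quick way to validate any candidate query shape before wiring it into dlt.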

sabrikaragonen avatar Jun 04 '25 17:06 sabrikaragonen

@rudolfix hey, did you have a chance to think about this? It is causing significant costs on larger tables. Happy to contribute an improvement if you can point us in the right direction.

AFAIK dlt already keeps track of distinct values in incremental columns, which means it could, at least in theory, already generate a much more efficient query.

karakanb avatar Jul 02 '25 15:07 karakanb

@rudolfix Bumping this up, happy to contribute the implementation ourselves if you have any guidance on the best way to go for this.

karakanb avatar Jul 18 '25 11:07 karakanb

Has there been any progress here, or is there any other workaround aside from doing manual deletes and using the "append" write disposition?

IcaroG avatar Oct 07 '25 13:10 IcaroG