BigQuery: OSS elementary package processes too much data with long history or frequent runs
Is your feature request related to a problem? Please describe.
I'm using the open source elementary package on BigQuery. When dbt run/test/build is scheduled frequently (for example, every 15 minutes), the MERGE (the default incremental strategy) becomes very expensive, because the elementary tables are neither clustered nor partitioned.
Describe the solution you'd like
I'd like elementary not to process so much data when appending the metadata of new dbt runs.
Describe alternatives you've considered
There are a few options here that might work:
- Simple: cluster the elementary models by their unique key, i.e. add cluster_by to the model config only when the adapter is BigQuery.
- More complex: add incremental_predicates so that the merge only scans the current day (if duplicates can happen at all). This would still require the models to be partitioned on a date column, which is not the case at the moment. Partitioning the models on a date might also be beneficial for querying and storing the elementary tables.
- More complex: make the models append-only, either by creating a separate append-only incremental strategy for BigQuery or by taking advantage of the fact that a merge without a unique key behaves like an append in BigQuery. There shouldn't be duplicates anyway, should there? It would still be a good idea to partition on a date, so that querying the target table is not too expensive.
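
As a rough sketch of options 1 and 2, a model config along these lines could apply the BigQuery-specific settings only when `target.type` is `bigquery`. The `model_execution_id` key and the `created_date` partition column are assumptions for illustration: the actual elementary models may use different keys, and they currently have no date column to partition on. `cluster_by`, `partition_by` and `incremental_predicates` are standard dbt configs, but how much the predicate reduces the bytes scanned by the MERGE would need to be measured.

```sql
{# Sketch only: hypothetical config for one elementary incremental model. #}
{# model_execution_id and created_date are assumed column names.          #}
{% set is_bigquery = target.type == 'bigquery' %}

{{
  config(
    materialized='incremental',
    unique_key='model_execution_id',
    cluster_by=['model_execution_id'] if is_bigquery else none,
    partition_by={
      'field': 'created_date',
      'data_type': 'date',
      'granularity': 'day'
    } if is_bigquery else none,
    incremental_predicates=[
      "DBT_INTERNAL_DEST.created_date >= date_sub(current_date(), interval 1 day)"
    ] if is_bigquery else none
  )
}}

{# The model's existing select statement would follow unchanged. #}
```

The predicate uses a one-day lookback rather than only the current day so that runs shortly after midnight can still match rows written the day before. Dropping the `partition_by` and `incremental_predicates` entries leaves just option 1.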
Additional context
N/A. I can provide an estimate of the data processed by the merge. I'm using version 0.16.0 of the package with dbt-bigquery and dbt-core 1.8.*.
Would you be willing to contribute this feature?
Yes, I'd be happy to contribute, but I'm not sure what the best course of action is. Option 1 is probably the simplest to implement, but options 2 and 3 sound better considering BigQuery best practices.
Hey @marzaccaro ! Thanks for raising this.
I understand what you're explaining, and I think the best practice in this case would be to actually change the cadence of the Elementary on-run-end hooks. You can disable them this way and run dbt run --select elementary when you want them to run (say, hourly). 😄
This issue is stale because it has been open for too long with no activity. If you would like the issue to remain open, please remove the stale label or leave a comment.
This issue was closed because it has been inactive for too long while being marked as stale. If you would like the issue to reopen, please leave a comment.