dbt-athena
[Feature] add 'force_batch_from_source=true' to batch right from the start to allow handling large source and large target tables
Is this your first time submitting a feature request?
- [X] I have searched the existing issues, and I could not find an existing issue for this feature
Describe the feature
When a query over a very large source table produces a large target table, that query can run into query timeout problems on Athena.
`force_batch=true` is good for handling many partitions without running into the limit of 100 open partitions. However, with `force_batch=true` the first thing that happens is that the model is written without partitions into a temp table. If that table is very big, the query can time out, because it copies many, many GB of data in one go.
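To make that failure mode concrete, here is a minimal sketch of that first unpartitioned write, assuming `pyathena` as the client; the table names, bucket, and region are placeholders, and this illustrates the behavior described above rather than the adapter's actual code:

```python
from pyathena import connect

# Placeholder connection details -- not the adapter's real configuration.
cursor = connect(
    s3_staging_dir="s3://my-athena-results/",  # hypothetical results bucket
    region_name="eu-west-1",
).cursor()

# Conceptually the first force_batch step: one unpartitioned INSERT that
# moves the entire source in a single Athena query. On a 100 GB+ source
# this single statement can exceed Athena's query timeout.
cursor.execute(
    """
    INSERT INTO analytics.my_model__tmp
    SELECT col_a, col_b, partition_col
    FROM raw.source_table
    """
)
```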
In that case (or, in fact, always) the partitions should be leveraged right from the get-go: each source partition should be read individually and written into the target (staging) table, using threading. See the sketch at the end of this section.
I believe THIS CODE must be adapted.
I am not sure whether `force_batch=true` should be reworked to always behave this way, or whether a new flag `force_batch_from_source=true` should be introduced.
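For illustration, here is a rough sketch of the proposed "batch from the source" behavior, again assuming `pyathena`. Listing partitions via the `$partitions` metadata table is standard Athena, but the table names, partition column, connection details, and worker count are all made-up placeholders, not the adapter's implementation:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

from pyathena import connect

# Placeholder connection settings, shared by all worker threads.
CONN_KWARGS = {
    "s3_staging_dir": "s3://my-athena-results/",  # hypothetical results bucket
    "region_name": "eu-west-1",
}

def insert_partition(partition_value: str) -> str:
    """Copy one source partition into the target (staging) table.

    Each call is a single bounded Athena query, so no query ever has to
    move the whole 100 GB+ table at once.
    """
    # One connection per thread keeps this unambiguously thread-safe.
    cursor = connect(**CONN_KWARGS).cursor()
    cursor.execute(
        """
        INSERT INTO analytics.target_table
        SELECT col_a, col_b, partition_col
        FROM raw.source_table
        WHERE partition_col = %(p)s
        """,
        {"p": partition_value},
    )
    return partition_value

# List the source partitions via Athena's $partitions metadata table.
cursor = connect(**CONN_KWARGS).cursor()
cursor.execute('SELECT partition_col FROM "raw"."source_table$partitions"')
partitions = [row[0] for row in cursor.fetchall()]

# Fan out one INSERT per partition, mirroring dbt's `threads` setting.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(insert_partition, p) for p in partitions]
    for future in as_completed(futures):
        print(f"finished partition {future.result()}")
```

From the model's point of view, enabling this could be as simple as setting the proposed `force_batch_from_source=true` (or a reworked `force_batch=true`) in the model config, analogous to how `force_batch` is set today.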
Describe alternatives you've considered
No response
Who will this benefit?
This will benefit people who work with very large partitioned Athena tables (100 GB+) that they want to write into another Athena table, which may itself have many partitions.
Are you interested in contributing this feature?
Yes.
Anything else?
No response