
[Feature] add 'force_batch_from_source=true' to batch right from the start to allow handling large source and large target tables


Is this your first time submitting a feature request?

  • [X] I have searched the existing issues, and I could not find an existing issue for this feature

Describe the feature

When you have a very large source table and a query on that source table that produces a large target table, you can run into query timeout problems on Athena.

force_batch=true is good for handling many partitions without hitting Athena's limit of 100 open partitions per query. However, with force_batch=true the first thing that happens is that the model result is written, without partitions, into a temp table. If that result is very large, this step can time out, because it means copying many gigabytes of data in a single query.
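For reference, this is roughly how force_batch is used in a model config today (model, source, and partition column names below are illustrative):

```sql
-- models/big_events.sql
{{
    config(
        materialized='table',
        format='parquet',
        partitioned_by=['dt'],
        force_batch=true
    )
}}

-- With force_batch=true the full result of this select is first written,
-- unpartitioned, to a temp table, and only afterwards inserted into the
-- partitioned target in batches of partitions.
select event_id, payload, dt
from {{ source('raw', 'very_large_events') }}
```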

In that case (or, arguably, always) the partitions should be leveraged right from the start: each source partition should be read and written directly into the target (staging) table, using threading, as sketched below.
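A minimal sketch of the statements the adapter could issue instead, assuming a partition column dt that exists in both source and target (all table and column names are illustrative). Each statement only moves one partition's worth of data, and the adapter's existing thread pool could run several of them concurrently:

```sql
-- One bounded insert per source partition, instead of one huge unpartitioned copy.
insert into analytics.big_events
select event_id, payload, dt
from raw.very_large_events
where dt = '2024-08-01';

insert into analytics.big_events
select event_id, payload, dt
from raw.very_large_events
where dt = '2024-08-02';

-- ...one such statement per partition, issued in parallel by the adapter.
```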

I believe THIS CODE must be adapted.

I am not sure whether force_batch=true should be reworked to always behave this way, or whether a new flag force_batch_from_source=true should be introduced.
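If a separate flag were added, usage could look like the following. Note that force_batch_from_source is only the flag proposed in this issue and does not exist in dbt-athena today:

```sql
-- force_batch_from_source is the proposed (hypothetical) flag, not an existing option
{{
    config(
        materialized='table',
        format='parquet',
        partitioned_by=['dt'],
        force_batch_from_source=true
    )
}}

select event_id, payload, dt
from {{ source('raw', 'very_large_events') }}
```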

Describe alternatives you've considered

No response

Who will this benefit?

This will benefit people who work with very large partitioned Athena tables (100 GB and more) that they want to write into another Athena table, which may itself have many partitions.

Are you interested in contributing this feature?

Yes.

Anything else?

No response

pransito avatar Aug 05 '24 12:08 pransito