dbt-clickhouse
dbt-clickhouse copied to clipboard
Strategy for breaking down the cost of joins
I had a discussion with Roy Rozenblum on the public Slack and he recommends an interesting approach to spread the cost of joins:
The idea is this: we need to split the RIGHT table into buckets. Assume TabA join TabB.
The formula for creating the buckets is pretty simple:
toInt64(<TabB primary key> /* maybe we should hash if key is big */) % <number_of_buckets>
as bucket.
Then we just loop the INSERT command <number_of_buckets>
times, with the bucket number as parameter.
There are some challenges for clickhouse:
We can't just add a WHERE condition of bucket = X
because clickhouse doesn't do smart pushdown yet. So we need to add a subselect with the bucket expression and WHERE clause to wrap around the RIGHT table name / model.
Also we need to make sure it's an INNER JOIN only (it won't work otherwise).
So I think the approach is to add a special JOIN Macro, so instead of writing the join in a regular way, the TabA join TabB on X=Y we put a Macro there the does all of the above.
So something like this (slightly simplified):
/* Assume TabB's primary key is B1, B2 */
select * from
{{ inner_join_with_buckets(TabA, TabB, no_of_buckets=16) }} on TabA.A2 = TabB.B3 and TabB.A3 = TabB.B4
where TabA.A1 = 'Whatever'
Would result in this:
select * from
TabA join
(select *, toInt64(someHash(Tuple(B1, B2))) % 16 as this_rows_bucket from TabB where this_rows_bucket = <PassCurrentBucketSomehow>) as TabB
on TabA.A2 = TabB.B3 and TabB.A3 = TabB.B4
where TabA.A1 = 'Whatever'
Hi @Ryado ,
Maybe as a first step, we can have support for splitting a multi-select SQL statement into several INSERTs. If that can be done, it would then be pretty easy to create Macros supporting the breaking up complicated SQLs into smaller ones.
Let me give an example: If I create a model which results in this SQL:
SELECT * from memoryIntensiveSubselect1 UNION ALL SELECT * from memoryIntensiveSubselect2 UNION ALL SELECT * from memoryIntensiveSubselect3 UNION ALL SELECT * from memoryIntensiveSubselect4 UNION ALL SELECT * from memoryIntensiveSubselect5
This would now result in 5 INSERTs.
We can have some sort of configuration switch that decides if we split UNION ALL to different INSERTs or not.
With that capability in place, the community could easily create macros that split joins this way, as suggested in your comment.