ecto_ch icon indicating copy to clipboard operation
ecto_ch copied to clipboard

drop hints support

Open ruslandoga opened this issue 2 years ago • 17 comments
trafficstars

It doesn't seem like ClickHouse supports hints. In Plausible query.hints are used as a workaround to pass the SAMPLE clause.

Maybe it can be similar to input/1 helper function:

q = from t in sample("table", 0.1), select: count(t.id) * 10
# select count(t.id) * 10 from table t sample 0.1

ruslandoga avatar Sep 19 '23 08:09 ruslandoga

I use them to add FINAL as I wrote here https://github.com/plausible/ecto_ch/pull/103/files (which I will update soon :) )

hkrutzer avatar Sep 19 '23 10:09 hkrutzer

It doesn't seem like FINAL is a hint though. I think ClickHouse is using settings in place of hints: https://github.com/ClickHouse/ClickHouse/issues/11719#issuecomment-646027073

ruslandoga avatar Sep 19 '23 10:09 ruslandoga

Indeed it's not a hint but it seems to be the only way to pass it into that query from Ecto. I don't think Ecto supports a settings clause?

hkrutzer avatar Sep 19 '23 10:09 hkrutzer

It doesn't have to go into the Ecto query, settings can be passed separately from the query.

ruslandoga avatar Sep 19 '23 10:09 ruslandoga

I think final still has some issues because it cannot move always the where clause to prewhere. See https://github.com/ClickHouse/ClickHouse/issues/31411

So I would advise to not use/promote the usage of final too much.

Zarathustra2 avatar Sep 19 '23 10:09 Zarathustra2

Some queries cannot be performed without FINAL but I think that is tangential to this issue.

hkrutzer avatar Sep 19 '23 10:09 hkrutzer

Really? Do you have an example? I am curious because I cannot think of an example maybe because I never encountered such a situation

Zarathustra2 avatar Sep 19 '23 10:09 Zarathustra2

When using an AggregatingMergeTree and you need the merged result for example

hkrutzer avatar Sep 19 '23 10:09 hkrutzer

You can use aggregating functions - sum, avg, argMaxMerge, etc. - + group by to have the same effect as your final query

Zarathustra2 avatar Sep 19 '23 10:09 Zarathustra2

I don't use it like that, I merge several partial records into one final record, like this (simplified):

`id` FixedString(26),
`started_at` DateTime('UTC') MATERIALIZED ULIDStringToDateTime(id),
`disconnected_at` SimpleAggregateFunction(max, DateTime('UTC')),
`user_responded_at` SimpleAggregateFunction(max, DateTime('UTC')),
`properties` SimpleAggregateFunction(groupArrayArray, Array(String))

Additional properties (in the properties column) can be added to the record at any time, and the disconnected_at and user_responded_at are inserted later as well. They are eventually merged based on the ID column.

hkrutzer avatar Sep 19 '23 10:09 hkrutzer

But can't you just do

select id, started_at, max(disconnected_at) as disconnected_at, max(user_responded_at) as user_responded_at, groupArrayArray(properties) as properties from <table> group by id, started_at

Maybe I am not really following. Still I don't think there is any query which you can only run with final and not as a group by + aggregation.

Zarathustra2 avatar Sep 19 '23 10:09 Zarathustra2

I don't know if I can, but why would I do it that way? That is a way longer query. Perhaps you are right and FINAL is not strictly speaking necessary but I don't see the advantage of writing a longer query.

hkrutzer avatar Sep 19 '23 11:09 hkrutzer

It depends on your use case. If you have a small dataset final should be fine. If you have a larger dataset final should be avoided. If you don't run into performance issues or don't care about speed then final is also fine.

Zarathustra2 avatar Sep 19 '23 11:09 Zarathustra2

If I query with a group by, I get

10 rows in set. Elapsed: 6.001 sec. Processed 33.86 million rows, 802.51 MB (5.64 million rows/s., 133.73 MB/s.)

and with the FINAL modifier:

10 rows in set. Elapsed: 0.584 sec. Processed 94.00 thousand rows, 110.76 KB (160.89 thousand rows/s., 189.56 KB/s.)

With GROUP BY, Clickhouse will probably scan every row, while FINAL can be performed in a more efficient way.

When I repeat the query it is even worse:

10 rows in set. Elapsed: 6.421 sec. Processed 33.86 million rows, 802.52 MB (5.27 million rows/s., 124.98 MB/s.)

remains.

With FINAL:

10 rows in set. Elapsed: 0.096 sec. Processed 94.01 thousand rows, 111.21 KB (982.83 thousand rows/s., 1.16 MB/s.)

So this is over 70 times slower and it looks like it will slow down further with additional records.

Group by

select conversation_id, max(disconnected_at) as disconnected_at, max(user_responded_at) as user_responded_at from conversations group by conversation_id limit 10 format Vertical

final

select conversation_id, disconnected_at, user_responded_at from conversations final limit 10 format Vertical

hkrutzer avatar Sep 19 '23 11:09 hkrutzer

I mean 0.584 isn't bad depending on use case but also I dont see the full query & table so cannot say whether the query without final is performant written or not (but most likely it is not)

Zarathustra2 avatar Sep 19 '23 11:09 Zarathustra2

If you say so. I will stop derailing this issue, I think I've made a clear case for FINAL.

hkrutzer avatar Sep 19 '23 11:09 hkrutzer

The case is not clear because I don't see the table definition and the queries you used that result in that huge performance diff

Edit: I am not against the PR but I am just saying that in all cases you can have at least equal performance and in most cases better performance without final modifier

Zarathustra2 avatar Sep 19 '23 11:09 Zarathustra2

Closing this for now as Ecto :hints field in from and join seems like the easiest workaround for SAMPLE, special JOINS, and some other clauses in ClickHouse.

ruslandoga avatar Oct 15 '24 12:10 ruslandoga