ecto_ch
ecto_ch copied to clipboard
drop hints support
It doesn't seem like ClickHouse supports hints. In Plausible query.hints are used as a workaround to pass the SAMPLE clause.
Maybe it can be similar to input/1 helper function:
q = from t in sample("table", 0.1), select: count(t.id) * 10
# select count(t.id) * 10 from table t sample 0.1
I use them to add FINAL as I wrote here https://github.com/plausible/ecto_ch/pull/103/files (which I will update soon :) )
It doesn't seem like FINAL is a hint though. I think ClickHouse is using settings in place of hints: https://github.com/ClickHouse/ClickHouse/issues/11719#issuecomment-646027073
Indeed it's not a hint but it seems to be the only way to pass it into that query from Ecto. I don't think Ecto supports a settings clause?
It doesn't have to go into the Ecto query, settings can be passed separately from the query.
I think final still has some issues because it cannot move always the where clause to prewhere. See https://github.com/ClickHouse/ClickHouse/issues/31411
So I would advise to not use/promote the usage of final too much.
Some queries cannot be performed without FINAL but I think that is tangential to this issue.
Really? Do you have an example? I am curious because I cannot think of an example maybe because I never encountered such a situation
When using an AggregatingMergeTree and you need the merged result for example
You can use aggregating functions - sum, avg, argMaxMerge, etc. - + group by to have the same effect as your final query
I don't use it like that, I merge several partial records into one final record, like this (simplified):
`id` FixedString(26),
`started_at` DateTime('UTC') MATERIALIZED ULIDStringToDateTime(id),
`disconnected_at` SimpleAggregateFunction(max, DateTime('UTC')),
`user_responded_at` SimpleAggregateFunction(max, DateTime('UTC')),
`properties` SimpleAggregateFunction(groupArrayArray, Array(String))
Additional properties (in the properties column) can be added to the record at any time, and the disconnected_at and user_responded_at are inserted later as well. They are eventually merged based on the ID column.
But can't you just do
select id, started_at, max(disconnected_at) as disconnected_at, max(user_responded_at) as user_responded_at, groupArrayArray(properties) as properties from <table> group by id, started_at
Maybe I am not really following. Still I don't think there is any query which you can only run with final and not as a group by + aggregation.
I don't know if I can, but why would I do it that way? That is a way longer query. Perhaps you are right and FINAL is not strictly speaking necessary but I don't see the advantage of writing a longer query.
It depends on your use case. If you have a small dataset final should be fine. If you have a larger dataset final should be avoided. If you don't run into performance issues or don't care about speed then final is also fine.
If I query with a group by, I get
10 rows in set. Elapsed: 6.001 sec. Processed 33.86 million rows, 802.51 MB (5.64 million rows/s., 133.73 MB/s.)
and with the FINAL modifier:
10 rows in set. Elapsed: 0.584 sec. Processed 94.00 thousand rows, 110.76 KB (160.89 thousand rows/s., 189.56 KB/s.)
With GROUP BY, Clickhouse will probably scan every row, while FINAL can be performed in a more efficient way.
When I repeat the query it is even worse:
10 rows in set. Elapsed: 6.421 sec. Processed 33.86 million rows, 802.52 MB (5.27 million rows/s., 124.98 MB/s.)
remains.
With FINAL:
10 rows in set. Elapsed: 0.096 sec. Processed 94.01 thousand rows, 111.21 KB (982.83 thousand rows/s., 1.16 MB/s.)
So this is over 70 times slower and it looks like it will slow down further with additional records.
Group by
select conversation_id, max(disconnected_at) as disconnected_at, max(user_responded_at) as user_responded_at from conversations group by conversation_id limit 10 format Vertical
final
select conversation_id, disconnected_at, user_responded_at from conversations final limit 10 format Vertical
I mean 0.584 isn't bad depending on use case but also I dont see the full query & table so cannot say whether the query without final is performant written or not (but most likely it is not)
If you say so. I will stop derailing this issue, I think I've made a clear case for FINAL.
The case is not clear because I don't see the table definition and the queries you used that result in that huge performance diff
Edit: I am not against the PR but I am just saying that in all cases you can have at least equal performance and in most cases better performance without final modifier
Closing this for now as Ecto :hints field in from and join seems like the easiest workaround for SAMPLE, special JOINS, and some other clauses in ClickHouse.