sqlmesh runs seem to always take hours
What is it?
I only wanted to copy 1 column into metrics_v0, and it turned into a multi-hour run. Is there a bug? I guess logging could help determine this.
Ok. So I've been spending maybe too much time on this but the current performance of sqlmesh + clickhouse is seemingly untenable.
What we're seeing right now are a few things:
- ClickHouse doesn't handle deletes very well. They're normally asynchronous, but sqlmesh issues a DELETE to do its incremental loads. In asynchronous mode there are unexpected errors caused by data not being ready because it's still mutating in ClickHouse.
- Concurrent requests also seem to have similar problems related to this in ClickHouse. There's more to test here, but it's not looking good.
- We can get this to run without issue, but then we need to make the deletes synchronous and, currently, run them serially. The caveat is that this takes ages: after 5 hours of execution I've only completed 5% of the queued queries (this is if we process all time). There's a rough sketch of the synchronous approach just below.
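Assuming the incremental interval is cleared with an `ALTER TABLE ... DELETE` mutation (the time column and date range here are made up for illustration), the synchronous path looks roughly like this:

```sql
-- Force ClickHouse to wait for the delete mutation to finish before returning,
-- so the subsequent re-insert of the interval doesn't race still-mutating parts.
SET mutations_sync = 2;  -- 0 = async (default), 1 = wait on this server, 2 = wait on all replicas

ALTER TABLE metrics_v0 DELETE
WHERE metric_time >= toDateTime('2024-01-01 00:00:00')
  AND metric_time <  toDateTime('2024-02-01 00:00:00');

-- Only after the mutation has fully applied does the interval get re-inserted;
-- in async mode the INSERT can land while the old rows are still being removed.
```

That's safe, but it serializes every interval behind its delete, which lines up with the 5-hours-for-5% behavior above.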
I've messed with various scaling/optimization techniques:
- Partitioning. This seems to have helped with some memory issues we had initially when running some of the sqlmesh queries over the larger rolling time windows, so good on that.
- Increasing concurrency in sqlmesh. This seemed to help, but then I ran into issues with mutating the same tables that are being deleted from. I think there's a way to set concurrency limits on a per-model basis so that any given model doesn't do anything concurrently with itself.
- Instead of using the default `MergeTree` engine, I have switched to `VersionedCollapsingMergeTree` in some cases. This seems to help significantly: some requests that were taking ~5-10 min without it now finish in under a minute. There are some serious caveats with this approach, but it might be worth using; see the docs. (A sketch combining this with the partitioning is after this list.)
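Roughly what the partitioning plus engine change looks like together, as a sketch with hypothetical columns (not our actual schema):

```sql
CREATE TABLE metrics_v0
(
    metric_time DateTime,
    project_id  String,
    amount      Float64,
    Sign        Int8,     -- 1 = state row, -1 = cancelling row
    Version     UInt64    -- bumped each time an interval is restated
)
ENGINE = VersionedCollapsingMergeTree(Sign, Version)
PARTITION BY toYYYYMM(metric_time)   -- keeps an incremental interval confined to a few parts
ORDER BY (project_id, metric_time);
```

Restating an interval then becomes inserting cancelling rows (Sign = -1) for the old state alongside the new rows (Sign = 1) and letting background merges collapse the pairs, instead of running a delete mutation.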
I don't want to abandon ClickHouse, but it's definitely giving me a bunch of headaches. I think, particularly for the way our metrics are calculated, we could do massively parallel processing, but there seems to be limited capacity for this right now.
Hey! Curious if you have explored ClickHouse Materialized Views for this. Curious what issues you faced if that's the case!
Also, ClickHouse is... a beast, but a complicated beast. Have you considered getting some consulting help? The only thing I know about ClickHouse is that the right setting and setup might make a query go from never ending to completing in milliseconds. :sweat_smile:
@davidgasquez ya sadly I did look into materialized views, but many of our queries are based on rolling windows of the last X days/months/years. Previously with BigQuery we attempted to calculate this all at once by essentially generating data for every period across all time, but it ended up running out of memory. It also limited the amount of recursion one could do with those rolling windows. Now with sqlmesh we generate metrics across various dimensions by calling a timeseries_metrics function that takes some configuration for the metrics and automatically generates models for each of the dimensions we care about and all the specific times/rolling windows. So sadly, I think, except for simple aggregations, rolling windows won't work well with ClickHouse materialized views, unless I'm misunderstanding. That being said, yeah, we should definitely have a chat with a ClickHouse engineer.
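For context on why insert-time materialized views don't fit: this is roughly the shape of the rolling-window metrics (illustrative only; a hypothetical `events` table, not the actual timeseries_metrics output). Each day's value depends on re-reading the whole trailing window, which an MV that only sees the newly inserted block can't maintain:

```sql
-- Trailing 30-day sum per project, per day (illustrative).
SELECT
    project_id,
    day,
    sum(daily_amount) OVER (
        PARTITION BY project_id
        ORDER BY day
        RANGE BETWEEN 29 PRECEDING AND CURRENT ROW  -- offset is in days for a Date ORDER BY
    ) AS amount_30d
FROM
(
    SELECT
        project_id,
        toDate(event_time) AS day,
        sum(amount)        AS daily_amount
    FROM events
    GROUP BY project_id, day
)
ORDER BY project_id, day;
```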
Funny enough, running this with DuckDB: totally no problems haha. If only MotherDuck were on GCP.
I think the solution really is using `VersionedCollapsingMergeTree`; it just means we rewrite all of the metric aggregation functions to be aware of the Sign column that is present in the table. However, in my tests I had a query go from a 6 min execution to 5 seconds, both looking at something like 90-day periods.
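Roughly what the Sign-aware rewrite looks like, with illustrative column names: counts become `sum(Sign)` and sums become `sum(value * Sign)`, so cancelled rows net out even before merges collapse them.

```sql
SELECT
    project_id,
    sum(amount * Sign) AS amount_90d,
    sum(Sign)          AS event_count_90d  -- count() would include not-yet-collapsed cancel rows
FROM metrics_v0
WHERE metric_time >= now() - INTERVAL 90 DAY
GROUP BY project_id
HAVING sum(Sign) > 0;
```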
All this being said, because of the way our metrics are currently calculated (each as a separate table and then joined in a view later), I think it actually fits really well with horizontally scalable architectures like Trino, StarRocks, Spark, or Druid. ClickHouse doesn't do well with horizontal scaling based on all that I've seen, but I think these other solutions might actually perform well here. Sadly it's a fair amount of work to get that set up and tested as well.
Oh, interestingly though, I'm realizing I can improve the performance of some of the upstream models that we use to calculate the metrics with materialized views. I hadn't thought about this until now, but in theory that should take care of a big part of the initial calculation's issues. Though I may have to abandon doing any testing on DuckDB.
This could also be used for some of the other metric aggregations that aren't rolling windows.
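A minimal sketch of what such an upstream materialized view could look like (hypothetical raw `events` table and columns; it being ClickHouse-specific is what would complicate the DuckDB testing mentioned above):

```sql
-- Target table holding the pre-aggregated daily rollup.
CREATE TABLE events_daily
(
    day        Date,
    project_id String,
    amount     Float64
)
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (project_id, day);

-- Populated automatically whenever rows are inserted into `events`.
CREATE MATERIALIZED VIEW events_daily_mv TO events_daily AS
SELECT
    toDate(event_time) AS day,
    project_id,
    sum(amount)        AS amount
FROM events
GROUP BY day, project_id;

-- Reads should still GROUP BY, since SummingMergeTree collapses rows lazily:
-- SELECT day, project_id, sum(amount) FROM events_daily GROUP BY day, project_id;
```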
Well hrm, this might be exactly what I need: https://clickhouse.com/docs/en/sql-reference/statements/create/view#window-view-experimental
...but I imagine it isn't supported by sqlmesh. Hrmmmmm.
@ravenac95 CCed you on a thread to ask the Clickhouse folks directly. They've offered engineering support in the past.
Feel free to dump questions in the thread for them to forward to the right person
Links from the call
Marcelo Rodriguez:
- 1:40 PM: https://clickhouse.com/docs/en/sql-reference/statements/explain (`EXPLAIN indexes = 1 <your_query>;`)
- 1:42 PM: https://clickhouse.com/docs/knowledgebase/tips-tricks-optimizing-basic-data-types-in-clickhouse
- 1:45 PM: https://clickhouse.com/blog/clickhouse-fully-supports-joins-part1
- 1:47 PM: https://clickhouse.com/blog/common-getting-started-issues-with-clickhouse
- 1:49 PM: https://clickhouse.com/blog/monitoring-troubleshooting-insert-queries-clickhouse
- 1:52 PM: https://clickhouse.com/blog/clickhouse-dbt-project-introduction-and-webinar
Worth asking the sqlmesh folks whether there is an "append/insert-only" incremental strategy that doesn't require deletes
Going to close this in favor of a new issue to investigate sqlmesh runs. This was a bit of an all-consuming issue.