ClickHouse and SQLMesh real-world usage issues
Let me preface by saying that I understand the ClickHouse engine is still very new.
After considerable work on our side trying to get things working properly with SQLMesh and ClickHouse, we've realized a few things that make the built-in materialization strategies unusable with ClickHouse. We are fine writing our own custom strategy at this time, but we wanted to take the opportunity to offer some insight into what we've seen and how we suggest improving it.
ClickHouse and deletes
As described to us by a ClickHouse engineer, written data is considered "immutable", so deletions are not generally part of the suggested workflow. Normally, when a DELETE command is issued, ClickHouse handles the deletion asynchronously; forcing it to be a synchronous operation is costly for the engine. In our experiments, when trying to backfill a very large model that sources some time series data, we found that the execution would take days to complete. Part of the issue is that we used a batch_size smaller than the full size of the dataset, but this was to reduce the memory requirements on our deployment. Since SQLMesh handles incremental materializations by doing a delete followed by an insert, even when the table is initially empty, the ClickHouse cluster cannot optimize this and the synchronous delete operations still take a long time to return success. If you instead let the deletes run asynchronously and continuously write to the table, you can hit concurrency issues that are not entirely predictable.
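For reference, reproduced by hand the cycle looks roughly like this with the clickhouse-connect client (a minimal sketch; the table, columns, and date range are hypothetical, and mutations_sync=2 is the setting that, as we understand it, makes the mutation block until it has completed on all replicas):

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Step 1: clear the target time interval. ALTER TABLE ... DELETE is a
# mutation; mutations_sync=2 makes the call block until the mutation has
# finished on all replicas -- the synchronous behavior that dominated our
# backfill runtimes.
client.command(
    "ALTER TABLE analytics.events DELETE "
    "WHERE ds BETWEEN '2024-01-01' AND '2024-01-07'",
    settings={"mutations_sync": 2},
)

# Step 2: re-insert the rebuilt interval. Even when the interval is empty
# (e.g. the very first backfill batch), step 1 still pays the mutation cost.
client.command(
    "INSERT INTO analytics.events "
    "SELECT ds, user_id, value FROM analytics.events_staging "
    "WHERE ds BETWEEN '2024-01-01' AND '2024-01-07'"
)
```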
Some suggestions for solutions
- Append-only incremental
  - This would obviously be some kind of opt-in flag, but it would allow the use of table engines like ReplacingMergeTree, or simply speed things up and leave de-duplication to the ClickHouse user (see the sketch after this list).
  - The obvious issues I see with this are:
    - A full restatement may still need to allow deletes, or write to a completely different table and then swap the references in the final schema.
    - A partial restatement would probably still be slow, and one would hope that run-time parameters could be passed to improve the efficiency of the materialization.
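To make the append-only suggestion concrete, here is a minimal sketch using SQLMesh's documented custom materialization API as we understand it (the class name is ours, and insert_append is the engine adapter call we believe performs a plain append):

```python
from __future__ import annotations

import typing as t

from sqlmesh import CustomMaterialization

if t.TYPE_CHECKING:
    from sqlmesh.core.engine_adapter._typing import QueryOrDF
    from sqlmesh.core.model import Model


class AppendOnlyIncremental(CustomMaterialization):
    """Insert-only incremental load: never issues a DELETE, so ClickHouse
    never has to run a synchronous mutation. De-duplication is left to the
    table engine (e.g. ReplacingMergeTree) or to downstream readers."""

    NAME = "append_only_incremental"

    def insert(
        self,
        table_name: str,
        query_or_df: QueryOrDF,
        model: Model,
        is_first_insert: bool,
        **kwargs: t.Any,
    ) -> None:
        # Append every batch as-is; no delete of the target interval first.
        self.adapter.insert_append(table_name, query_or_df)
```

A model would then opt in with `kind CUSTOM (materialization 'append_only_incremental')`, and full restatements would still need one of the escape hatches above (allow deletes, or build into a new table and swap references).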
Some requests
- Allow setting different materialization strategies depending on the dialect in use.
  - We currently have things set up to use DuckDB with local test data to test our models; this is one of the reasons we've been trying to integrate SQLMesh into our data pipeline rather than dbt. For this specific case, even if we wrote a custom materialization strategy for ClickHouse, it would be nice if we could still declare the normal INCREMENTAL_BY_TIME_RANGE kind (see the model sketch after this list).
- Make it easier to subclass the existing materialization strategies
  - Unless SQLMesh can include an "append-only" mode in the incremental-by-time-range strategy itself (something along the lines of the custom materialization sketched above).
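To illustrate what we mean: ideally a model like the following (a minimal Python-model sketch; the model, table, and column names are made up) could keep declaring the standard INCREMENTAL_BY_TIME_RANGE kind, with the dialect-specific strategy resolved by the gateway actually running it:

```python
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model
from sqlmesh.core.model.kind import ModelKindName


@model(
    "analytics.events_by_day",
    kind=dict(
        name=ModelKindName.INCREMENTAL_BY_TIME_RANGE,
        time_column="ds",
    ),
    columns={"ds": "date", "user_id": "int", "value": "double"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # The same definition runs against our local DuckDB gateway (test data)
    # and the ClickHouse production gateway; today the materialization
    # behavior is fixed by the kind rather than by the dialect in use.
    return context.fetchdf(
        f"SELECT ds, user_id, value FROM analytics.raw_events "
        f"WHERE ds BETWEEN '{start:%Y-%m-%d}' AND '{end:%Y-%m-%d}'"
    )
```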
Other things to note
- Each model, when using the delete + insert style of materialization, must not run concurrent batches when batches are configured. Running batches concurrently seemed to cause unpredictable concurrency issues, because the deletions mutate the storage underneath the inserts (see the sketch below).
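For anyone hitting the same thing: as we read the model-kind docs, the batch_concurrency option on the incremental kind should pin a model to a single in-flight batch, e.g.:

```python
from sqlmesh.core.model.kind import ModelKindName

# Kind configuration for a ClickHouse-backed incremental model; values are
# illustrative. batch_concurrency=1 keeps delete mutations from ever
# overlapping with another batch's insert.
kind = dict(
    name=ModelKindName.INCREMENTAL_BY_TIME_RANGE,
    time_column="ds",
    batch_size=7,         # one week per batch to bound memory
    batch_concurrency=1,  # never run two batches at once
)
```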