ClickHouse and SQLMesh real-world usage issues
Let me preface by saying that I understand the ClickHouse engine is still very new.
After considerable work on our side trying to get things working properly with SQLMesh and ClickHouse, we've realized a few things that make the built-in materialization strategies unusable with ClickHouse. We are fine writing our own custom strategy at this time, but we wanted to take the opportunity to offer some insight into what we've seen and how we suggest improving it.
ClickHouse and deletes
As described to us by a ClickHouse engineer, written data is considered "immutable", so deletions are not generally part of the suggested workflow. Normally, when a DELETE command is issued, ClickHouse handles the deletion asynchronously; forcing it to be a synchronous operation is costly for the engine. In our experiments, when trying to backfill a very large model that sources some time series data, we found that the execution would take days to complete. Part of the issue is that we used a batch_size smaller than the full size of the dataset, but this was to reduce the memory requirements on our deployment. Since SQLMesh handles incremental materializations by doing a delete followed by an insert, even when the table is initially empty, the ClickHouse cluster cannot optimize this and the synchronous delete operations still take a long time to return success. If you instead let the deletes run asynchronously and continuously write to the table, you can hit concurrency issues that are not entirely predictable.
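For reference, reproduced by hand the cycle looks roughly like this with the clickhouse-connect client (a minimal sketch; the table, columns, and date range are hypothetical, and mutations_sync=2 is the setting that, as we understand it, makes the mutation block until it has completed on all replicas):

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Step 1: clear the target time interval. ALTER TABLE ... DELETE is a
# mutation; mutations_sync=2 makes the call block until the mutation has
# finished on all replicas -- the synchronous behavior that dominated our
# backfill runtimes.
client.command(
    "ALTER TABLE analytics.events DELETE "
    "WHERE ds BETWEEN '2024-01-01' AND '2024-01-07'",
    settings={"mutations_sync": 2},
)

# Step 2: re-insert the rebuilt interval. Even when the interval is empty
# (e.g. the very first backfill batch), step 1 still pays the mutation cost.
client.command(
    "INSERT INTO analytics.events "
    "SELECT ds, user_id, value FROM analytics.events_staging "
    "WHERE ds BETWEEN '2024-01-01' AND '2024-01-07'"
)
```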
Some suggestions for solutions
- Append-only incremental
  - This would obviously be some kind of opt-in flag, but it would allow the use of table engines like ReplacingMergeTree, or simply speed things up and leave de-duplication to the ClickHouse user (see the sketch after this list).
  - The obvious issues I see with this are:
    - A full restatement may still need to allow deletes, or write to a completely different table and then swap the references in the final schema.
    - A partial restatement would probably still be slow, and one would hope that run-time parameters could be passed to improve the efficiency of the materialization.
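To make the append-only suggestion concrete, here is a minimal sketch using SQLMesh's documented custom materialization API as we understand it (the class name is ours, and insert_append is the engine adapter call we believe performs a plain append):

```python
from __future__ import annotations

import typing as t

from sqlmesh import CustomMaterialization

if t.TYPE_CHECKING:
    from sqlmesh.core.engine_adapter._typing import QueryOrDF
    from sqlmesh.core.model import Model


class AppendOnlyIncremental(CustomMaterialization):
    """Insert-only incremental load: never issues a DELETE, so ClickHouse
    never has to run a synchronous mutation. De-duplication is left to the
    table engine (e.g. ReplacingMergeTree) or to downstream readers."""

    NAME = "append_only_incremental"

    def insert(
        self,
        table_name: str,
        query_or_df: QueryOrDF,
        model: Model,
        is_first_insert: bool,
        **kwargs: t.Any,
    ) -> None:
        # Append every batch as-is; no delete of the target interval first.
        self.adapter.insert_append(table_name, query_or_df)
```

A model would then opt in with `kind CUSTOM (materialization 'append_only_incremental')`, and full restatements would still need one of the escape hatches above (allow deletes, or build into a new table and swap references).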
Some requests
- Allow setting different materialization strategies depending on the dialect in use.
  - We currently have things set up to use DuckDB with local test data to test our models; this is one of the reasons we've been trying to integrate SQLMesh into our data pipeline rather than dbt. For this specific case, even if we wrote a custom materialization strategy for ClickHouse, it would be nice if we could still declare the normal INCREMENTAL_BY_TIME_RANGE kind (see the model sketch after this list).
- Make it easier to subclass the existing materialization strategies
  - Unless SQLMesh can include an "append-only" mode in the incremental-by-time-range strategy itself (something along the lines of the custom materialization sketched above).
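To illustrate what we mean: ideally a model like the following (a minimal Python-model sketch; the model, table, and column names are made up) could keep declaring the standard INCREMENTAL_BY_TIME_RANGE kind, with the dialect-specific strategy resolved by the gateway actually running it:

```python
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model
from sqlmesh.core.model.kind import ModelKindName


@model(
    "analytics.events_by_day",
    kind=dict(
        name=ModelKindName.INCREMENTAL_BY_TIME_RANGE,
        time_column="ds",
    ),
    columns={"ds": "date", "user_id": "int", "value": "double"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # The same definition runs against our local DuckDB gateway (test data)
    # and the ClickHouse production gateway; today the materialization
    # behavior is fixed by the kind rather than by the dialect in use.
    return context.fetchdf(
        f"SELECT ds, user_id, value FROM analytics.raw_events "
        f"WHERE ds BETWEEN '{start:%Y-%m-%d}' AND '{end:%Y-%m-%d}'"
    )
```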
Other things to note
- Each model, when using the delete + insert style of materialization, must not run concurrent batches when batches are configured. Running batches concurrently seemed to cause unpredictable concurrency issues, because the deletions mutate the storage underneath the inserts (see the sketch below).
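For anyone hitting the same thing: as we read the model-kind docs, the batch_concurrency option on the incremental kind should pin a model to a single in-flight batch, e.g.:

```python
from sqlmesh.core.model.kind import ModelKindName

# Kind configuration for a ClickHouse-backed incremental model; values are
# illustrative. batch_concurrency=1 keeps delete mutations from ever
# overlapping with another batch's insert.
kind = dict(
    name=ModelKindName.INCREMENTAL_BY_TIME_RANGE,
    time_column="ds",
    batch_size=7,         # one week per batch to bound memory
    batch_concurrency=1,  # never run two batches at once
)
```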