dlt
dlt copied to clipboard
Feat/2200 add clickhouse distributed support
Description
When Clickhouse is setup with replicas and distributed tables, DDL & DML statements need to be modified:
- ON CLUSTER needs to added after the table name
- A pair of base table and a distributed table needs to be created
- ALTER and DROP need to be executed for both tables, base and distributed
- Deletes need to be done on the base table
- Add GLOBAL to all joins as we can't know which do not needed
This was achieved by adding additional configuration parameters for Clickhouse and modifying queries in the execute_query function in the sql_client file. This way we didn't change any core dlt functionality and all changes are restricted only to clickhouse destination and only in a single function. And for Clickhouse changes will be applied only if the configuration is set for the distributed Clickhouse setup
To test the changes Clickhouse with a few replicated nodes is needed.
Related Issues
- Resolves 2200 Cluster support for Clickhouse
Additional Context
Deploy Preview for dlt-hub-docs canceled.
| Name | Link |
|---|---|
| Latest commit | 10be8b327d558239b8685329a099518d73924ed3 |
| Latest deploy log | https://app.netlify.com/projects/dlt-hub-docs/deploys/68bfdaade6ecd800085a38f3 |
Thx for all the suggestions @rudolfix! I'll have a look at those.
Thx for all the suggestions @rudolfix! I'll have a look at those.
please keep me posted, this PR is pretty cool so if you get stuck or don't have time lmk.
Hey @rudolfix
I adjusted the PR.
- Changed CREATE, ALTER, and DROP statements in functions where they are defined
- Added new config options to clickhouse.md with an additional explanation.
I decided not to change the code within _to_temp_table because it's not only about the engine. The ON CLUSTER clause would also need to be added, but without access to the configured cluster name, it cannot be made generic. I'm also unsure whether ReplicatedMergeTree would work for setups with a single node, as I don't have an instance to test it. This will still be handled in execute_query.
I used SQLGlot where possible. Unfortunately, SQLGlot doesn't fully support every ClickHouse statement, and I clearly stated this in the comments. The solution looks a bit like Frankenstein's monster, combining sqlglot and RegExp. I tested the code with GSheet, Stripe, and Salesforce, and it runs without issue. Let me know if you have any suggestions or questions.
P.S. SQLGlot changes ` to ", I couldn't find a way to change that behaviour
Edit: Latest commit fixes a bug with TRUNCATE statments, they should work on base tables and not distributed tables or the data is not removed (also stated in the comment). Good thing about testing in production is that all bugs are cought... eventually...
Edit 2: I have initially missed that it's not allowed to run updates on the disitributed table, those need to be executed on the base table. This is now adjusted. Sorry for missing this initially. I am now testing merge with scd2 and this came up. Also, I could not find the function which generates update statments. I found gen_update_table_prefix and that one currently doesn't have access to config so I can't make it work for a single node and a disitributed setup. Again, I'm hadling this in the execute_query
I've had to reverse the merge of the devel branch as it caused issues with our pipelines. Most of the issue can be resolved. We could not resolve the issue with sqlglot schema validation when using sqlclient to run a query which queried some tables created by dlt
Any update?
hey @zstipanicev ! we do a cleanup of all community destinations that we are not merging in the core library. they'll get a docs page where we link to your repo. the proposal for the page is here: https://github.com/dlt-hub/dlt/pull/3326
pls tell me
- if you still want us to link to your fork
- if you want to change anything in there (you can also do a PR or request changes to comments)
we'll close this PR soon
What is the status of this ? DLT does not support clickhouse cluster at the moment
@LPauzies supporting cluster requires building a variant of clickhouse destination - queries are very different. we tried to rewrite existing queries with sqlglot but resulting code was hard to maintain (even worse than string replacement). you can find destination in fork connected in this PR but dlt version is AFAIK pretty old.
@LPauzies the version is 1.10, and we are using it in production. @rudolfix the issue I had merging this to a newer version was related to code we had in our Salesforce connector not the Clichhouse destination code. And yes, you cannot at the moment adjust the queries with sqlglot as it doesn't fully support Clickhouse in a distributed setup. We are migrating away from Clickhouse as our main/only DWH so I'm not sure how long I'll be able to support if anyone wants to merge this into the latest version. And if using only sqlglot is a must to merge into main, I have no idea what is the sqlglot plan on supporting the Clickhouse in the distributed setup.