dbt-clickhouse
Cannot add new columns to model in distributed cluster setup
Describe the bug
When I try to add new columns to a model in a distributed setup using ReplicatedMergeTree, the run fails. Running the model for the first time succeeds, and running it again also works. However, adding new columns and running it again produces the following error:
Database Error in model my_test_model (models/example/my_test_model.sql)
  HTTPDriver for http://127.0.0.1:8123 returned response code 500
  Code: 20. DB::Exception: Number of columns doesn't match. (NUMBER_OF_COLUMNS_DOESNT_MATCH) (version 23.8.4.69 (official build))
  compiled Code at target/run/dap_dbt/models/example/my_test_model.sql
Steps to reproduce
I am using the local clickhouse-cluster setup from this repo, with the ClickHouse server version changed to 23.8.4.
dbt_project.yml

models:
  <project-name>:
    example:
      +schema: example_schema
      +materialized: distributed_table
      +engine: ReplicatedMergeTree('/clickhouse_prod/tables/{shard}/{database}/{table}/{uuid}','{replica}')
      +sharding_key: rand()
profiles.yml

clickhouse:
  target: local
  outputs:
    local:
      type: clickhouse
      cluster: company_cluster
      cluster_mode: true
      host: 127.0.0.1
      port: 8123
      user: <user>
      password: <pw>
      secure: False
      local_suffix: '_base'
- Install dbt and the dbt-clickhouse driver
- Create a dbt project with the dbt init command
- Add a new model to the example models (a full model file sketch follows this list):
  SELECT
      1 AS col1
- Run dbt run
- Add a new column: change the model to
  SELECT
      1 AS col1,
      2 AS col2
- Run dbt run
- Now you should see the error
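For completeness, a minimal sketch of the model file in its second state. The path models/example/my_test_model.sql matches the error output above; the config() block simply mirrors the dbt_project.yml settings and could be omitted:

```sql
-- models/example/my_test_model.sql (second revision; the first revision selected only col1)
-- The config() call duplicates what dbt_project.yml already sets and is shown
-- here only to make the example self-contained.
{{ config(
    materialized='distributed_table',
    sharding_key='rand()'
) }}

SELECT
    1 AS col1,
    2 AS col2
```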
Expected behaviour
It should be possible to modify existing data models (e.g. add new columns) and run them again.
Code examples, such as models or profile settings
For the code, see the reproducible example above.
dbt and/or ClickHouse server logs
[0m10:20:18.151952 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10503bee0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x11034f280>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x11032ab80>]}
============================== 10:20:18.154455 | 58a4a500-b3ac-4be4-aeb6-06b20b9c6195 ==============================
[0m10:20:18.154455 [info ] [MainThread]: Running with dbt=1.7.4
[0m10:20:18.154934 [debug] [MainThread]: running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'log_cache_events': 'False', 'write_json': 'True', 'partial_parse': 'True', 'cache_selected_only': 'False', 'warn_error': 'None', 'version_check': 'True', 'fail_fast': 'False', 'log_path': '/Users/user/scr/foo/dbt/logs', 'debug': 'False', 'profiles_dir': '/Users/user/scr/foo/dbt', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'static_parser': 'True', 'introspect': 'True', 'invocation_command': 'dbt run', 'target_path': 'None', 'log_format': 'default', 'send_anonymous_usage_stats': 'True'}
[0m10:20:18.255659 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'project_id', 'label': '58a4a500-b3ac-4be4-aeb6-06b20b9c6195', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x110336d60>]}
[0m10:20:18.301238 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'adapter_info', 'label': '58a4a500-b3ac-4be4-aeb6-06b20b9c6195', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1104339a0>]}
[0m10:20:18.301903 [info ] [MainThread]: Registered adapter: clickhouse=1.7.1
[0m10:20:18.311209 [debug] [MainThread]: checksum: 248e8aba381ba2d577dce7ab50d2010c2aac45d2b27e6e6214cdd96d2d41fabc, vars: {}, profile: , target: , version: 1.7.4
[0m10:20:18.330665 [debug] [MainThread]: Partial parsing enabled: 2 files deleted, 0 files added, 0 files changed.
[0m10:20:18.331118 [debug] [MainThread]: Partial parsing: deleted file: dap_dbt://models/example/my_second_dbt_model.sql
[0m10:20:18.331475 [debug] [MainThread]: Partial parsing: deleted file: dap_dbt://models/example/my_first_dbt_model.sql
[0m10:20:18.360316 [warn ] [MainThread]: [[33mWARNING[0m]: Did not find matching node for patch with name 'my_second_dbt_model' in the 'models' section of file 'models/example/schema.yml'
[0m10:20:18.361077 [warn ] [MainThread]: [[33mWARNING[0m]: Did not find matching node for patch with name 'my_first_dbt_model' in the 'models' section of file 'models/example/schema.yml'
[0m10:20:18.367090 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'load_project', 'label': '58a4a500-b3ac-4be4-aeb6-06b20b9c6195', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x111a910d0>]}
[0m10:20:18.373093 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': '58a4a500-b3ac-4be4-aeb6-06b20b9c6195', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1119cceb0>]}
[0m10:20:18.373573 [info ] [MainThread]: Found 4 tests, 1 model, 3 sources, 0 exposures, 0 metrics, 424 macros, 0 groups, 0 semantic models
[0m10:20:18.373962 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '58a4a500-b3ac-4be4-aeb6-06b20b9c6195', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1119ccdc0>]}
[0m10:20:18.375014 [info ] [MainThread]:
[0m10:20:18.375619 [debug] [MainThread]: Acquiring new clickhouse connection 'master'
[0m10:20:18.376378 [debug] [ThreadPool]: Acquiring new clickhouse connection 'list_'
[0m10:20:18.408273 [debug] [ThreadPool]: Opening a new connection, currently in state init
[0m10:20:18.588183 [debug] [ThreadPool]: dbt_clickhouse adapter: On list_: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "connection_name": "list_"} */
select name from system.databases
...
[0m10:20:18.592411 [debug] [ThreadPool]: dbt_clickhouse adapter: SQL status: OK in 0.00 seconds
[0m10:20:18.596395 [debug] [ThreadPool]: Re-using an available connection from the pool (formerly list_, now list__dap_test)
[0m10:20:18.605721 [debug] [ThreadPool]: dbt_clickhouse adapter: On list__dap_test: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "connection_name": "list__dap_test"} */
select
t.name as name,
t.database as schema,
if(engine not in ('MaterializedView', 'View'), 'table', 'view') as type,
db.engine as db_engine,count(distinct _shard_num) > 1 as is_on_cluster
from clusterAllReplicas("company_cluster", system.tables) as t
join system.databases as db on t.database = db.name
where schema = 'dap_test'
group by name, schema, type, db_engine
...
[0m10:20:18.621645 [debug] [ThreadPool]: dbt_clickhouse adapter: SQL status: OK in 0.02 seconds
[0m10:20:18.628407 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '58a4a500-b3ac-4be4-aeb6-06b20b9c6195', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x11054e670>]}
[0m10:20:18.629964 [info ] [MainThread]: Concurrency: 1 threads (target='local')
[0m10:20:18.630826 [info ] [MainThread]:
[0m10:20:18.633613 [debug] [Thread-1 ]: Began running node model.dap_dbt.my_test_model
[0m10:20:18.634713 [info ] [Thread-1 ]: 1 of 1 START sql distributed_table model `dap_test`.`my_test_model` ............ [RUN]
[0m10:20:18.635406 [debug] [Thread-1 ]: Re-using an available connection from the pool (formerly list__dap_test, now model.dap_dbt.my_test_model)
[0m10:20:18.635833 [debug] [Thread-1 ]: Began compiling node model.dap_dbt.my_test_model
[0m10:20:18.640441 [debug] [Thread-1 ]: Writing injected SQL for node "model.dap_dbt.my_test_model"
[0m10:20:18.641058 [debug] [Thread-1 ]: Timing info for model.dap_dbt.my_test_model (compile): 10:20:18.636082 => 10:20:18.640876
[0m10:20:18.641423 [debug] [Thread-1 ]: Began executing node model.dap_dbt.my_test_model
[0m10:20:18.656704 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.dap_dbt.my_test_model: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
SELECT value FROM system.settings WHERE name = 'insert_distributed_sync'
...
[0m10:20:18.662288 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.01 seconds
[0m10:20:18.674479 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.dap_dbt.my_test_model: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
drop table if exists `dap_test`.`my_test_model_base__dbt_backup`
ON CLUSTER "company_cluster"
SYNC...
[0m10:20:18.793194 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.12 seconds
[0m10:20:18.795789 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.dap_dbt.my_test_model: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
drop table if exists `dap_test`.`my_test_model__dbt_tmp`
ON CLUSTER "company_cluster"
SYNC...
[0m10:20:18.907635 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.11 seconds
[0m10:20:18.913208 [debug] [Thread-1 ]: Writing runtime sql for node "model.dap_dbt.my_test_model"
[0m10:20:18.913923 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.dap_dbt.my_test_model: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
create view `dap_test`.`my_test_model__dbt_tmp`
ON CLUSTER "company_cluster"
as (
SELECT
1 AS col1,
2 AS col2
)
...
[0m10:20:19.024544 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.11 seconds
[0m10:20:19.031654 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.dap_dbt.my_test_model: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
select name, type from system.columns where table = 'my_test_model__dbt_tmp'
and database = 'dap_test'
order by position
...
[0m10:20:19.034340 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.00 seconds
[0m10:20:19.042311 [debug] [Thread-1 ]: Writing runtime sql for node "model.dap_dbt.my_test_model"
[0m10:20:19.042978 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.dap_dbt.my_test_model: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
create table `dap_test`.`my_test_model_base__dbt_backup`
ON CLUSTER "company_cluster" (
col1 UInt8, col2 UInt8
)
engine = ReplicatedMergeTree('/clickhouse_kaj/tables/{shard}/{database}/{table}/{uuid}','{replica}')
order by (tuple())
SETTINGS replicated_deduplication_window=0
...
[0m10:20:19.157311 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.11 seconds
[0m10:20:19.164326 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.dap_dbt.my_test_model: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
SYSTEM SYNC REPLICA
ON CLUSTER "company_cluster" dap_test.my_test_model_base
...
[0m10:20:19.280411 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.12 seconds
[0m10:20:19.283074 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.dap_dbt.my_test_model: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
EXCHANGE TABLES `dap_test`.`my_test_model_base__dbt_backup` AND `dap_test`.`my_test_model_base`
ON CLUSTER "company_cluster"
...
[0m10:20:19.406989 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.12 seconds
[0m10:20:19.416056 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.dap_dbt.my_test_model: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
select name, type from system.columns where table = 'my_test_model'
and database = 'dap_test'
order by position
...
[0m10:20:19.425092 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.01 seconds
[0m10:20:19.427366 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.dap_dbt.my_test_model: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
insert into `dap_test`.`my_test_model`
("col1")SELECT
1 AS col1,
2 AS col2
...
[0m10:20:19.433975 [debug] [Thread-1 ]: dbt_clickhouse adapter: Error running SQL: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "clickhouse", "target_name": "local", "node_id": "model.dap_dbt.my_test_model"} */
insert into `dap_test`.`my_test_model`
("col1")SELECT
1 AS col1,
2 AS col2
[0m10:20:19.434827 [debug] [Thread-1 ]: Timing info for model.dap_dbt.my_test_model (execute): 10:20:18.641645 => 10:20:19.434506
[0m10:20:19.461346 [debug] [Thread-1 ]: Database Error in model my_test_model (models/example/my_test_model.sql)
:HTTPDriver for http://127.0.0.1:8123 returned response code 500)
Code: 20. DB::Exception: Number of columns doesn't match. (NUMBER_OF_COLUMNS_DOESNT_MATCH) (version 23.8.4.69 (official build))
compiled Code at target/run/dap_dbt/models/example/my_test_model.sql
[0m10:20:19.461963 [debug] [Thread-1 ]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '58a4a500-b3ac-4be4-aeb6-06b20b9c6195', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x13013f7f0>]}
[0m10:20:19.462649 [error] [Thread-1 ]: 1 of 1 ERROR creating sql distributed_table model `dap_test`.`my_test_model` ... [[31mERROR[0m in 0.83s]
[0m10:20:19.463311 [debug] [Thread-1 ]: Finished running node model.dap_dbt.my_test_model
[0m10:20:19.464772 [debug] [MainThread]: Connection 'master' was properly closed.
[0m10:20:19.465221 [debug] [MainThread]: Connection 'model.dap_dbt.my_test_model' was left open.
[0m10:20:19.465659 [debug] [MainThread]: On model.dap_dbt.my_test_model: Close
[0m10:20:19.466133 [info ] [MainThread]:
[0m10:20:19.466592 [info ] [MainThread]: Finished running 1 distributed_table model in 0 hours 0 minutes and 1.09 seconds (1.09s).
[0m10:20:19.467215 [debug] [MainThread]: Command end result
[0m10:20:19.514420 [info ] [MainThread]:
[0m10:20:19.515006 [info ] [MainThread]: [31mCompleted with 1 error and 0 warnings:[0m
[0m10:20:19.515365 [info ] [MainThread]:
[0m10:20:19.515726 [error] [MainThread]: Database Error in model my_test_model (models/example/my_test_model.sql)
:HTTPDriver for http://127.0.0.1:8123 returned response code 500)
Code: 20. DB::Exception: Number of columns doesn't match. (NUMBER_OF_COLUMNS_DOESNT_MATCH) (version 23.8.4.69 (official build))
compiled Code at target/run/dap_dbt/models/example/my_test_model.sql
[0m10:20:19.516077 [info ] [MainThread]:
[0m10:20:19.516434 [info ] [MainThread]: Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
[0m10:20:19.517919 [debug] [MainThread]: Resource report: {"command_name": "run", "command_wall_clock_time": 1.3992585, "process_user_time": 2.46728, "process_kernel_time": 1.179818, "process_mem_max_rss": "154533888", "command_success": false, "process_in_blocks": "0", "process_out_blocks": "0"}
[0m10:20:19.518379 [debug] [MainThread]: Command `dbt run` failed at 10:20:19.518264 after 1.40 seconds
[0m10:20:19.518743 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10503bee0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x130067490>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x110336d60>]}
[0m10:20:19.519107 [debug] [MainThread]: Flushing usage events
Configuration
Environment
- dbt version: 1.7.3
- dbt-clickhouse version: 1.7.0
- clickhouse-driver version (if using native): 0.2.6
- clickhouse-connect version (if using http): 0.6.23
- Python version: 3.9.13
- Operating system: macOS
ClickHouse server
See the cluster setup information above.
I did some further investigation and found the following:

When

EXCHANGE TABLES `dap_test`.`my_test_model_base__dbt_backup` AND `dap_test`.`my_test_model_base` ON CLUSTER "company_cluster";

is run, the tables seem to be switched based on their definitions. The distributed table also still points to the my_test_model_base table as before (but we expect this one to be updated). However, if I run SELECT * FROM dap_test.my_test_model_base, I get only 1 column. The same happens for SELECT * FROM dap_test.my_test_model. If I run SELECT * FROM dap_test.my_test_model_base__dbt_backup, however, I get the 2 columns.
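For anyone reproducing this, the checks above can be run directly against the cluster. A minimal sketch, using the database and table names from this issue (the expected results in the comments reflect my observations after the EXCHANGE TABLES step):

```sql
-- Inspect what the Distributed table currently points at
SHOW CREATE TABLE dap_test.my_test_model;

-- Compare the column lists of the three tables involved
SELECT table, name, type
FROM system.columns
WHERE database = 'dap_test'
  AND table IN ('my_test_model', 'my_test_model_base', 'my_test_model_base__dbt_backup')
ORDER BY table, position;

-- Observed after EXCHANGE TABLES:
--   my_test_model and my_test_model_base      -> only col1
--   my_test_model_base__dbt_backup            -> col1 and col2
SELECT * FROM dap_test.my_test_model_base;
SELECT * FROM dap_test.my_test_model;
SELECT * FROM dap_test.my_test_model_base__dbt_backup;
```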
Potential solution?
When I redefine the my_test_model table after the EXCHANGE TABLES
step with the following statement, I get the correct columns:
create or replace table `dap_test`.`my_test_model`
ON CLUSTER "prod" as `dap_test`.`my_test_model_base`
ENGINE = Distributed('prod', 'dap_test', 'my_test_model_base', rand());
@genzgd I created a small PR with a change that fixed it for me locally: https://github.com/ClickHouse/dbt-clickhouse/pull/230 Maybe that can help with figuring out a solution. I am happy to help, but I have hardly any context on dbt adapters.
When updating local tables underneath a distributed table, one should drop and refresh the distributed table in ClickHouse. Distributed table support was only added to this adapter recently and, at least until now, has been experimental. Is there any step you found that would drop and recreate the Distributed table, @Somtom?
Distributed tables don't store any data, and since they point at the schema of a local table they need to be refreshed (as you just did) whenever the underlying local table changes. However, there isn't really a problem with replacing the distributed tables each time.
I guess a dbt full refresh of the table would also fix the issue (assuming it's not incremental). And if I am right, dropping the distributed table in a pre-hook should also work? See the sketch below.
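Untested, but something along these lines is what I have in mind. The database, table, and cluster names are taken from the reproduction above, and it assumes the distributed_table materialization recreates the Distributed table when it is missing:

```sql
-- models/example/my_test_model.sql (untested sketch)
-- Drop the Distributed table before the model runs so that, assuming the
-- materialization recreates it, it picks up the new schema of the local
-- `_base` table. `{{ this }}` could be used instead of the explicit name.
{{ config(
    materialized='distributed_table',
    pre_hook=[
        'DROP TABLE IF EXISTS `dap_test`.`my_test_model` ON CLUSTER "company_cluster" SYNC'
    ]
) }}

SELECT
    1 AS col1,
    2 AS col2
```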
@emirkmo Unfortunately, I did not look into dropping and recreating the distributed table. I also don't work with that setup anymore, so it is hard for me to look into it.
This appears to be fixed by #230 without the need to drop and recreate the distributed tables, but I could see it being reopened. Unfortunately, we don't currently have the resources to fully investigate issues with community-contributed experimental features.