Add new JSON type.
Add support for the new JSON type. My company uses this driver library to interact with ClickHouse, and we rely heavily on the JSON type. Any help would be greatly appreciated, since I don't know much about writing database drivers.
- fixes #473
- fixes #460
Checklist:
- [x] Add tests that demonstrate the correct behavior of the change. Tests should fail without the change.
- [x] Add or update relevant docs, in the docs folder and in code.
- [x] Ensure PR doesn't contain untouched code reformatting: spaces, etc.
- [x] Run `flake8` and fix issues.
- [x] Run `pytest`, no tests failed. See https://clickhouse-driver.readthedocs.io/en/latest/development.html.
@xzkostyan I've marked the PR as ready, please review.
Not ready yet: after some testing with our software we found that this column has trouble parsing Array(JSON) and Array(Array) values. I'll fix this in the next few days; the majority of the code should still be good for review.
Seems to be ready now; I added additional tests checking compatibility with the CLI.
There's a problem: it's not compatible with the CLI when using array-nested dictionaries more than five layers deep. I probably won't add this functionality, since it's too niche a use case.
The problem was the inability to deserialize shared JSON data when max_dynamic_paths and max_dynamic_types are exceeded. It uses a separate, undocumented format that I sadly don't have time to work on right now. The dynamic JSON serialization I implemented seems to be fully compatible with itself and works well for objects inserted with other libraries/clients that don't exceed the dynamic paths limit (1024 dynamic paths by default, divided by 4 at every level of array-nested dictionaries). Requesting review @xzkostyan
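If you want to check whether your data actually stays under that limit before relying on this branch, ClickHouse's JSON introspection functions can help. A rough sketch (host, table name `my_table` and column name `json_col` are placeholders, and `JSONSharedDataPaths` is a server-side function, so double-check it against your ClickHouse version):

```python
# Rough sketch, not part of this PR: count rows whose JSON paths spilled into
# the shared-data representation that this branch cannot deserialize yet.
# Host, table name ("my_table") and column name ("json_col") are placeholders.
from clickhouse_driver import Client

client = Client(host='localhost')

shared_rows = client.execute(
    'SELECT count() FROM my_table '
    'WHERE notEmpty(JSONSharedDataPaths(json_col))'
)
print(shared_rows)  # [(0,)] means every row stays within the dynamic paths limit
```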
Any indication on when this will be merged?
+1
Any update on this? This is a big blocker for me currently.
Not yet, the maintainer has to review the PR. You can try this out for now and let me know if it works.
$ pip uninstall clickhouse-driver
$ pip install git+https://github.com/gsergey418/clickhouse-driver@newjson#egg=clickhouse-driver
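A quick smoke test along these lines should be enough to tell whether the branch works in your setup. Host, table name and sample values below are just placeholders; the dict-based insert mirrors what the new tests in this PR do:

```python
# Minimal smoke test for the JSON column support in this branch; host, table
# name and sample values are placeholders.
from clickhouse_driver import Client

# On older servers the new JSON type may still be gated behind a setting;
# harmless if it's already generally available on your version.
client = Client(host='localhost',
                settings={'allow_experimental_json_type': 1})

client.execute('DROP TABLE IF EXISTS json_smoke_test')
client.execute(
    'CREATE TABLE json_smoke_test (id UInt64, data JSON) '
    'ENGINE = MergeTree ORDER BY id'
)

# JSON values are passed as Python dicts, the same way the tests added in
# this PR do it.
client.execute(
    'INSERT INTO json_smoke_test (id, data) VALUES',
    [(1, {'env': 'prod', 'http': {'status': 200}}),
     (2, {'env': 'dev', 'version': 17})],
)

for row in client.execute('SELECT id, data FROM json_smoke_test ORDER BY id'):
    print(row)
```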
+1
In my test case, it works. Thanks for the PR. 👍
clickhouse-driver execute result
/Users/yuangezhizao/Documents/GitLab/<rm>-capacity-poc/.venv/bin/python /Users/yuangezhizao/Documents/GitLab/<rm>-capacity-poc/<rm>_capacity_poc/clickhouse/test_new_json_type.py
DEBUG:clickhouse_driver.connection:Connecting. Database: default. User: default
DEBUG:clickhouse_driver.connection:Connecting to <rm>:9001
DEBUG:clickhouse_driver.connection:Connected to ClickHouse server version 25.4.2, revision: 54477
DEBUG:clickhouse_driver.connection:Query: DROP TABLE IF EXISTS `default`.test_new_json_type
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000023
INFO:__main__:Table deleted
DEBUG:clickhouse_driver.connection:Query:
CREATE TABLE `default`.test_new_json_type
(
`id` UInt64,
`json_data` JSON()
)
ENGINE = MergeTree()
ORDER BY id;
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000024
INFO:__main__:Table created
DEBUG:clickhouse_driver.connection:Query: INSERT INTO `default`.test_new_json_type (id, json_data) VALUES
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000028
DEBUG:clickhouse_driver.streams.native:Writing column id
DEBUG:clickhouse_driver.streams.native:Writing column json_data
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000856
DEBUG:clickhouse_driver.streams.native:Reading column host_name
DEBUG:clickhouse_driver.streams.native:Reading column current_time
DEBUG:tzlocal:/etc/localtime found
DEBUG:tzlocal:1 found:
{'/etc/localtime is a symlink to': 'Asia/Shanghai'}
DEBUG:clickhouse_driver.streams.native:Reading column thread_id
DEBUG:clickhouse_driver.streams.native:Reading column type
DEBUG:clickhouse_driver.streams.native:Reading column name
DEBUG:clickhouse_driver.streams.native:Reading column value
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000056
DEBUG:clickhouse_driver.streams.native:Reading column host_name
DEBUG:clickhouse_driver.streams.native:Reading column current_time
DEBUG:clickhouse_driver.streams.native:Reading column thread_id
DEBUG:clickhouse_driver.streams.native:Reading column type
DEBUG:clickhouse_driver.streams.native:Reading column name
DEBUG:clickhouse_driver.streams.native:Reading column value
INFO:__main__:Data created
DEBUG:clickhouse_driver.connection:Query: SELECT id, JSONExtractKeysAndValuesRaw(toJSONString(json_data)) FROM `default`.test_new_json_type
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000015
DEBUG:clickhouse_driver.streams.native:Reading column id
DEBUG:clickhouse_driver.streams.native:Reading column JSONExtractKeysAndValuesRaw(toJSONString(json_data))
DEBUG:clickhouse_driver.streams.native:Reading column host_name
DEBUG:clickhouse_driver.streams.native:Reading column current_time
DEBUG:clickhouse_driver.streams.native:Reading column thread_id
DEBUG:clickhouse_driver.streams.native:Reading column type
DEBUG:clickhouse_driver.streams.native:Reading column name
DEBUG:clickhouse_driver.streams.native:Reading column value
INFO:__main__:ID: 1, Extracted JSON: [('attr', '{"attr16":"100"}')]
INFO:__main__:ID: 2, Extracted JSON: [('attr', '{"attr16":"string"}')]
INFO:__main__:ID: 3, Extracted JSON: [('attr', '{"attr16":[["aaaaa","bbbbb"]]}')]
INFO:__main__:ID: 4, Extracted JSON: [('attr', '{"attr16":[["ccccc"],"aaaaa","bbbbb"]}')]
INFO:__main__:ID: 5, Extracted JSON: [('attr', '{"attr_test":[["ccccc"],"aaaaa","bbbbb"]}')]
INFO:__main__:Data selected
(Before / After screenshots)
@yuangezhizao Thanks for the test! Glad to hear it works
@gsergey418 thanks for the PR. It works smoothly with JSON types, but with Array(JSON) I get an error on insert:
DB::Exception: Array sizes are too large: 7164775599414010632
even if I try to insert empty arrays. Sample table:
CREATE TABLE gcp_billing_test (
user_id Int64,
cost Float64,
resource JSON DEFAULT '{}',
service JSON DEFAULT '{}',
labels Array(JSON) DEFAULT [],
credits Array(JSON) DEFAULT []
)
ENGINE = MergeTree
PARTITION BY (user_id)
ORDER BY (user_id)
sample insert:
INSERT INTO gcp_billing_test (
user_id,
cost,
resource,
service,
labels,
credits
)
VALUES (
123456789,
42.75,
'{"type": "compute", "location": "us-central1"}',
'{"name": "Compute Engine", "category": "IaaS"}',
[
'{"env": "prod"}',
'{"team": "infra"}'
],
[
'{"name": "Sustained Use Discount", "amount": 5.25}',
'{"name": "Committed Use Discount", "amount": 3.10}'
]
);
stack trace:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/dbapi/cursor.py", line 111, in execute
response = execute(
File "/usr/local/lib/python3.10/site-packages/clickhouse_backend/driver/client.py", line 44, in execute
rv = self.process_insert_query(
File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/client.py", line 607, in process_insert_query
rv = self.send_data(sample_block, data,
File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/client.py", line 665, in send_data
self.receive_profile_events()
File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/client.py", line 747, in receive_profile_events
raise packet.exception
clickhouse_driver.errors.ServerException: Code: 128.
DB::Exception: Array sizes are too large: 7164775599414010632
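For reference, a stripped-down reproduction that goes through clickhouse-driver's native client directly (rather than the clickhouse_backend wrapper in the traceback) would look roughly like this; table name and values are placeholders, and I'd expect it to hit the same exception:

```python
# Minimal Array(JSON) insert through clickhouse_driver itself; table name and
# values are placeholders.
from clickhouse_driver import Client

client = Client(host='localhost')

client.execute('DROP TABLE IF EXISTS array_json_repro')
client.execute(
    'CREATE TABLE array_json_repro (id UInt64, labels Array(JSON)) '
    'ENGINE = MergeTree ORDER BY id'
)

# A list of dicts per row for the Array(JSON) column; on the failing setup
# above, even an empty list raises "Array sizes are too large".
client.execute(
    'INSERT INTO array_json_repro (id, labels) VALUES',
    [(1, [{'env': 'prod'}, {'team': 'infra'}]),
     (2, [])],
)
```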
Thanks for the PR. We really need this supported officially, as we rely heavily on the JSON type.
I would also like to voice my support for this PR (or any other PR that can implement JSON support for this library). clickhouse-connect offers JSON support, but it isn't able to automatically transform null values into default values for non-nullable columns. clickhouse-driver does offer that feature, but doesn't support JSON.
When I tested this PR, I hit one issue: when the JSON contains dynamic keys (like user-defined keys), the SELECT can fail if the data part contains more than 128 distinct keys.
My use case is JSON fields that store the attributes of a log line. Each log line's attributes can be user-defined (not a fixed set of keys; e.g. one line could have http.response.status = 200, another postgresql.version = 17). As a result, all log lines collectively can have more than 128 keys.
The following test exhibits the bug:
def test_json_13(self):
    """
    Reproduce an issue when we read a JSON column that had more than 127
    distinct keys across all rows.
    """
    # The issue doesn't occur with the Memory engine, so use MergeTree.
    create_table_template = 'CREATE TABLE test ({}) ENGINE = MergeTree ORDER BY tuple()'
    with self.create_table("a JSON", template=create_table_template):
        unique_keys = [
            f"unique_key_{idx}"
            for idx in range(128)
        ]
        # There is also an issue when we try to insert more than 127 distinct keys in a
        # single query, but that isn't what this test shows, so send the unique keys in
        # multiple inserts.
        max_batch_size = 64
        batches = [
            unique_keys[i:i + max_batch_size]
            for i in range(0, len(unique_keys), max_batch_size)
        ]
        for batch in batches:
            data = [
                ({key: 1},)
                for key in batch
            ]
            self.client.execute("INSERT INTO test (a) VALUES", data)
        # We submitted multiple inserts (which create multiple data parts). Each of these
        # data parts contains only max_batch_size distinct keys. Once merged into a single
        # data part, the result contains all unique_keys, which exceeds the threshold of
        # ~120 unique keys.
        # Force a merge of the data parts into a single one.
        self.client.execute("OPTIMIZE TABLE test FINAL")
        query = "SELECT * FROM test"
        result = self.client.execute(query)
        result_keys = {key for row in result for key in row[0].keys()}
        self.assertEqual(result_keys, set(unique_keys))
Any updates on this?
When will this get merged?