
Add new JSON type.

Open gsergey418 opened this issue 9 months ago • 18 comments

Add support for the new JSON type. My company uses this driver lib to interact with CH and we rely heavily on the JSON type. Any help would be greatly appreciated since I don't know much about writing database drivers.

  • fixes #473
  • fixes #460

Checklist:

  • [x] Add tests that demonstrate the correct behavior of the change. Tests should fail without the change.
  • [x] Add or update relevant docs, in the docs folder and in code.
  • [x] Ensure PR doesn't contain untouched code reformatting: spaces, etc.
  • [x] Run flake8 and fix issues.
  • [x] Run pytest; no tests failed. See https://clickhouse-driver.readthedocs.io/en/latest/development.html.

gsergey418 avatar Feb 17 '25 14:02 gsergey418

@xzkostyan I've marked the PR as ready, please review.

gsergey418 avatar Feb 27 '25 14:02 gsergey418

Not ready yet: after some testing with our software we found that this column type has trouble parsing Array(JSON) and Array(Array) values. I'll fix this in the next few days; the majority of the code should still be good for review.

gsergey418 avatar Feb 28 '25 09:02 gsergey418

Seems to be ready now; I added additional tests checking compatibility with the CLI.

gsergey418 avatar Feb 28 '25 14:02 gsergey418

There's one remaining problem: it's not compatible with the CLI when array-nested dictionaries go beyond five layers of nesting. I probably won't add that functionality since it's too niche a use case.

gsergey418 avatar Feb 28 '25 14:02 gsergey418

The problem was the inability to deserialize shared JSON data when max_dynamic_paths and max_dynamic_types are exceeded. That case uses a separate, undocumented format which I sadly don't have time to work on right now. The dynamic JSON serialization I implemented seems to be fully compatible with itself, and it works well for objects inserted with other libraries/clients as long as they don't exceed the dynamic paths limit (1024 dynamic paths by default, divided by 4 at each level of array-nested dictionary). Requesting review @xzkostyan
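
For anyone who wants to try it before review, here's roughly how this branch is exercised (a minimal sketch based on the PR's tests; the table name and the allow_experimental_json_type setting are my own additions and may not be needed on every server version):

from clickhouse_driver import Client

client = Client('localhost', settings={'allow_experimental_json_type': 1})

client.execute('DROP TABLE IF EXISTS json_demo')
client.execute(
    'CREATE TABLE json_demo (id UInt64, data JSON) '
    'ENGINE = MergeTree ORDER BY id'
)

# Python dicts are written into the JSON column; rows may use different
# dynamic paths.
rows = [
    (1, {'attr': {'attr16': '100'}}),
    (2, {'attr': {'attr16': ['aaaaa', 'bbbbb']}}),
]
client.execute('INSERT INTO json_demo (id, data) VALUES', rows)

# Reading the column back returns Python objects again.
print(client.execute('SELECT id, data FROM json_demo ORDER BY id'))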

gsergey418 avatar Mar 04 '25 14:03 gsergey418

Any indication on when this will be merged?

jhKessler avatar Mar 12 '25 19:03 jhKessler

+1

KingMohan45 avatar Mar 26 '25 10:03 KingMohan45

Any update on this? This is a big blocker for me currently.

chrisgoddard avatar Apr 03 '25 20:04 chrisgoddard

Any update on this? This is a big blocker for me currently.

Not yet, the maintainer has to review the PR. You can try this out for now and let me know if it works.

$ pip uninstall clickhouse-driver
$ pip install git+https://github.com/gsergey418/clickhouse-driver@newjson#egg=clickhouse-driver

gsergey418 avatar Apr 06 '25 07:04 gsergey418

+1

etos avatar May 05 '25 17:05 etos

Any update on this? This is a big blocker for me currently.

Not yet, the maintainer has to review the PR. You can try this out for now and let me know if it works.

$ pip uninstall clickhouse-driver
$ pip install git+https://github.com/gsergey418/clickhouse-driver@newjson#egg=clickhouse-driver

In my test case, it works. Thanks for the PR. 👍

clickhouse-driver execute result

/Users/yuangezhizao/Documents/GitLab/<rm>-capacity-poc/.venv/bin/python /Users/yuangezhizao/Documents/GitLab/<rm>-capacity-poc/<rm>_capacity_poc/clickhouse/test_new_json_type.py 
DEBUG:clickhouse_driver.connection:Connecting. Database: default. User: default
DEBUG:clickhouse_driver.connection:Connecting to <rm>:9001
DEBUG:clickhouse_driver.connection:Connected to ClickHouse server version 25.4.2, revision: 54477
DEBUG:clickhouse_driver.connection:Query: DROP TABLE IF EXISTS `default`.test_new_json_type
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000023
INFO:__main__:Table deleted
DEBUG:clickhouse_driver.connection:Query: 
                CREATE TABLE `default`.test_new_json_type
                (
                    `id` UInt64,
                    `json_data` JSON()
                )
                ENGINE = MergeTree()
                ORDER BY id;
                
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000024
INFO:__main__:Table created
DEBUG:clickhouse_driver.connection:Query: INSERT INTO `default`.test_new_json_type (id, json_data) VALUES
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000028
DEBUG:clickhouse_driver.streams.native:Writing column id
DEBUG:clickhouse_driver.streams.native:Writing column json_data
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000856
DEBUG:clickhouse_driver.streams.native:Reading column host_name
DEBUG:clickhouse_driver.streams.native:Reading column current_time
DEBUG:tzlocal:/etc/localtime found
DEBUG:tzlocal:1 found:
 {'/etc/localtime is a symlink to': 'Asia/Shanghai'}
DEBUG:clickhouse_driver.streams.native:Reading column thread_id
DEBUG:clickhouse_driver.streams.native:Reading column type
DEBUG:clickhouse_driver.streams.native:Reading column name
DEBUG:clickhouse_driver.streams.native:Reading column value
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000056
DEBUG:clickhouse_driver.streams.native:Reading column host_name
DEBUG:clickhouse_driver.streams.native:Reading column current_time
DEBUG:clickhouse_driver.streams.native:Reading column thread_id
DEBUG:clickhouse_driver.streams.native:Reading column type
DEBUG:clickhouse_driver.streams.native:Reading column name
DEBUG:clickhouse_driver.streams.native:Reading column value
INFO:__main__:Data created
DEBUG:clickhouse_driver.connection:Query: SELECT id, JSONExtractKeysAndValuesRaw(toJSONString(json_data)) FROM `default`.test_new_json_type
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000015
DEBUG:clickhouse_driver.streams.native:Reading column id
DEBUG:clickhouse_driver.streams.native:Reading column JSONExtractKeysAndValuesRaw(toJSONString(json_data))
DEBUG:clickhouse_driver.streams.native:Reading column host_name
DEBUG:clickhouse_driver.streams.native:Reading column current_time
DEBUG:clickhouse_driver.streams.native:Reading column thread_id
DEBUG:clickhouse_driver.streams.native:Reading column type
DEBUG:clickhouse_driver.streams.native:Reading column name
DEBUG:clickhouse_driver.streams.native:Reading column value
INFO:__main__:ID: 1, Extracted JSON: [('attr', '{"attr16":"100"}')]
INFO:__main__:ID: 2, Extracted JSON: [('attr', '{"attr16":"string"}')]
INFO:__main__:ID: 3, Extracted JSON: [('attr', '{"attr16":[["aaaaa","bbbbb"]]}')]
INFO:__main__:ID: 4, Extracted JSON: [('attr', '{"attr16":[["ccccc"],"aaaaa","bbbbb"]}')]
INFO:__main__:ID: 5, Extracted JSON: [('attr', '{"attr_test":[["ccccc"],"aaaaa","bbbbb"]}')]
INFO:__main__:Data selected

Before

[screenshot]

After

[screenshot]

yuangezhizao avatar May 19 '25 10:05 yuangezhizao

@yuangezhizao Thanks for the test! Glad to hear it works

gsergey418 avatar May 19 '25 11:05 gsergey418

@gsergey418 thanks for the PR. It works smoothly with JSON types, but with Array(JSON) I get an error on insert:

DB::Exception: Array sizes are too large: 7164775599414010632

This happens even if I try to insert empty arrays. Sample table:

CREATE TABLE gcp_billing_test (
  user_id Int64,
  cost Float64,
  resource JSON DEFAULT '{}',
  service JSON DEFAULT '{}',
  labels Array(JSON) DEFAULT [],
  credits Array(JSON) DEFAULT []
)
ENGINE = MergeTree
PARTITION BY (user_id)
ORDER BY (user_id)

sample insert:

INSERT INTO gcp_billing_test (
  user_id,
  cost,
  resource,
  service,
  labels,
  credits
)
VALUES (
  123456789,
  42.75,
  '{"type": "compute", "location": "us-central1"}',
  '{"name": "Compute Engine", "category": "IaaS"}',
  [
    '{"env": "prod"}',
    '{"team": "infra"}'
  ],
  [
    '{"name": "Sustained Use Discount", "amount": 5.25}',
    '{"name": "Committed Use Discount", "amount": 3.10}'
  ]
);

stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/dbapi/cursor.py", line 111, in execute
    response = execute(
  File "/usr/local/lib/python3.10/site-packages/clickhouse_backend/driver/client.py", line 44, in execute
    rv = self.process_insert_query(
  File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/client.py", line 607, in process_insert_query
    rv = self.send_data(sample_block, data,
  File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/client.py", line 665, in send_data
    self.receive_profile_events()
  File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/client.py", line 747, in receive_profile_events
    raise packet.exception
clickhouse_driver.errors.ServerException: Code: 128.
DB::Exception: Array sizes are too large: 7164775599414010632
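
In case it helps narrow it down, this is roughly how the insert looks when issued through clickhouse_driver.Client directly (a sketch; I'm assuming dicts and lists of dicts are the intended Python-side representation for JSON and Array(JSON), and the actual failure in my setup goes through the clickhouse_backend DB-API wrapper shown in the traceback):

from clickhouse_driver import Client

client = Client('localhost')

row = (
    123456789,
    42.75,
    {'type': 'compute', 'location': 'us-central1'},         # resource JSON
    {'name': 'Compute Engine', 'category': 'IaaS'},         # service JSON
    [{'env': 'prod'}, {'team': 'infra'}],                   # labels Array(JSON)
    [{'name': 'Sustained Use Discount', 'amount': 5.25}],   # credits Array(JSON)
)

# Fails with DB::Exception: Array sizes are too large: ... on this branch,
# even when the Array(JSON) columns are passed as empty lists.
client.execute(
    'INSERT INTO gcp_billing_test '
    '(user_id, cost, resource, service, labels, credits) VALUES',
    [row],
)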

maciej-or avatar May 22 '25 09:05 maciej-or

Thanks for the PR. We really need this supported officially, as we rely heavily on json type.

rcampos87 avatar Aug 04 '25 08:08 rcampos87

I would also like to voice my support for this PR (or any other PR that can implement JSON support for this library). clickhouse-connect offers JSON support, but it isn't able to automatically transform null values into default values for non-nullable columns. clickhouse-driver does offer that feature, but doesn't support JSON.

jgb avatar Aug 05 '25 08:08 jgb

When I tested this PR, I hit one issue: when the JSON contains dynamic keys (like user-defined keys), the SELECT can fail if the data part contains more than 128 distinct keys.

My use case is a JSON field that stores the attributes of a log line. Each log line's attributes can be user-defined (not a fixed set of keys, e.g. one line could have http.response.status = 200 and another postgresql.version = 17). As a result, all log lines collectively can have more than 128 distinct keys.

The following test exhibits the bug:

    def test_json_13(self):
        """ Reproduce an issue when reading a JSON column that has more than 127 distinct keys across all rows
        """
        # The issue doesn't occur with the Memory engine, so use MergeTree
        create_table_template = 'CREATE TABLE test ({}) ENGINE = MergeTree ORDER BY tuple()'

        with self.create_table("a JSON", template=create_table_template):
            unique_keys = [
                f"unique_key_{idx}"
                for idx in range(128)
            ]

            # There is also an issue when we try to insert more than 127 distinct keys in a single
            # query, but that isn't what this test shows, so send the unique keys in multiple inserts.
            max_batch_size = 64
            batches = [unique_keys[i:i + max_batch_size] for i in range(0, len(unique_keys), max_batch_size)]
            for batch in batches:
                data = [
                    ({key: 1},)
                    for key in batch
                ]
                self.client.execute("INSERT INTO test (a) VALUES", data)

            # We submitted multiple inserts (which create multiple data parts). Each of these data parts
            # contains only max_batch_size distinct keys.
            # Once merged into a single data part, the resulting part will contain all unique_keys,
            # which exceeds the threshold of ~120 unique keys.
            # Force a merge of the data parts into a single one.
            self.client.execute("OPTIMIZE TABLE test FINAL")

            query = "SELECT * FROM test"
            result = self.client.execute(query)
            result_keys = {key for row in result for key in row[0].keys()}
            self.assertEqual(result_keys, set(unique_keys))

PierreF avatar Sep 03 '25 10:09 PierreF

Any updates on this?

shachibista avatar Oct 10 '25 08:10 shachibista

When will this get merged?

KabirSingh-Marrow avatar Oct 27 '25 12:10 KabirSingh-Marrow