peerdb icon indicating copy to clipboard operation
peerdb copied to clipboard

Support Nullable(JSON) in ClickHouse

Open aldofunes opened this issue 2 months ago • 4 comments

This is my attempt at addressing issue #3546.

I managed to make it work on my setup, which pulls NULL JSONB columns in a Postgres db into Nullabl(JSON) columns in ClickHouse

aldofunes avatar Oct 24 '25 18:10 aldofunes

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Oct 24 '25 18:10 CLAassistant

Tested this change locally, it looks like with this change a regression is introduced: CDC is no longer replicating data correctly. It replicates empty object for JSON, or NULL for Nullable(JSON)).

(also, looks like our e2e tests for MongoDB should have caught this regression but didn't, so thanks for surfacing this issue.)

jgao54 avatar Nov 04 '25 06:11 jgao54

@jgao54, you are totally right, how did I miss that?

I found that in the clone step, the avro files have a different schema from the CDC step.

Here's an example:

java -jar avro-tools-1.12.1.jar getschema cdc_file.avro

{
  "type" : "record",
  "name" : "_peerdb_raw_aggregators_basculin_dwh",
  "fields" : [ {
    "name" : "_peerdb_uid",
    "type" : "string"
  }, {
    "name" : "_peerdb_timestamp",
    "type" : "long"
  }, {
    "name" : "_peerdb_destination_table_name",
    "type" : "string"
  }, {
    "name" : "_peerdb_data",
    "type" : "string"
  }, {
    "name" : "_peerdb_record_type",
    "type" : [ "null", "long" ]
  }, {
    "name" : "_peerdb_match_data",
    "type" : [ "null", "string" ]
  }, {
    "name" : "_peerdb_batch_id",
    "type" : [ "null", "long" ]
  }, {
    "name" : "_peerdb_unchanged_toast_columns",
    "type" : [ "null", "string" ]
  } ]
}

java -jar avro-tools-1.12.1.jar getschema clone_file.avro

{
  "type" : "record",
  "name" : "aggregators_basculin_activity_views",
  "fields" : [ {
    "name" : "user_id_0",
    "type" : {
      "type" : "string",
      "logicalType" : "uuid"
    }
  }, {
    "name" : "operation_id_1",
    "type" : {
      "type" : "string",
      "logicalType" : "uuid"
    }
  }, {
    "name" : "data_2",
    "type" : [ "null", "string" ]
  }, {
    "name" : "created_at_3",
    "type" : {
      "type" : "long",
      "logicalType" : "timestamp-micros"
    }
  }, {
    "name" : "updated_at_4",
    "type" : {
      "type" : "long",
      "logicalType" : "timestamp-micros"
    }
  } ]
}

In other words, the CDC step is wrapping the record in an envelope, while the Clone step is dumping the record by itself.

For the life of me, I could not get the tests to run on my machine, but I was able to figure out that in the CDC step, it needs to first extract the jsonb column to a Nullable(String) and then cast that to a Nullable(JSON), using JSONExtractString() does not work because null values get converted to an empty string (instead of null) and hence, the cast to Nullable(JSON) fails.

I basically had to split the handling of both JSON and Nullable(JSON) as separate steps. I have zero experience with golang, so please forgive me if my approach is too naive.

aldofunes avatar Nov 05 '25 18:11 aldofunes

Thanks for the contribution! I'll run some tests this week and if things look good will get this merged.

jgao54 avatar Nov 11 '25 09:11 jgao54