vector icon indicating copy to clipboard operation
vector copied to clipboard

ClickHouse Sink - Allow FORMAT options such as JSONStringsEachRow and JSONAsString

Open taythebot opened this issue 3 years ago • 0 comments

A note for the community

  • Please vote on this issue by adding a šŸ‘ reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

ClickHouse supports multiple ways to import raw JSON data. Currently the sink uses "JSONEachRow" by default. While this works for simple JSON structures, it does not work for complex structures.

Take this JSON for example:

{ "foo": "bar", "a": { "b": "c", "d": { "e": "f" } } }

This cannot be imported into ClickHouse using JSONEachRow. To import it you would have to manually reformat the data with remap.

The easiest way would be to use JSONAsString which inserts the entire line as a string. This then imported into a table with the Null engine and a Materialized View is used to parse the JSON inside ClickHouse and write it to another table.

Example from Altinity

create table entrypoint(J String) Engine=Null;
create table datastore(a String, i Int64, f Float64) Engine=MergeTree order by a;

create materialized view jsonConverter to datastore
as select (JSONExtract(J, 'Tuple(String,Tuple(Int64,Float64))') as x),
        x.1 as a,
        x.2.1 as i,
        x.2.2 as f
from entrypoint;

$ echo '{"s": "val1", "b2": {"i": 42, "f": 0.1}}' | \
    clickhouse-client -q "insert into entrypoint format JSONAsString"

$ echo '{"s": "val1","b2": {"i": 33, "f": 0.2}},{"s": "val1","b2": {"i": 34, "f": 0.2}}' | \
   clickhouse-client -q "insert into entrypoint format JSONAsString"

SELECT * FROM datastore;
ā”Œā”€a────┬──i─┬───f─┐
│ val1 │ 42 │ 0.1 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”˜
ā”Œā”€a────┬──i─┬───f─┐
│ val1 │ 33 │ 0.2 │
│ val1 │ 34 │ 0.2 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”˜

https://kb.altinity.com/altinity-kb-schema-design/altinity-kb-jsonasstring-and-mat.-view-as-json-parser/

Attempted Solutions

No response

Proposal

It would be nice to have an option like format that has a default value of JSONEachRow and allows options of JSONStringsEachRow, JSONCompactEachRow, JSONObjectEachRow, and JSONObjectEachRow. This will allow for more flexibility and complex data being inserted into ClickHouse.

[sinks.my_sink_id]
type = "clickhouse"
inputs = [ "my-source-or-transform-id" ]
database = "mydatabase"
endpoint = "http://localhost:8123"
table = "mytable"
compression = "gzip"
format = "JSONAsString"

The code change would be at https://github.com/vectordotdev/vector/blob/master/src/sinks/clickhouse/http_sink.rs#L134

References

Formats: https://clickhouse.com/docs/en/interfaces/formats/#jsonstringseachrow Altinity Guide to importing JSON: https://kb.altinity.com/altinity-kb-schema-design/altinity-kb-jsonasstring-and-mat.-view-as-json-parser/

Version

vector 0.26.0 (x86_64-unknown-linux-gnu c6b5bc2 2022-12-05)

taythebot avatar Jan 04 '23 18:01 taythebot