
Support for CollapsingMergeTree in Clickhouse to avoid 'order by' limitation

Open amadrizwan opened this issue 4 years ago • 0 comments

Problem

Jitsu stores all events from anonymous users and, once a user is identified, updates those events in the DWH with the user id. When a value for one of the identification_nodes is received, the events are replayed, and in the case of Clickhouse, ReplicatedMergeTree should take care of de-duplication. For this to work, the replayed event must have the same values in the "order by" columns as the original event, so it works only as long as "order by" contains just e.g. eventn_ctx_event_id. If we include an identification node such as user_id in "order by", the replayed event is not de-duplicated, because user_id was null in the first event. Having to keep user_id out of "order by" carries a performance penalty when running queries.
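To make the limitation concrete, here is a minimal sketch in Python. It is a toy model of sorting-key-based de-duplication as described above, not ClickHouse or Jitsu code. The `dedupe` helper and the sample values `e1` and `u42` are made up for illustration.

```python
def dedupe(rows, sorting_key):
    """Toy model: keep only the last-inserted row per sorting-key value."""
    latest = {}
    for row in rows:
        key = tuple(row[col] for col in sorting_key)
        latest[key] = row  # a later row with the same key replaces the earlier one
    return list(latest.values())


# Hypothetical sample data: the same event before and after identification.
anonymous = {"eventn_ctx_event_id": "e1", "user_id": None}
replayed = {"eventn_ctx_event_id": "e1", "user_id": "u42"}
rows = [anonymous, replayed]

# Sorting key is the event id only: the replayed row replaces the anonymous one.
by_event_id = dedupe(rows, ["eventn_ctx_event_id"])

# Sorting key also contains user_id: the keys differ (None vs "u42"),
# so both rows survive and the event is duplicated.
by_event_and_user = dedupe(rows, ["eventn_ctx_event_id", "user_id"])

print(len(by_event_id), len(by_event_and_user))
```

With only the event id in the key a single row remains; once user_id is part of the key, the anonymous and the identified copy are treated as different rows.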

Solution

Can we consider using https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/collapsingmergetree/ instead?

This requires Jitsu to add a SIGN column when inserting events. On identification, Jitsu would replay the original events with sign -1 to cancel them, and insert the new events, carrying the identification_node value, with sign 1.

By doing this, we can add the identification_nodes column(s) to the sorting key.
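The proposed flow can be sketched in Python as follows. This is a simplified model of sign-based collapsing, not ClickHouse's actual merge algorithm, which has additional rules that depend on row order and runs in the background at an unspecified time. The helpers `collapse` and `replay_with_identity`, the column name `sign`, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict


def collapse(rows, sorting_key):
    """Simplified collapsing: rows with equal sorting key and opposite sign cancel out."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[col] for col in sorting_key)].append(row)

    survivors = []
    for group in groups.values():
        net = sum(row["sign"] for row in group)
        if net > 0:
            # More "state" rows than "cancel" rows: keep the last state row.
            survivors.append([r for r in group if r["sign"] == 1][-1])
        elif net < 0:
            # More "cancel" rows than "state" rows: keep the first cancel row.
            survivors.append([r for r in group if r["sign"] == -1][0])
        # net == 0: the rows cancel each other and nothing is kept.
    return survivors


def replay_with_identity(event, user_id):
    """Hypothetical replay step: cancel the anonymous row, then insert the identified one."""
    cancel = {**event, "sign": -1}  # same key as the original, including the null user_id
    identified = {**event, "user_id": user_id, "sign": 1}
    return [cancel, identified]


original = {"eventn_ctx_event_id": "e1", "user_id": None, "sign": 1}
table = [original] + replay_with_identity(original, "u42")

# user_id can now be part of the sorting key without leaving a duplicate behind.
result = collapse(table, ["user_id", "eventn_ctx_event_id"])
print(result)
```

Because the cancel row repeats the original key exactly (null user_id included), the anonymous copy collapses away and only the identified row is left. In a real table, rows that have not been merged yet can still be visible, so queries would typically need to account for the sign column, for example by aggregating over it.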

amadrizwan Oct 18 '21 13:10