posthog ClickHouse - KafkaEngine - failure scenarios

Bug description

We should make sure the ClickHouse Kafka engine is configured so that it doesn't DDoS Kafka when something goes wrong. We need a way to configure:

error retries, exponential backoff with jitter
insert rate (going at full speed can melt both CH and Kafka)
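
To make the retry request concrete, here is a minimal Python sketch of capped exponential backoff with full jitter. It is a generic illustration, not a ClickHouse or Kafka engine setting; the names `backoff_delays` and `call_with_retries` and the default values are hypothetical.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one sleep duration (seconds) per retry.

    The ceiling doubles on every retry up to `cap`; "full jitter" then picks
    a random point between 0 and that ceiling so that many clients failing
    at the same moment do not all retry at the same moment.
    """
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng() * ceiling


def call_with_retries(operation, max_retries=5, sleep=time.sleep, **backoff_kwargs):
    """Call `operation` once, then retry up to `max_retries` times on failure."""
    delays = backoff_delays(max_retries, **backoff_kwargs)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                # Retry budget exhausted: re-raise the last error.
                raise
            sleep(delay)
```

With `rng` fixed to return 1.0, the delays are simply the ceilings (0.5, 1, 2, 4, ... up to `cap`), which makes the schedule easy to inspect; in real use the random factor spreads retries out.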

Environment

[X] PostHog Cloud
[X] self-hosted PostHog (ClickHouse-based), version/commit: please provide

May 12 '22 09:05 guidoiaquinti

So there isn't too much we can do in terms of configuration here. We've set kafka_skip_broken_messages but it seems that for some serialization errors this isn't respected.

Things to consider:

Increasing kafka_num_consumers
Seeing if 22.3 fixes this issue
Trying a different format for the data
Keep fixing each individual problem as it comes about

If we keep having problems we might just need to write our own service to insert into ClickHouse.

May 12 '22 10:05 yakkomajuri

Although looking at this again it seems kafka_skip_broken_messages is not set on the DLQ. I'll configure that and it should handle errors like these.

It's a terrible pattern, though, that when CH isn't set to skip broken messages it keeps coming back for the same message batch indefinitely.

May 12 '22 10:05 yakkomajuri

Once we verify those issues are not present in the latest CH version, what do you think about opening an issue upstream to keep track of the enhancements?

May 12 '22 11:05 guidoiaquinti

Here's what kafka_skip_broken_messages is currently set to in code and in cloud (via SHOW CREATE TABLE kafka_...):

Table	cloud	code
events_json	100	100 - code
events proto	already nuked	100 - code
session recordings	0	0 - code
person	0	0 - code
DLQ	0	1000 - in Yakko's PR
groups	10	0 - code
plugin log entries	0	0 - code

Note that for groups, the code is out of sync with what is actually running in cloud. Should we set kafka_skip_broken_messages to 100 everywhere (except the DLQ, which we'd set to 1000)?
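
As a rough sketch of that comparison, the Python below encodes the values from the table and works out which tables differ between cloud and code, and which would need changing to reach the proposed 100 / 1000 values. The dictionary keys are the table labels used above, not actual ClickHouse table names, and the helper names are hypothetical.

```python
# (cloud, code) values of kafka_skip_broken_messages, copied from the table.
# None means the table no longer exists in cloud ("already nuked").
CURRENT = {
    "events_json": (100, 100),
    "events proto": (None, 100),
    "session recordings": (0, 0),
    "person": (0, 0),
    "DLQ": (0, 1000),  # the code value comes from the open PR, not merged code
    "groups": (10, 0),
    "plugin log entries": (0, 0),
}


def proposed_value(table):
    """Proposal from the thread: 100 everywhere, 1000 for the DLQ."""
    return 1000 if table == "DLQ" else 100


def out_of_sync(current):
    """Tables that exist in cloud but whose cloud and code values differ."""
    return sorted(
        table
        for table, (cloud, code) in current.items()
        if cloud is not None and cloud != code
    )


def changes_needed(current):
    """Map each table to the sides ("cloud", "code") that miss the proposed value."""
    changes = {}
    for table, (cloud, code) in current.items():
        target = proposed_value(table)
        stale = [
            side
            for side, value in (("cloud", cloud), ("code", code))
            if value is not None and value != target
        ]
        if stale:
            changes[table] = stale
    return changes
```

Here `out_of_sync(CURRENT)` lists groups and also the DLQ, the latter only because its code value is taken from the pending PR.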

May 12 '22 16:05 tiina303

Yeah, let's set 100 everywhere. The DLQ can be higher as it is less important.

Could you write that up for us @tiina303?

May 12 '22 17:05 yakkomajuri

This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

May 13 '24 07:05 posthog-bot

This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.

May 27 '24 07:05 posthog-bot
