ClickHouse - KafkaEngine - failure scenarios

Open guidoiaquinti opened this issue 2 years ago • 6 comments

Bug description

We should make sure the ClickHouse Kafka engine is configured so that it doesn't DDoS Kafka when something goes wrong. We need a way to configure:

  • error retries, exponential backoff with jitter
  • insert rate (running at full speed can melt both CH and Kafka)
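
For reference, here is a minimal sketch of what "exponential backoff with jitter" means for a retrying client. It is generic Python, not a ClickHouse or Kafka setting; the names `backoff_delay` and `call_with_retries` and the default values are assumptions made up for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Return a sleep time for the given retry attempt (0-based).

    The ceiling doubles with every attempt up to `cap`; the actual delay is
    a random fraction of that ceiling ("full jitter") so that many clients
    failing at once do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(operation, max_retries=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Run `operation`, retrying on any exception up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(backoff_delay(attempt, base, cap, rng))
```

`sleep` and `rng` are injectable only so the behaviour can be checked without waiting or relying on randomness.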

Environment

  • [X] PostHog Cloud
  • [X] self-hosted PostHog (ClickHouse-based), version/commit: please provide

guidoiaquinti avatar May 12 '22 09:05 guidoiaquinti

So there isn't too much we can do in terms of configuration here. We've set kafka_skip_broken_messages but it seems that for some serialization errors this isn't respected.

Things to consider:

  1. Increasing kafka_num_consumers
  2. Seeing if 22.3 fixes this issue
  3. Trying a different format for the data
  4. Keep fixing each individual problem as it comes about

If we keep having problems we might just need to write our own service to insert into ClickHouse.

yakkomajuri avatar May 12 '22 10:05 yakkomajuri

Although looking at this again it seems kafka_skip_broken_messages is not set on the DLQ. I'll configure that and it should handle errors like these.

It's a terrible pattern in CH, though: if a table isn't set to skip broken messages, it keeps coming back for the same message batch indefinitely.
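
To make the difference concrete, here is a rough Python model of a per-block tolerance for unparseable messages. It only illustrates the idea behind kafka_skip_broken_messages; it is not ClickHouse's implementation, `parse_block` is a hypothetical name, and JSON is assumed as the message format.

```python
import json


def parse_block(raw_messages, skip_broken_messages=0):
    """Parse one block of raw messages, tolerating a limited number of bad ones.

    Returns (rows, skipped). Raises ValueError as soon as the number of
    unparseable messages exceeds `skip_broken_messages`, in which case the
    whole block fails and a consumer would see the same batch again.
    """
    rows = []
    skipped = 0
    for raw in raw_messages:
        try:
            rows.append(json.loads(raw))
        except json.JSONDecodeError:
            skipped += 1
            if skipped > skip_broken_messages:
                raise ValueError(
                    f"{skipped} broken messages exceeds the limit of "
                    f"{skip_broken_messages}"
                )
    return rows, skipped
```

With a tolerance of 0, a single bad message fails the whole block, which is the "same batch indefinitely" behaviour described above. As noted earlier in the thread, some serialization errors appear not to be covered by the real setting, so this model is optimistic.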

yakkomajuri avatar May 12 '22 10:05 yakkomajuri

Once we verify those issues are not available in the latest CH version, what do you think about opening an issue upstream to keep track of the enhancements?

guidoiaquinti avatar May 12 '22 11:05 guidoiaquinti

Here's what we currently have in code & cloud (via SHOW CREATE TABLE kafka_...):

Table                 cloud            code
events_json           100              100 (code)
events proto          already nuked    100 (code)
session recordings    0                0 (code)
person                0                0 (code)
DLQ                   0                1000 (in Yakko's PR)
groups                10               0 (code)
plugin log entries    0                0 (code)

Note that for groups the code is out of sync with reality in cloud. Should we set kafka_skip_broken_messages to 100 everywhere (except the DLQ, which we would set to 1000)?
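
As a quick sanity check of what that proposal would change, here is a small Python sketch that compares the values from the table against the proposed targets. The dictionary just transcribes the table; `pending_changes` and the variable names are made up for illustration and nothing here talks to ClickHouse.

```python
# Values transcribed from the table above; None means the table no longer
# exists in that environment ("already nuked").
SKIP_BROKEN_MESSAGES = {
    "events_json": {"cloud": 100, "code": 100},
    "events proto": {"cloud": None, "code": 100},
    "session recordings": {"cloud": 0, "code": 0},
    "person": {"cloud": 0, "code": 0},
    "DLQ": {"cloud": 0, "code": 1000},
    "groups": {"cloud": 10, "code": 0},
    "plugin log entries": {"cloud": 0, "code": 0},
}


def pending_changes(current, default=100, overrides=None):
    """List (table, environment, current_value, target_value) for every mismatch."""
    overrides = overrides or {}
    changes = []
    for table, envs in current.items():
        target = overrides.get(table, default)
        for env, value in envs.items():
            if value is None:
                continue  # nothing to change where the table is gone
            if value != target:
                changes.append((table, env, value, target))
    return changes


if __name__ == "__main__":
    for change in pending_changes(SKIP_BROKEN_MESSAGES, overrides={"DLQ": 1000}):
        print(change)
```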

tiina303 avatar May 12 '22 16:05 tiina303

Yeah let's set 100 everywhere. DLQ can be more as it is less important.

Could you write that up for us @tiina303?

yakkomajuri avatar May 12 '22 17:05 yakkomajuri

This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

posthog-bot avatar May 13 '24 07:05 posthog-bot

This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.

posthog-bot avatar May 27 '24 07:05 posthog-bot