ClickHouse - KafkaEngine - failure scenarios
Bug description
We should make sure the ClickHouse Kafka engine is configured so that it doesn't DDoS Kafka when something goes wrong. We should make sure we have a way to configure:
- error retries with exponential backoff and jitter
- insert rate (running at full speed can melt both ClickHouse and Kafka); the sketch after this list shows the knobs ClickHouse currently exposes
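ClickHouse doesn't expose a retry/backoff policy on the Kafka engine itself (backoff is mostly tuned through librdkafka options in the server config's `<kafka>` section), but a few table-level settings do bound consumption rate and error tolerance. A minimal sketch, assuming hypothetical table, topic, and broker names:

```sql
-- Illustrative only: table, topic, and broker names are hypothetical.
CREATE TABLE kafka_events_example
(
    uuid String,
    event String,
    timestamp DateTime64(6)
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events_example',
    kafka_group_name = 'clickhouse_events_example',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 1,           -- consumer threads for this table
    kafka_max_block_size = 65536,      -- max messages per poll/insert batch
    kafka_skip_broken_messages = 100;  -- tolerated unparseable messages per block
```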
Environment
- [X] PostHog Cloud
- [X] self-hosted PostHog (ClickHouse-based), version/commit: please provide
So there isn't too much we can do in terms of configuration here. We've set `kafka_skip_broken_messages`, but it seems that for some serialization errors this isn't respected.
Things to consider:
- Increasing `kafka_num_consumers`
- Seeing if ClickHouse 22.3 fixes this issue
- Trying a different format for the data
- Continuing to fix each individual problem as it comes up
If we keep having problems we might just need to write our own service to insert into ClickHouse.
Although looking at this again, it seems `kafka_skip_broken_messages` is not set on the DLQ. I'll configure that, and it should handle errors like these.
It's a terrible pattern on ClickHouse's part, though: if the table isn't set to skip broken messages, it keeps retrying the same message batch indefinitely.
Once we verify those issues are no longer present in the latest ClickHouse version, what do you think about opening an issue upstream to track the enhancements?
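For that upstream discussion, it may be worth noting that recent ClickHouse versions also support `kafka_handle_error_mode = 'stream'`, which surfaces parse failures through the `_error` and `_raw_message` virtual columns instead of blocking or silently skipping the batch. A hedged sketch, with hypothetical table and topic names, of routing those failures aside:

```sql
-- Hypothetical names throughout; sketch of the 'stream' error mode.
CREATE TABLE kafka_events_stream
(
    uuid String,
    event String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events_example',
    kafka_group_name = 'clickhouse_events_stream',
    kafka_format = 'JSONEachRow',
    kafka_handle_error_mode = 'stream';  -- don't fail the batch on parse errors

-- Destination for malformed messages.
CREATE TABLE events_parse_errors
(
    topic String,
    raw String,
    error String
)
ENGINE = MergeTree
ORDER BY topic;

-- Rows with a non-empty _error failed to parse; route them aside.
CREATE MATERIALIZED VIEW events_parse_errors_mv TO events_parse_errors AS
SELECT _topic AS topic, _raw_message AS raw, _error AS error
FROM kafka_events_stream
WHERE length(_error) > 0;
```

A second materialized view filtering on an empty `_error` would carry the well-formed rows to the real destination table.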
Here's what we currently have in code & cloud (via `SHOW CREATE TABLE kafka_...`):
| Table | cloud | code |
|---|---|---|
| events_json | 100 | 100 |
| events proto | already nuked | 100 |
| session recordings | 0 | 0 |
| person | 0 | 0 |
| DLQ | 0 | 1000 (in Yakko's PR) |
| groups | 10 | 0 |
| plugin log entries | 0 | 0 |
Note that for groups, the code is out of sync with what's actually running in cloud.
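One way to re-check these numbers in each environment (the table name here is illustrative):

```sql
-- Show the full DDL (including kafka_skip_broken_messages) for one table:
SHOW CREATE TABLE kafka_events_json;

-- Or list the settings of every Kafka engine table at once:
SELECT name, engine_full
FROM system.tables
WHERE engine = 'Kafka';
```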
Should we set `kafka_skip_broken_messages` everywhere to 100 (except the DLQ, which we'd set to 1000)?
Yeah let's set 100 everywhere. DLQ can be more as it is less important.
Could you write that up for us @tiina303?
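For the write-up: since Kafka engine settings generally can't be altered in place on the versions in use here, the change would be a drop-and-recreate per table, roughly like this (hypothetical names and schema):

```sql
-- Kafka engine tables store no data, and offsets live in Kafka under
-- kafka_group_name, so consumption resumes where it left off.
DROP TABLE IF EXISTS kafka_events_json;

CREATE TABLE kafka_events_json
(
    uuid String,
    event String,
    properties String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events_json',
    kafka_group_name = 'clickhouse_events_json',
    kafka_format = 'JSONEachRow',
    kafka_skip_broken_messages = 100;  -- 1000 for the DLQ table
```

Any materialized view attached to the Kafka table simply stops consuming while the table is gone and picks back up once it's recreated.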
This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the `stale` label – otherwise this will be closed in two weeks.
This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.