
Bubble Postgres connection errors up to main thread

Open yakkomajuri opened this issue 1 year ago • 1 comments

Currently our default approach for handling Postgres being unavailable is to send events to the dead letter queue (DLQ).

At the moment we don't do anything with these events, even though we have been meaning to for a while.

However, if we're sure Postgres is down, we probably shouldn't do this. Instead, we should consider bubbling the error up to the main thread so that the offset is not committed and the events are reprocessed by default.

This would be a better pattern than sending messages that we couldn't process because of ephemeral issues to the DLQ and, at least until we build a better retry mechanism, it would prevent us from losing events.

A follow-up here is what we should do if we fail to enqueue a buffer job. Through #11784 we're looking for ways to keep that from happening, so this might be a non-issue. But for now, if Aurora is unavailable, events that would go to the buffer end up in the dead letter queue. One approach is to just skip the buffer if we can't enqueue the buffer job. This is not great, as it could mess up metrics, but it's better than what we currently do: send the event to the DLQ and never process it (and even if we did process it later, we might mess up metrics that way too).

Back to the original problem, we need to carefully notify the main thread of the error, so it can throw inside the consumer but in a way that's handled correctly.
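To make the proposal concrete, here is a minimal sketch of the consumer-side behavior, not the actual plugin-server code. Every name in it (`DependencyUnavailableError`, `Message`, `Sinks`, `processBatch`) is hypothetical, and it is synchronous for brevity. It also leaves out the hop from the worker thread to the main thread, which would need its own way of marking the error as "dependency down" when it crosses the thread boundary.

```typescript
// Hypothetical error type: marks a failure caused by a dependency
// (e.g. Postgres) being unavailable, as opposed to a bad event.
class DependencyUnavailableError extends Error {
    constructor(message: string) {
        super(message)
        this.name = 'DependencyUnavailableError'
        // Keeps `instanceof` working when compiling to older JS targets.
        Object.setPrototypeOf(this, DependencyUnavailableError.prototype)
    }
}

// Hypothetical message shape, reduced to what the sketch needs.
interface Message {
    offset: number
    payload: string
}

// Hypothetical hooks standing in for ingestion, the DLQ producer and the offset commit.
interface Sinks {
    handle: (message: Message) => void
    sendToDlq: (message: Message, error: Error) => void
    commitOffset: (offset: number) => void
}

function processBatch(messages: Message[], sinks: Sinks): void {
    for (const message of messages) {
        try {
            sinks.handle(message)
        } catch (error) {
            if (error instanceof DependencyUnavailableError) {
                // Bubble up: nothing is committed for this message, so it and
                // everything after it in the batch will be consumed again.
                throw error
            }
            // Any other failure is treated as specific to this event: DLQ it and move on.
            sinks.sendToDlq(message, error instanceof Error ? error : new Error(String(error)))
        }
        // Kafka convention: the committed offset is the next offset to read.
        sinks.commitOffset(message.offset + 1)
    }
}
```

The design choice here is that only errors explicitly tagged as "dependency unavailable" stop the batch. Everything else still goes to the DLQ, so a single malformed event cannot block the partition.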

Thoughts @macobo @tiina303 ?

yakkomajuri avatar Sep 12 '22 18:09 yakkomajuri

Sounds like a good idea. One concern is that if Postgres is down due to a load spike (imagine a DDoS), this could make things worse. Related issue: https://github.com/PostHog/posthog/issues/10396

tiina303 avatar Sep 19 '22 12:09 tiina303

How could things be made worse?

The idea is to use Kafka as the buffer if Postgres is hard down, i.e. rather than picking the events up from Kafka and sending them to the DLQ, just let them sit in Kafka.

yakkomajuri avatar Sep 23 '22 14:09 yakkomajuri