
Guaranteeing job execution

yakkomajuri opened this issue on Sep 12 '22

TL;DR

Let's use the main DB as a backup for Aurora when enqueuing jobs.

See https://github.com/PostHog/posthog/pull/11785

Problem

Our jobs pipeline is currently not robust enough.

Initially we noticed it could not withstand even minimal Aurora downtime, so we implemented in-memory retries (#11561). This cut the number of jobs lost due to Aurora scaling down to zero.

[Screenshot (2022-09-12): graph showing jobs still being lost during extended Aurora downtime]

However, as one can see from the graph above, we were still not great at handling Aurora being down for an extended period of time.

Solution

What we need is a tiered queueing system that tries to enqueue the job on the next available queue when a given queue fails, and so on.

This was thought through during the first implementation of job queues, where Marius envisioned a 3-queue system to start:

  1. Memory queue
  2. Graphile (worker) queue
  3. S3 queue

It was thus implemented with this in mind, and the code already loops through the available queues when trying to enqueue a job. Except we currently only have one queue in place.
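
For illustration only, that loop could look roughly like this (the `Job` and `JobQueue` shapes and names here are a sketch, not the actual plugin-server code):

```typescript
interface Job {
    type: string
    payload: Record<string, any>
    timestamp: number // epoch ms at which the job should run
}

interface JobQueue {
    enqueue(job: Job): Promise<void>
}

// Walk the queues in order of preference and fall through to the next tier
// whenever enqueuing fails.
async function enqueueWithFallback(queues: JobQueue[], job: Job): Promise<void> {
    for (const queue of queues) {
        try {
            await queue.enqueue(job)
            return
        } catch {
            // this tier is unavailable, try the next one
        }
    }
    throw new Error(`Could not enqueue job of type ${job.type} on any queue`)
}
```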

As a result, I got to thinking about our options for new queues that would work both for us and for self-hosted folks.

a) Memory

A memory queue is more of a nice-to-have than something we can rely on. Essentially, if it's already time to run a job by the time it gets enqueued, pop it onto the memory queue instead. This avoids the overhead of going over the network to talk to an external service.

This would be very useful given that most of the jobs are triggered with runNow() or runIn(3, 'seconds'). But again, not too useful for robustness.
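
As a rough sketch of that fast path (everything here is illustrative; the caller would pass in the actual job runner and the external enqueue function):

```typescript
interface EnqueuedJob {
    type: string
    payload: Record<string, any>
    timestamp: number // epoch ms at which the job should run
}

// If the job is already due (e.g. runNow() or a very short runIn()), execute it
// straight from memory and skip the network round-trip to the external queue.
async function runNowOrEnqueue(
    job: EnqueuedJob,
    runJob: (job: EnqueuedJob) => Promise<void>,
    enqueue: (job: EnqueuedJob) => Promise<void>
): Promise<void> {
    if (job.timestamp <= Date.now()) {
        await runJob(job)
    } else {
        await enqueue(job)
    }
}
```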

Note that I've had a call with the creator/maintainer of Graphile Worker and he did mention potentially baking this directly into the Graphile Worker Node client.

Main questions:

  • What happens on crashes? We can try to enqueue the pending jobs somewhere else at the last minute, but we are likely to lose jobs here.

b) S3

S3 was part of the original spec, and the queue was even built out. It never worked in production, however.

S3 is one option we have, particularly now that MinIO ships for self-hosted folks. The main issue with it is the implementation and maintenance overhead.

c) Redis

What other data stores do self-hosted users have access to? Well, Redis!

Redis has a good number of useful commands for dealing with sets and queues and could be a nice option here.
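
For instance, a delayed-job queue could be sketched on top of a Redis sorted set scored by the run timestamp (illustrative only; the key name and job shape are made up, and ioredis is assumed as the client):

```typescript
import Redis from 'ioredis'

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379')
const QUEUE_KEY = 'jobs:delayed' // hypothetical key name

// Store the serialized job in a sorted set, scored by when it should run.
async function enqueueToRedis(job: { type: string; payload: any; timestamp: number }): Promise<void> {
    await redis.zadd(QUEUE_KEY, job.timestamp, JSON.stringify(job))
}

// Periodically pop everything whose scheduled time has passed and run it.
async function runDueJobs(runJob: (job: any) => Promise<void>): Promise<void> {
    const due = await redis.zrangebyscore(QUEUE_KEY, '-inf', Date.now())
    for (const member of due) {
        // zrem returns 1 only for the consumer that actually removed the member,
        // so concurrent consumers won't run the same job twice.
        if ((await redis.zrem(QUEUE_KEY, member)) === 1) {
            await runJob(JSON.parse(member))
        }
    }
}
```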

Main questions:

  • We treat Redis (rightfully so) as ephemeral storage. What happens if it crashes/the keys get evicted?

d) Double Postgres setup

I think this is by far the best option in that it meets our criteria for robustness, carries the lowest implementation/maintenance overhead, will work for self-hosted users, and can be shipped today.

Essentially, let's use two Postgres instances. We already have two Postgres instances on Cloud (the main DB, and the Aurora DB for jobs), so we might as well leverage them.

Concretely, let's allow a graphile-backup queue to be turned on, positioned after the Aurora DB queue. Thus, if and only if we cannot enqueue to Aurora, we drop the job in the main DB.

This way, enqueuing a job looks like this:

  1. Try to enqueue to Aurora
  2. Try to enqueue to main DB
  3. Retry from step 1 with exponential backoff

That should offer us strong guarantees without any extra implementation work. Both queues will be backed by Graphile Worker, so debugging is easier.
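
A sketch of what that enqueue path could look like with two Graphile Worker connections (the env var names and retry numbers here are made up for illustration):

```typescript
import { makeWorkerUtils, WorkerUtils } from 'graphile-worker'

// Primary = Aurora jobs DB, backup = main DB. Env var names are hypothetical.
async function setupJobQueues(): Promise<WorkerUtils[]> {
    const aurora = await makeWorkerUtils({ connectionString: process.env.JOBS_DATABASE_URL })
    const mainDb = await makeWorkerUtils({ connectionString: process.env.DATABASE_URL })
    return [aurora, mainDb]
}

async function enqueueJob(
    queues: WorkerUtils[],
    taskIdentifier: string,
    payload: Record<string, any>,
    runAt: Date,
    maxAttempts = 5
): Promise<void> {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        // Steps 1 and 2: try Aurora first, then fall back to the main DB.
        for (const queue of queues) {
            try {
                await queue.addJob(taskIdentifier, payload, { runAt })
                return
            } catch {
                // this Postgres instance is unavailable, try the next one
            }
        }
        // Step 3: retry the whole sequence with exponential backoff.
        await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000))
    }
    throw new Error(`Could not enqueue ${taskIdentifier} on any Postgres queue`)
}
```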

If self-hosted users want the same guarantees, they can spin up a separate DB for jobs, and their main DB will be used as a backup.

e) ClickHouse

Something to maybe consider, but unlike with Postgres, we'd need to build the queueing scaffolding from scratch, and ClickHouse isn't well suited to write-heavy workloads like this.

f) Multi-node jobs DB setup

Something we could consider for Cloud, but definitely out of the question for self-hosted. Too much overhead either way.

g) Kafka

We've tried processing events from Kafka on a schedule before. It wasn't made for that, so I wouldn't want to do this. The only way Kafka might come in is as a retry queue, not a processing queue.

Closing remarks

Ultimately I think we should do option (d) immediately. In the future we may want to add more queues, but for now I propose this as the course of action.

yakkomajuri avatar Sep 12 '22 15:09 yakkomajuri

This will improve the odds that at least one Postgres instance is available.

I'm not totally up to speed with how the queuing / buffering scheduling works, so I'm interested to know: if, say, both Postgres instances are down for a day but then become available again, what happens to the events from the intervening period? Would we end up with the same results in ClickHouse as if the Postgres instances had been available the whole time?

hazzadous avatar Sep 12 '22 15:09 hazzadous

At the moment, if Postgres is down we'll be sending events to the dead letter queue anyway. Currently nothing happens to these, but I'd like to change that (see #11749).

yakkomajuri avatar Sep 12 '22 15:09 yakkomajuri

@hazzadous and I discussed this and came to the following conclusion:

Jobs to be enqueued will be sent to a new Kafka topic. Upon consuming from that topic, if it's already time to run the job, we'll trigger it immediately. Otherwise we'll drop it into Graphile, where we'll reduce the polling interval.

The goal with this approach is to keep our core dependencies minimal. We rely heavily on Kafka, so we'll make use of it here rather than adding another core dependency via an approach like the one suggested in #11785.
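
The consumer side of that could look roughly like the following sketch (the topic name, group id, and helper functions are hypothetical; kafkajs is used for illustration):

```typescript
import { Kafka } from 'kafkajs'

const kafka = new Kafka({ brokers: (process.env.KAFKA_HOSTS ?? 'localhost:9092').split(',') })

async function startJobsConsumer(
    runJob: (job: any) => Promise<void>,
    enqueueToGraphile: (job: any) => Promise<void>
): Promise<void> {
    const consumer = kafka.consumer({ groupId: 'jobs-consumer' }) // hypothetical group id
    await consumer.connect()
    await consumer.subscribe({ topic: 'jobs' }) // hypothetical topic name
    await consumer.run({
        eachMessage: async ({ message }) => {
            const job = JSON.parse(message.value!.toString())
            if (job.timestamp <= Date.now()) {
                await runJob(job) // already due: trigger it immediately
            } else {
                await enqueueToGraphile(job) // not yet due: park it in Graphile
            }
        },
    })
}
```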

yakkomajuri avatar Sep 13 '22 18:09 yakkomajuri

cc @macobo who had some thoughts on this

yakkomajuri avatar Sep 23 '22 14:09 yakkomajuri

I suggest trying out this solution before building a tiered system: let's create a new RDS-backed database as the main Graphile work queue.

Everything described in this issue sounds valid and gets us ready for 2024, but I think it's overkill for now, and we might achieve enough nines just by switching away from Aurora. This is relatively easy to validate by switching URLs around and watching the job stats; if they're still bad, let's go down the path of more complexity.

As to how we might consume from two Graphile queues at once: set up separate env vars for READ and WRITE Graphile queues, set up RDS, point reads and writes at RDS, and set up temporary async workers to read from the old queue.
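
For illustration, reading from both queues during the switchover could be as simple as running one Graphile Worker runner per connection string (the env var names here are hypothetical):

```typescript
import { run, TaskList } from 'graphile-worker'

// One worker per queue: the new RDS-backed instance plus, temporarily, the old
// Aurora-backed one so it gets drained. Env var names are hypothetical.
async function startGraphileWorkers(taskList: TaskList): Promise<void> {
    const readUrls = [
        process.env.JOB_QUEUE_GRAPHILE_READ_URL, // new RDS-backed queue
        process.env.JOB_QUEUE_GRAPHILE_OLD_URL, // old queue, read by a temporary worker
    ].filter((url): url is string => Boolean(url))

    await Promise.all(
        readUrls.map((connectionString) => run({ connectionString, taskList, concurrency: 5 }))
    )
}
```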

macobo avatar Sep 27 '22 07:09 macobo