
Background workers

RealOrangeOne opened this issue 1 year ago · 17 comments

This is a proposal for adding a background workers interface to Django, to better facilitate developers (both working with Django and library maintainers) moving certain tasks from their web processes to separate background runners.

This work originally started in Wagtail as RFC 72, but was found to be generic enough to also apply to Django, so @carltongibson suggested I adapt it to a DEP. See also my more complete code examples.

Got feedback? Feel free to leave a comment! Got bigger feedback? I'm more than happy to chat on Twitter, Mastodon, or another channel if you already know me (so this PR doesn't get too messy).


RealOrangeOne avatar Feb 09 '24 12:02 RealOrangeOne

Thanks so much for getting to this so quickly @RealOrangeOne! 🎁

I’ll have a proper read over the weekend but, one initial remark: I don’t think this should be in the contrib namespace. Either django.tasks or django.core.tasks. (I kind of prefer the former: from django import tasks would be 🥰)

carltongibson avatar Feb 09 '24 13:02 carltongibson

Hello! Thanks for putting this proposal up, IMO this would be great to have. Dropping some comments here for completeness (as we discussed in DMs), though I'm not sure this is the right place.

Task/transaction race condition

In a backend which queues tasks onto an external queue (let's say, Redis), there's a race condition between a task runner starting the task and the transaction being committed. If I as a caller create a PaymentRequest entity and then enqueue a task to process it, the task runner might start processing the task before the transaction has committed, and so the task runner won't be able to see the PaymentRequest entity. The way to avoid the race condition is to use transaction.on_commit to enqueue the task only after the transaction has committed.
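
For illustration, a minimal sketch of that workaround (the process_payment.enqueue call is a placeholder for whatever the final API looks like, and myapp is hypothetical):

```python
from django.db import transaction

from myapp.models import PaymentRequest  # hypothetical app/model
from myapp.tasks import process_payment  # hypothetical task


def create_payment_request(data):
    with transaction.atomic():
        payment = PaymentRequest.objects.create(**data)
        # Enqueueing here directly would let a worker pick the task up before
        # the commit; on_commit defers it until the row is actually visible.
        transaction.on_commit(lambda: process_payment.enqueue(payment.pk))
```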

The problem is, beyond being a bit of a pain to do manually and being easily forgotten, it's not clear what is best to do in the DatabaseBackend here. For the DatabaseBackend the opposite applies – because we're enqueueing the task inside the database, we want that to be transactional, and it's most unclear to me what the guaranteed semantics are of doing database writes in transaction.on_commit. For code which wants to portably support both backends, which I presume is the goal for Wagtail, this would presumably mean conditionally using or not using transaction.on_commit based on the backend, which is quite a pain, and doesn't necessarily generalise to new backends.

Transactions in ImmediateBackend

There are some fairly important subtleties with transactions in ImmediateBackend which need nailing down for the non-autocommit case. Specifically: does the task run in the same transaction as the caller, and how?

Case 1: Task runs in the same transaction as the caller

This is the 'obvious' implementation but comes with some major hazards.

First, if the task errors, this can leave the transaction in an inconsistent or aborted state, which is a fairly major footgun, not to mention a substantially different set of problems than would be experienced in production.

Second, in the case of asynchronous task runners, there is no guarantee that the caller will wait for the task to complete before closing the transaction, which would leave the task in a very peculiar state.

Case 2: Task runs in the same transaction as the caller, but with a savepoint

This is a bit more robust, but it only limits rather than fixes the first problem, and does nothing for the second. Releasing a savepoint is also not the same thing as committing a transaction, so this can't replicate on-commit behaviours either from the database or from Django: the database can surface errors at commit time which won't show up here, and Django's on_commit callbacks will run when the caller commits, not when the task completes.

Case 3: Task runs in a separate transaction

This provides the most robust semantics, and best mirrors the production use, but also opens up two pretty big cans of worms:

  1. How is this even implemented? A separate connection to the database? Are we talking threads? Does that work for both synchronous and asynchronous code? What about the transaction.on_commit problem?
  2. What does this mean for tests? Tests are likely to be one of the major uses (if not the major use) of the ImmediateBackend, but for fast tests we definitely want to be rolling back transactions after each test rather than manually TRUNCATE-ing each table to reset it. How would or could that interact with the separate transaction for the task? Can asynchronous, or even synchronous, tasks dangle after a test has finished?

Serialization in DatabaseBackend

I presume that the args and kwargs are serialized with Pickle in the DatabaseBackend? Some considerations:

  1. This is probably worth documenting explicitly, because it might be something of a surprise if switching to a different backend.
  2. Will the ImmediateBackend replicate the same serialization? If not, this could be a source of surprises when switching between backends. I'm also not sure of the implications of passing ORM objects as arguments to tasks and the difference between passing them with and without intermediate serialization.
  3. Moving a class to a different module, or a few other innocuous changes, will cause task dequeuing to suddenly break after a release for tasks enqueued beforehand. This is potentially quite surprising. Perhaps there's something to be learned from the serialization infrastructure for sessions, which have quite similar problems?
  4. If this used Pickle protocol version -1 (i.e. the latest) then there's a race condition around releases if a new version were released – producers would immediately start enqueuing tasks with the new version, but consumers would still be running the old version. Should the version be pinned? This is maybe not fundamentally different from the problem of rolling out new versions of classes which are being pickled and passed around in general, but it's also potentially much more surprising to developers if it happens "behind their backs". For instance, if you upgraded Python version in production, you might unexpectedly not be able to roll back because all the enqueued tasks are unexpectedly not backwards-compatible.
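
To make the protocol-versioning point concrete (illustrative only; nothing here is specified by the proposal):

```python
import pickle

payload = ("myapp.tasks.send_email", ["user@example.com"], {})

# protocol=-1 tracks pickle.HIGHEST_PROTOCOL, which changes with the Python
# version: producers on a newer interpreter can emit data that consumers
# still running the old interpreter cannot load.
latest = pickle.dumps(payload, protocol=-1)

# Pinning the protocol keeps producers and consumers compatible during a
# staggered rollout (protocol 4 is readable on Python 3.4+).
pinned = pickle.dumps(payload, protocol=4)
```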

Standard priority ordering

Your priority ordering is an int, but there are differing conventions for which direction the ordering goes. There's quite varied prior art here:

Lower is more important: UNIX (nice levels), Celery (Redis backend)

Higher is more important: Windows thread scheduling priorities, AMQP (including RabbitMQ), Celery (RabbitMQ backend)

It seems like it would be a good idea to standardise on one of these because otherwise there's no way to portably do prioritisation, and portable prioritisation is presumably why that argument is there in the first place.

For backends which can't support priority, which includes potentially fairly major targets like SQS and Google Cloud Pub/Sub, what's the behaviour if you pass a priority argument? Is it silently ignored, and is that safe?

Dedicated capacity

This configuration semantically puts all tasks into one (priority) queue. In previous projects I (and presumably others) have relied on being able to provision dedicated capacity with things split into multiple named queues, even if there's a default. Having a way to (1) specify this portably and (2) implement it with the DatabaseBackend would be fairly important for me. YMMV.

Reliability semantics and retrying

For automatic task retrying: whenever I've worked with queue infrastructure that doesn't have reliability semantics built in, I've eventually found it necessary to emulate them. If this has been others' experience (and I'm not a terribly interesting developer, so if I've experienced this I suspect a lot of people have), it would be good to take the opportunity here.

It also affects the design of the interface, so IMO it's at least worth spelling out how this would work even if the implementations are left for later.

When a task fails, it's necessary to have some way of signalling whether or not it should be retried. If my email sending task fails because the SMTP server is down, I probably want to retry it; if it fails because the email address is invalid, or because I can't actually decode the task data due to a Pickle upgrade, I probably don't want to retry it. The specification here doesn't have a way to portably signal that though, and TaskStatus has only Failed in the portable subset, where it may be useful to differentiate retryable and non-retryable failures.
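
One convention I've seen elsewhere (purely illustrative - nothing like this is in the specification) is a dedicated exception class that the runner treats as "retry me":

```python
from django.core.mail import send_mail


class RetryableTaskError(Exception):
    """Raised by a task to tell the runner the failure is transient."""


def send_invoice_email(invoice_id, recipient):
    try:
        send_mail(
            subject=f"Invoice {invoice_id}",
            message="Your invoice is attached.",
            from_email="billing@example.com",
            recipient_list=[recipient],
        )
    except OSError as exc:
        # Network-level failure (SMTP server down or unreachable): worth retrying.
        raise RetryableTaskError from exc
    # Anything else (invalid address, undecodable task data) propagates as a
    # permanent failure and would not be retried.
```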

Task status signalling in DatabaseBackend

Given portable transactions, in the DatabaseBackend, what are your expectations around when and how the task runner locks the task instances? If the transaction for a task which is Running fails, how do you update its status afterwards from Running to Failed without a race condition?
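
For what it's worth, the pattern I'd picture here (a sketch only - DBTaskResult, the fields and the status values are placeholders of mine, not anything from the DEP) is to claim the row with SELECT ... FOR UPDATE SKIP LOCKED in its own short transaction, run the task, then record the terminal status in another transaction so a failed task transaction can't roll the status update back:

```python
from django.db import transaction

from myapp.models import DBTaskResult  # hypothetical model


def claim_next_task():
    # Claim in a short transaction so the row lock isn't held while the
    # (possibly long) task body runs.
    with transaction.atomic():
        task = (
            DBTaskResult.objects.select_for_update(skip_locked=True)
            .filter(status="NEW")
            .order_by("enqueued_at")
            .first()
        )
        if task is None:
            return None
        task.status = "RUNNING"
        task.save(update_fields=["status"])
    return task


def record_outcome(task, succeeded):
    # A separate transaction: even if the task's own transaction aborted,
    # this RUNNING -> FAILED update still commits.
    with transaction.atomic():
        task.status = "COMPLETE" if succeeded else "FAILED"
        task.save(update_fields=["status"])
```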

Entry cleanup in DatabaseBackend

Task records are presumably not left around forever, otherwise the amount of data used grows without bound. What's the strategy to clean up task records for tasks that have completed, and particularly what's the strategy to do this without a race condition?
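
The simplest shape of answer I can picture (again a sketch, with a hypothetical model and fields) is a periodic prune of rows already in a terminal state, which at least avoids racing with runners that only touch non-terminal rows:

```python
from datetime import timedelta

from django.utils import timezone

from myapp.models import DBTaskResult  # hypothetical model, as above


def prune_finished_tasks(keep_days=30):
    # Only rows already in a terminal state are deleted, so this doesn't
    # race with runners that only ever claim NEW/RUNNING rows.
    cutoff = timezone.now() - timedelta(days=keep_days)
    DBTaskResult.objects.filter(
        status__in=["COMPLETE", "FAILED"], finished_at__lt=cutoff
    ).delete()
```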

Final thoughts

I'm asking these questions not because I object to the proposal but exactly because I'd like to see this proposal succeed. Give me a shout if there's anything I can clarify or help with here.

prophile avatar Feb 09 '24 14:02 prophile

Thank you so much for putting this together, @RealOrangeOne! This has long been on my personal Django wishlist, so I'm absolutely stoked to see it moving forward.

I've just had a quick read, so I don't have super-detailed comments, but I do have a couple of big-picture questions that I don't see addressed, and that I think need to have some answers:

  1. How "far" do you picture this framework going? That is: Celery (to pick the most full-featured Python queue) has just a huge ton of features. There's probably a point at which if you need some of those things, you should just be using Celery. I think having some idea of where the "line" is going to be would be helpful.

    Personally, I think whatever's built into Django should aim to be super-simple: minimal configuration and a minimal-but-useful feature set. Beyond that, I think we should be pushing people to more full-featured external libraries. That is, for me, a built-in task queue needs to handle two major use-cases:

    • Facilitate rapid early development - when you're starting a new app, having to set up a more complex task queue can be a barrier. Having something built-in means you can quickly do background email sending (or whatever) without too much faff. This is the "... with deadlines" part of the Django tagline, right?
    • Give third-party apps a common baseline, so they can expect a task queue without needing to commit to a specific dependency. This is the Wagtail use-case, I think, right?

    There may be other use-cases -- but the point is I think we need to enumerate them to avoid feeling like we need to build something feature-competitive with Celery or whatever.

  2. Related: what is the "transition path" for moving from the built-in task queue to something else? Some amount of refactoring is unavoidable and fine, but I think the DEP should address, at least a bit, how we expect users to transition if they outgrow what the built-in queue offers.

jacobian avatar Feb 09 '24 15:02 jacobian

How "far" do you picture this framework going?

Nowhere near Celery, at least not for a while. The vast majority of this proposal is about writing the glue which sits between Django and tools like Celery, so developers and library maintainers have a single interface for using queues, without needing to worry about the implementation details.

The biggest risk of scope-creep for this is definitely the DatabaseBackend, as it actually tries to implement a task queue (which, as @prophile outlined above, is quite difficult). I agree - I see DatabaseBackend as being perfect for anything which just needs a sensible and performant job queue. If people already know they need Celery, it's absolutely not the backend for them, and I don't think it should try to be. With that said, I'd like to avoid intentionally nerfing the DatabaseBackend - just because it's optimised for smaller deployments doesn't mean it shouldn't be capable of handling a very large number of tasks concurrently, and operating in a sensible and expected way. I suspect the target audience for DatabaseBackend and a SQL-based queue is fairly well aligned with those who may choose something like PostgreSQL FTS over something like ElasticSearch. Sure, ElasticSearch is probably better for those 10% of users who really need it, but that doesn't mean the other 90% won't be perfectly happy with Postgres, and they probably wouldn't benefit from ElasticSearch anyway.

I think the DatabaseBackend helps people both get moving quickly and scale up fairly far without needing to worry too much. Once someone does need to move, there's little application code which needs changing to move over to something like Celery, should its wealth of features be needed.

what is the "transition path" for moving from the built-in task queue to something else

I think there are 2 things which would need to transition:

Code-wise, I'd like there to be as close to zero changes needed to swap out the backend (up to a certain point). If implementation details are encoded into how tasks are queued (e.g. the need for transaction.on_commit for non-DB queues), then that may need changing. Otherwise, I see a world where the only changes are those in settings.py, much like we have now with django.core.cache.
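
To illustrate the kind of swap I have in mind (the setting name and backend paths below are placeholders mirroring the CACHES convention, not quotes from the proposal):

```python
# settings.py with the built-in backend (names are illustrative):
TASKS = {
    "default": {
        "BACKEND": "django.tasks.backends.database.DatabaseBackend",
    },
}

# Outgrowing it should ideally mean only swapping the backend path for a
# third-party one, much like swapping CACHES backends:
TASKS = {
    "default": {
        "BACKEND": "celery_tasks_backend.CeleryBackend",  # hypothetical package
    },
}
```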

As for the queue storage itself, I think that's a little more complex. Exactly how to, say, move a bunch of tasks from RQ to Celery and RabbitMQ is probably quite a difficult story, which doesn't have an obvious answer. However, I don't think that needs to be on Django to solve, since it's only providing the glue between the runner and Django, as opposed to a library for other task runners to integrate with (other than implementing this interface, I'd like there to be no changes Celery would need to make to support this). The easiest transition path might be running both queues in parallel for a while until the first queue is empty, and then stopping its runners.

RealOrangeOne avatar Feb 09 '24 15:02 RealOrangeOne

I love this, thank you @RealOrangeOne. A couple of thoughts:

Passing a plain callable to enqueue and defer is a nice interface, but it has some downsides. We can only really enqueue/delay globally importable functions: no lambdas, no local functions, etc. Handling this at either the callsite or the receiver has downsides, and it's generally confusing to new users. What do we do when a user tries to enqueue the callable some_model_instance.send_email?
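
For instance (the enqueue calls below are hypothetical, since the signature isn't settled), the usual workaround is to require a module-level function and pass a reference:

```python
from django.contrib.auth.models import User


# Not enqueueable: a bound method has no stable import path, and the
# instance itself would have to be serialized along with it.
#     backend.enqueue(some_model_instance.send_email)

# Workaround: a module-level function that takes a primary key and
# re-fetches the instance when the task actually runs.
def send_user_email(user_id):
    user = User.objects.get(pk=user_id)
    user.email_user("Hello", "Your report is ready.")

# backend.enqueue(send_user_email, args=[some_model_instance.pk])
```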

Should we add some concept of ownership/queue/routing to the task model itself? Given a task ID how do I know what queue it was sent to, or how would I send a task to a specific queue? Defining lots of duplicate backends (1 per queue) seems redundant and not very performant.

An execution of a task is a separate thing from the task itself - a task might have 0 or more executions (i.e. retries, timeouts, etc.). The current model seems to mix both of these together - is it worth making a distinction?

Can we/should we tie enqueuing into “transaction.on_commit” by default? Seems like a big footgun people will run into almost immediately.

I personally hate enqueue and defer - they are pretty widely used, but run_later and run_at seem more approachable. Just my personal opinion, but I wonder if anyone else agrees.

orf avatar Feb 09 '24 18:02 orf

I suspect the target audience for DatabaseBackend and a SQL-based queue is fairly well aligned with those who may choose something like PostgreSQL FTS over something like ElasticSearch. Sure, ElasticSearch is probably better for those 10% of users who really need it, but that doesn't mean the other 90% won't be perfectly happy with Postgres, and they probably wouldn't benefit from ElasticSearch anyway.

This is fantastic framing, and something that should probably end up in the eventual docs for this. Super-well-said!

jacobian avatar Feb 10 '24 21:02 jacobian

We can only really enqueue/delay globally importable functions

I don't think I can encode this in the type signature, but yes, the function will need to be globally importable. I suspect we may need to validate this just before triggering. Whilst certain backends may support non-global callables, most won't.

Should we add some concept of ownership/queue/routing to the task model itself

Yes, absolutely we should. A Task should probably know all the attributes which were passed to the enqueue call which triggered it.

An execution of a task is a separate thing to the task itself

Currently, these are one and the same. The current implementation doesn't support retries directly, so retries would likely be separate tasks.

Can we/should we tie enqueuing into “transaction.on_commit” by default?

I don't think so. It's a foot-gun in some cases, but might end up being confusing in others. In some cases, you might not care about having objects committed to the DB, so it just delays the task unnecessarily. I think in the first-pass, we can solve this with documentation.

I personally hate enqueue and defer

Naming things is hard. I'm not attached to them by any means.

RealOrangeOne avatar Feb 13 '24 12:02 RealOrangeOne

Should we add some concept of ownership/queue/routing to the task model itself

Yes, absolutely we should. A Task should probably know all the attributes which were passed to the enqueue call which triggered it.

I think that queuing under the backend interface described would rightly be an implementation detail of the backend, and it can actually be reasonably capable. The backend, for example, can separate callables into queues based on their import path. That would get a lot of the projects I've worked on a good deal of mileage, supporting the common case of separating long-running callables from short-lived ones.
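
A sketch of the kind of thing I mean (purely hypothetical configuration, not part of the proposal):

```python
# Hypothetical backend-level configuration: route callables to named queues
# by import-path prefix, with a default fallback.
TASK_ROUTES = {
    "reports.tasks.": "slow",        # long-running report generation
    "notifications.tasks.": "fast",  # short-lived emails and pushes
}


def queue_for(func):
    path = f"{func.__module__}.{func.__qualname__}"
    for prefix, queue in TASK_ROUTES.items():
        if path.startswith(prefix):
            return queue
    return "default"
```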

If ownership/routing is anything like exchanges and routing keys in Celery / AMQP, I think those are advanced use-cases that we don't want to get into.

ryanhiebert avatar Feb 25 '24 07:02 ryanhiebert

I would rather we don't use pickle for argument serialization, and instead use JSON. There are too many problems with pickle:

  1. Protocol version changes between Python versions complicate upgrading, since you need to lose tasks or temporarily try both versions.
  2. Enqueueing data rather than references - That is, enqueueing a task with a model instance rather than its ID. If you save the queued model instance within the task, it erases any intermediate changes in the database (see the sketch after this list). I've seen this problem so many times that I put it first in my post "Common Issues Using Celery (And Other Task Queues)".
  3. Security problems - per the big red box in the pickle docs. Yes, the task storage should be trusted, just like cache storage, but if we can just avoid pickle altogether there will never be an issue.
  4. Slower performance - pickle makes it easy to put lots of data into a task argument when you only need a bit, such as a whole model instance when you only need an ID. It’s also not the fastest to deserialize because it runs a whole stack machine.
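
To sketch point 2 concretely (the enqueue calls here are placeholders for whatever API lands):

```python
from django.contrib.auth.models import User


# With pickle it's tempting to enqueue the instance itself:
#     deactivate_account.enqueue(user)
# but inside the task, user is a stale snapshot - calling user.save() there
# silently overwrites any fields changed in the database since enqueue time.

# Enqueue a reference instead, and re-fetch the current state in the task:
def deactivate_account(user_id):
    user = User.objects.get(pk=user_id)
    user.is_active = False
    user.save(update_fields=["is_active"])

# deactivate_account.enqueue(user.pk)
```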

The serialization should be swappable, perhaps by subclassing and replacing the serialize/deserialize methods. That way users can use pickle or whatever if they really need to.

adamchainz avatar Apr 04 '24 14:04 adamchainz

The serialization should be swappable

I had intentionally not included support for this. If a given backend doesn't support pickle-based serialization, shoe-horning it in can be quite difficult and unpredictable. If a library maintainer also assumes pickle serialization, their library won't work in environments where the pickle constraints do matter. And that's before we think of a transition path.

Going with JSON (at least at first) has a few benefits:

  • It avoids the pickle issues
  • If it can be JSON serialized, it can be pickled (the inverse isn't true)
  • It bypasses a number of odd behaviours where someone accidentally passes something to a task which shouldn't be passed (i.e. something which isn't thread-safe or depends on external state, like NamedTemporaryFile).
  • Keeping the format JSON serializable can be helpful when integrating with external monitoring tools

I'm not opposed to swappable serializers in future, but JSON-only for the initial pass is probably the way to go. Starting restrictive and then expanding support is going to be better for adoption than being vague now, or having to restrict things later.

RealOrangeOne avatar Apr 04 '24 16:04 RealOrangeOne

The serialization should be swappable, perhaps by subclassing and replacing the serialize/deserialize methods. That way users can use pickle or whatever if they really need to.

JSON as the default serialization sidesteps some problems and avoids the issues around serializing instances, but not being able to serialize a datetime in an argument by default is quite unfortunate. I don't know whether it's worth a default workaround though.
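
One partial workaround that exists today - not a complete answer, since decoding gives you strings back - is Django's own extended encoder:

```python
import json
from datetime import datetime, timezone

from django.core.serializers.json import DjangoJSONEncoder

# DjangoJSONEncoder knows how to emit datetimes (plus Decimals, UUIDs, ...)
# as ISO 8601 strings:
payload = json.dumps(
    {"args": [datetime(2024, 4, 7, 6, 0, tzinfo=timezone.utc)], "kwargs": {}},
    cls=DjangoJSONEncoder,
)

# The catch: json.loads returns the string, not a datetime, so the task (or
# the framework) would still need to parse it on the way back out.
args = json.loads(payload)["args"]  # ['2024-04-07T06:00:00Z']
```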

prophile avatar Apr 07 '24 06:04 prophile