dramatiq icon indicating copy to clipboard operation
dramatiq copied to clipboard

Add support for ack/nack results so middleware can manage.

Open jenstroeger opened this issue 2 years ago • 3 comments

This PR is a draft and discussion. I’m still testing it, and wanted to gather feedback early.

Background

When using distributed RabbitMQ hosted by AmazonMQ’s service then connection losses are a regular problem due to Amazon’s maintenance window: hosts shut down whether messages are queued or in flight, which in turn triggers automatic message requeueing and failing message ack due to delivery tag changes after the connection reestablishes.

Proposed implementation

I think a Dramatiq middleware is able to manage these scenarios, even for stateful and non-transacted actor functions. However, the middleware needs to know whether a message was ack’ed successfully or not, which is what this PR addresses. I added that ack/nack result as a kwarg to maintain compatibility with existing middlewares.

@Bogdanp, I’m curious what you make of this change, and your experience with & thoughts on the mentioned AmazonMQ service (if any).

jenstroeger avatar Aug 13 '21 01:08 jenstroeger

I think there are two problems with this change, both related to backwards-compatibility:

  • After this change, external brokers (like dramatiq_sqs or dramatiq-pg) will incorrectly report that "no action was taken" after ack/nack. At least those two should also be updated.
  • Making the new ack/nack args keywords is not enough for backwards-compatibility. Passing in an unhandled kwarg to an existing middleware will end up raising a TypeError. To preserve backwards-compatibility, we'd have to detect whether or not the middleware handlers accept the new kwargs, which seems like it could get a little ugly.

I have no experience with AmazonMQ so I can't help there. With other managed AWS services that have a maintenance window, I've generally turned it off and handled upgrading manually precisely because I wanted to avoid these sorts of unexpected service interruptions.

Bogdanp avatar Aug 16 '21 05:08 Bogdanp

After this change, external brokers (like dramatiq_sqs or dramatiq-pg) will incorrectly report that "no action was taken" after ack/nack. At least those two should also be updated.

Yup, I had to adjust the existing Redis, Stub, and RabbitMQ brokers. I’m happy to provide a PR for those two brokers as well.

Also, I think that success=None should be interpreted as “Not known whether message was n/acked” instead of “No action was taken”.

Making the new ack/nack args keywords is not enough for backwards-compatibility. […] it could get a little ugly.

I agree, and it’s not a route I’d want to go. This PR is a breaking change and as per conventional commits would warrant a major version bump. I have a feeling that you may not be open to it at this point 😉 Perhaps this change is better suited together with other features for a v2 release?

I have no experience with AmazonMQ so I can't help there. With other managed AWS services that have a maintenance window, I've generally turned it off and handled upgrading manually precisely because I wanted to avoid these sorts of unexpected service interruptions.

I’ve contacted AWS twice now and raised this issue, precisely because their decision of a forced maintenance breaks their commitment of high availability. But even if they do ack (pun intended) the problem and set out to fixing it, I’m not going to wait around because the problem is happening now.

The alternative to stabilizing async workers is to abandon AmazonMQ/RabbitMQ in favor of another queue implementation. @Bogdanp, what’s your experience with SQS (considering you implemented dramatiq_sqs)?

jenstroeger avatar Aug 16 '21 07:08 jenstroeger

Looks like v1.12 and its redelivered flag for RabbitMQ messages helps address the initial issue as well 🤓

jenstroeger avatar Oct 23 '21 08:10 jenstroeger