broadway_rabbitmq

ACK timeout kills connection without getting restarted

Open D4no0 opened this issue 4 years ago • 4 comments

versions: broadway: 1.0.0 broadway_rabbitmq: 0.7.0 amqp: 2.1 elixir: 1.12 otp: 24.0.5

I have some long-running tasks that sometimes exceed RabbitMQ's consumer_timeout, which produces this message:

09:38:26.044 [warn] AMQP channel went down with reason: {:shutdown, {:server_initiated_close, 406, "PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 7200000 ms. This timeout value can be configured, see consumers doc guide to learn more"}}

The expected behavior would be to reestablish a new connection, kill the timed-out processors, and have RabbitMQ redeliver the messages.

The current behavior is that the GenServer is killed and Broadway can no longer send messages to RabbitMQ. The only way to fix this is to restart the Broadway process.
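
One way to keep a long-running task from ever reaching the consumer timeout is to bound the work on the client side, so the message can be failed before RabbitMQ closes the channel. The following is a minimal sketch, not something taken from Broadway itself: the `Deadline` module and its `run/2` function are hypothetical names, and the timeout values are placeholders that would need to sit well below the configured consumer_timeout.

```elixir
defmodule Deadline do
  # Hypothetical helper (not part of Broadway or broadway_rabbitmq).
  # Runs `fun` in a task and gives up after `timeout_ms` milliseconds.
  def run(fun, timeout_ms) when is_function(fun, 0) do
    task = Task.async(fun)

    # Task.yield/2 returns nil on timeout; Task.shutdown/2 then stops the task
    # and still returns {:ok, reply} if a reply arrived in the meantime.
    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> {:ok, result}
      # Only reachable when the caller traps exits, because the task is linked.
      {:exit, reason} -> {:error, {:exit, reason}}
      nil -> {:error, :timeout}
    end
  end
end

# Fast work completes normally.
{:ok, 2} = Deadline.run(fn -> 1 + 1 end, 1_000)

# Slow work is cut off instead of holding the message unacknowledged.
{:error, :timeout} = Deadline.run(fn -> Process.sleep(500) end, 50)
```

Inside a `handle_message/3` callback, an `{:error, :timeout}` result could be turned into a failed message with `Broadway.Message.failed/2`, so the message is rejected while the channel is still alive.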

D4no0 avatar Sep 24 '21 09:09 D4no0

Thanks for the report. The log you are seeing immediately causes the client to reconnect, so I am assuming there is something more at play here: https://github.com/dashbitco/broadway_rabbitmq/blob/master/lib/broadway_rabbitmq/producer.ex#L527-L536

josevalim avatar Sep 24 '21 10:09 josevalim

The error that follows comes from a GenServer call made after the GenServer has gone down:

07:57:38.686 [error] ** (exit) exited in: :gen_server.call(#PID<0.6829.0>, {:call, {:"basic.ack", 6, false}, :none, #PID<0.2649.0>}, 70000) ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

D4no0 avatar Sep 27 '21 08:09 D4no0

So, we ack from a different process than the RabbitMQ producer (the processor or batcher acks). I don't think we can "save" the ack if the channel is down. What we can do, however, is have a better error message from Broadway, which is what I did with #122. I think for now that's pretty much it. 😞 Eventually the producer should reconnect.
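
To illustrate why the ack fails in the acking process rather than in the producer: a `GenServer.call/3` to a process that has already died exits in the caller. The sketch below reproduces that with a stopped `Agent` standing in for the dead channel process and shows how the exit can be caught and converted into an error tuple. `SafeCall` is a hypothetical name used only for this example, and this is not how #122 is implemented.

```elixir
defmodule SafeCall do
  # Hypothetical wrapper: returns an error tuple instead of letting the
  # caller exit when the target process is no longer alive.
  def call(pid, request, timeout \\ 5_000) do
    {:ok, GenServer.call(pid, request, timeout)}
  catch
    :exit, reason -> {:error, reason}
  end
end

# Stand-in for a channel process that has been shut down.
{:ok, pid} = Agent.start(fn -> 0 end)
:ok = Agent.stop(pid)

# The call exits with a "no process" reason, which is caught here.
{:error, _reason} = SafeCall.call(pid, :ack)
```

Catching the exit only keeps the caller alive. The acknowledgement itself is still lost, because AMQP delivery tags are scoped to the channel that delivered the message.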

whatyouhide avatar Feb 16 '23 07:02 whatyouhide

We ran into a related issue: an ack timeout occurs, the RabbitMQ server closes the channel, and while Broadway reconnects there are several log messages about being unable to ack/reject messages because the channel is dead. More ack timeouts then pile up every 30 minutes (the default RabbitMQ consumer timeout), and eventually we see a lot of channel reconnects. The worst part is that RabbitMQ appears to keep all Mnesia segments containing unacked messages; with a 30-minute timeout and high throughput, this eats disk space quickly. We are going to try a shorter timeout, since our ingestion is intended to be fast.

Regarding the topic: does it make any sense to retry ack/reject several times when the channel is not alive? Another option would be to at least give some control over the messages Broadway is unable to ack/reject, something like handle_ack_error or so.
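
As a rough illustration of both ideas, the sketch below retries an ack a bounded number of times and then hands the failure to a user-supplied callback. Everything here is hypothetical: `AckRetry`, `ack_with_retry/2`, and the `:attempts`, `:delay_ms`, and `:on_failure` options are invented for the example and are not Broadway or broadway_rabbitmq APIs. Note also that a retry can only help with a transient failure; once the channel is closed, its delivery tags are no longer valid, so the final callback is the part that would usually matter.

```elixir
defmodule AckRetry do
  # Hypothetical helper, not part of Broadway or broadway_rabbitmq.
  # `ack_fun` is a zero-arity function that performs the ack and may exit.
  def ack_with_retry(ack_fun, opts \\ []) when is_function(ack_fun, 0) do
    attempts = Keyword.get(opts, :attempts, 3)
    delay_ms = Keyword.get(opts, :delay_ms, 100)
    on_failure = Keyword.get(opts, :on_failure, fn _reason -> :ok end)
    do_ack(ack_fun, attempts, delay_ms, on_failure)
  end

  defp do_ack(ack_fun, attempts_left, delay_ms, on_failure) do
    ack_fun.()
    :ok
  catch
    # Attempts remain: wait briefly and try again.
    :exit, _reason when attempts_left > 1 ->
      Process.sleep(delay_ms)
      do_ack(ack_fun, attempts_left - 1, delay_ms, on_failure)

    # Out of attempts: let the caller decide what to do with the failure.
    :exit, reason ->
      on_failure.(reason)
      {:error, reason}
  end
end

# A stopped Agent stands in for a dead channel process.
{:ok, pid} = Agent.start(fn -> 0 end)
:ok = Agent.stop(pid)

{:error, _reason} =
  AckRetry.ack_with_retry(
    fn -> GenServer.call(pid, :ack) end,
    attempts: 2,
    delay_ms: 10,
    on_failure: fn reason -> IO.inspect(reason, label: "ack failed") end
  )
```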

v-anyukov avatar Apr 12 '24 18:04 v-anyukov