broadway_rabbitmq
ACK timeout kills connection without getting restarted
versions: broadway: 1.0.0 broadway_rabbitmq: 0.7.0 amqp: 2.1 elixir: 1.12 otp: 24.0.5
I have some long-running tasks that may sometimes exceed RabbitMQ's consumer_timeout, which produces this message:
09:38:26.044 [warn] AMQP channel went down with reason: {:shutdown, {:server_initiated_close, 406, "PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 7200000 ms. This timeout value can be configured, see consumers doc guide to learn more"}}
The expected behavior would be to establish a new connection, kill the timed-out processors, and have RabbitMQ redeliver the messages.
The current behavior is that the GenServer is killed and Broadway can no longer send messages to RabbitMQ. The only fix is to restart the Broadway process.
Thanks for the report. The log you are seeing immediately causes the client to reconnect, so I am assuming there is something more at play here: https://github.com/dashbitco/broadway_rabbitmq/blob/master/lib/broadway_rabbitmq/producer.ex#L527-L536
The error after that comes from a GenServer call made while the GenServer is down:
07:57:38.686 [error] ** (exit) exited in: :gen_server.call(#PID<0.6829.0>, {:call, {:"basic.ack", 6, false}, :none, #PID<0.2649.0>}, 70000) ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
So, we ack from a different process than the RabbitMQ producer (the processor or batcher acks). I don't think we can "save" the ack if the channel is down. What we can do, however, is have a better error message from Broadway, which is what I did with #122. I think for now that's pretty much it. 😞 Eventually the producer should reconnect.
We ran into a related issue: ack timeout -> channel closed by the RabbitMQ server -> while Broadway reconnects, several log messages report that messages cannot be acked/rejected because the channel is dead -> more ack timeouts accumulate every 30 minutes (the default RabbitMQ consumer timeout) -> eventually we have a lot of channel reconnects. The worst part is that RabbitMQ appears to keep all Mnesia segments containing unacked messages; with a 30-minute timeout and high throughput, this eats disk space quickly. We are going to try a shorter timeout, since our ingestion is intended to be pretty fast.
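For reference, a minimal sketch of what lowering the timeout could look like in rabbitmq.conf. The `consumer_timeout` key takes a value in milliseconds; the 5-minute value here is only an illustration, not a recommendation, and should be tuned to the longest expected processing time:

```ini
# rabbitmq.conf
# Delivery acknowledgement timeout, in milliseconds.
# 300000 ms = 5 minutes (illustrative value, an assumption for this sketch).
consumer_timeout = 300000
```

Any message that takes longer than this to ack will cause the server to close the channel with the same 406 PRECONDITION_FAILED error shown in the log above, so the value needs to stay above the slowest legitimate processing time.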
Regarding the topic: does it make any sense to retry ack/reject several times when the channel is not alive? Another option would be to give at least some control over messages Broadway is unable to ack/reject, something like handle_ack_error.