Churn in classic mirrored DLQ
We observe a potential bug in our brokers.
At a certain load, a classic mirrored DLQ replica flips into CPU and memory churn and cannot recover. First reported here and later confirmed on multiple RabbitMQ versions.
With a large enough data set the instance is OOM-killed, but even on smaller sets I had the broker crossing into the red (memory alarm) and back for a long time after the test.
Reproduction
Environment
> cat ../rmq1.conf
listeners.tcp.default = 127.0.0.1:5671
management.tcp.port = 15671
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rmq2@3c22fb6527ee
> cat ../rmq2.conf
listeners.tcp.default = 127.0.0.1:5672
management.tcp.port = 15672
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rmq1@3c22fb6527ee
> echo $RABBITMQ_ENABLED_PLUGINS
rabbitmq_management
> alias brun "bazel run --config=local"
Two brokers
brun broker RABBITMQ_CONFIG_FILE=(realpath ../rmq1.conf) RABBITMQ_NODENAME=rmq1 RABBITMQ_DIST_PORT=25671
brun broker RABBITMQ_CONFIG_FILE=(realpath ../rmq2.conf) RABBITMQ_NODENAME=rmq2 RABBITMQ_DIST_PORT=25672
brun -- rabbitmqctl -n rmq1@3c22fb6527ee set_policy ha "." '{"ha-mode":"all","ha-sync-mode":"automatic","queue-mode":"lazy"}' --apply-to queues
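To double-check the policy before starting the load, something along these lines should work (same node name and brun alias as above; the second command is only useful once PerfTest has declared the queues):
brun -- rabbitmqctl -n rmq1@3c22fb6527ee list_policies
# once "input" and "dlq" exist, confirm they picked up mirrors
brun -- rabbitmqctl -n rmq1@3c22fb6527ee list_queues name policy slave_pids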
Long message backlog nacked between queues
./bin/runjava com.rabbitmq.perf.PerfTest -u "input" -h amqp://localhost:5671 --auto-delete false -f persistent --queue-args x-dead-letter-exchange=,x-dead-letter-routing-key=dlq -x 4 -y 0 -s 1000 -pmessages 250000
# "dlq" created beforehand, say in web console
./bin/runjava com.rabbitmq.perf.PerfTest -u "input" -h amqp://localhost:5671 --auto-delete false -f persistent --queue-args x-dead-letter-exchange=,x-dead-letter-routing-key=dlq -x 0 -y 10 -na -re false
Observations
The mirror (slave) replica's memory footprint stays abnormally high, oscillating by +/- 2 GB, for many hours (up to 4 so far) after the test. There is a significant uptick in CPU as well.
queue_slave_procs: 13.3358 gb (62.38%)
binary: 6.2803 gb (29.37%)
allocated_unused: 1.6457 gb (7.7%)
other_proc: 0.0529 gb (0.25%)
code: 0.0348 gb (0.16%)
other_system: 0.0195 gb (0.09%)
other_ets: 0.0037 gb (0.02%)
plugins: 0.0031 gb (0.01%)
atom: 0.0014 gb (0.01%)
metrics: 0.0014 gb (0.01%)
mgmt_db: 0.001 gb (0.0%)
connection_other: 0.0001 gb (0.0%)
mnesia: 0.0001 gb (0.0%)
msg_index: 0.0 gb (0.0%)
quorum_ets: 0.0 gb (0.0%)
queue_procs: 0.0 gb (0.0%)
quorum_queue_dlx_procs: 0.0 gb (0.0%)
stream_queue_procs: 0.0 gb (0.0%)
stream_queue_replica_reader_procs: 0.0 gb (0.0%)
connection_readers: 0.0 gb (0.0%)
connection_writers: 0.0 gb (0.0%)
connection_channels: 0.0 gb (0.0%)
quorum_queue_procs: 0.0 gb (0.0%)
stream_queue_coordinator_procs: 0.0 gb (0.0%)
reserved_unallocated: 0.0 gb (0.0%)
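For reference, a breakdown like the one above can be pulled from the mirror node with rabbitmq-diagnostics (assuming the sbin scripts are on PATH for that node; node name taken from the environment above):
rabbitmq-diagnostics -n rmq2@3c22fb6527ee memory_breakdown --unit gb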
From the couple of hours I spent on this:
- Confirmed as low as 1M messages. On my laptop 3M is enough to trigger memory alarm.
- Confirmed for 1-20 consumers.
- Confirmed for messages as small as 1 byte.
- The broker recovers after 1-2 minutes of churn:
  - at 250k messages
  - with non-lazy queues
  - after deleting "dlq"
And a note from @thuandb (queue depth mismatch because of a different env):
It is confirmed that IO is the bottleneck causing high memory utilization during queue sync. Using gp3 with maximum throughput (1000 MB/s) helps with large messages. Message size: 640k bytes; queue depth: 115k messages; sync batch size: 256. Memory only slightly increases at mirrors.
This may or may not be related; the symptoms here remind me of this message: https://groups.google.com/g/rabbitmq-users/c/-_J_K4T7yTI/
What RabbitMQ version was used?
My test was run from a somewhat stale master, at e2edb1b. In prod this was observed on 3.8.22 and 3.8.27.
Updated the description to remove the mention of allocs: I haven't actually looked at allocations yet. Memory may fluctuate because of paging.
I have put together a reproduction project here:
https://github.com/lukebakken/rabbitmq-server-5312
It uses RMQ 3.10.6 and the latest PerfTest.
I can reproduce the issue as described.
Workarounds that address the issue:
- Make dlq a quorum queue (https://github.com/lukebakken/rabbitmq-server-5312/tree/dlq-as-quorum)
- Make all queues quorum queues (https://github.com/lukebakken/rabbitmq-server-5312/tree/all-as-quorum)
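For the dlq-as-quorum variant, the pre-created dlq just needs x-queue-type set to quorum. A rough sketch using rabbitmqadmin instead of the web console (not the exact command from the repro project; management port taken from the environment above):
rabbitmqadmin --port=15671 declare queue name=dlq durable=true arguments='{"x-queue-type":"quorum"}'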
Using CQv2 or quorum queues solves the problem => closing.
We still have this issue with v2 classic queues. Will try to reproduce next week.
Putting this here since the issue seems to be the same - https://groups.google.com/g/rabbitmq-users/c/Kr2QZx1TREA
So in total, two reports about this still happening in CMQv2? @johanrhodin Looking forward to reproduction steps as well.
I've also found another workaround - switching queue-mode from normal to lazy for queues that were causing problems solved the issue for me.
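In case it helps, queue-mode can be set via policy; note that only one policy applies to a given queue, so in a mirrored setup the key has to go into the existing ha policy definition (as in the repro policy above) rather than a separate one. A sketch for the non-mirrored case (policy name and pattern are just examples):
rabbitmqctl set_policy lazy-queues "^myapp\." '{"queue-mode":"lazy"}' --apply-to queues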
FWIW I just revisited this using the latest main code for RabbitMQ and PerfTest. Turns out now I can reproduce this using mirrored classic v2 queues.
As expected quorum queues do not show this issue (using the latest main for RabbitMQ and PerfTest as well).
cc @michaelklishin any reason to keep this open at this point?
We can leave it open if some CMQ user would like to investigate and contribute a fix.
What immediately stands out on the screenshot is the gen_server2 buffer size for that CMQ mirror replica.
I believe the problem comes from the way at-most-once dead lettering works. For quorum queues, we added an at-least-once strategy that uses publisher confirms internally, but classic queues don't support that. When a lot of messages are dead-lettered, we are effectively publishing to a mirrored queue without any flow-control mechanism, which overloads the DLQ process and causes a memory spike.
Assuming that's true, we would need classic queues to support at-least-once dead-lettering. That's something CQv2 should probably support at some point, but I'm not sure whether the same implementation would work for CQv1, nor how much work would be needed to support it with mirroring.
Alternatively, perhaps it's possible to apply back-pressure using some other mechanism, to prevent the DLQ from getting overloaded.
Feel free to have a look at the QQ at-least-once support PR for inspiration:
https://github.com/rabbitmq/rabbitmq-server/issues/3100
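For readers who land here: with quorum queues the at-least-once behaviour is opt-in and is configured on the source queue (not the dlq); it also requires reject-publish overflow. A rough sketch with rabbitmqadmin, reusing the queue names from this repro (illustrative only, not what the repro steps actually do):
rabbitmqadmin declare queue name=input durable=true arguments='{"x-queue-type":"quorum","x-overflow":"reject-publish","x-dead-letter-exchange":"","x-dead-letter-routing-key":"dlq","x-dead-letter-strategy":"at-least-once"}'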
@michaelklishin @mkuratczyk interestingly enough, the gen_server2 internal priority queue (gen_server2 buffer in the top UI) does eventually drain, restoring memory usage to normal levels.
~Sorry, I think I jumped to a conclusion here. The process that takes so much memory is the input queue, not the dlq. Given that, my theory that we are overloading the dlq by publishing to it without any flow control is wrong. Which doesn't mean the current naive CQ dead-lettering process isn't the culprit, but it's definitely not what I thought it was.~
Scratch that. I must have run the test incorrectly (I guess I was still publishing to the input queue). Now I see the dlq queue using lots of memory, which is what I'd expect.
@illotum @johanrhodin could you please test the cmq-optimisations branch, also available as the OCI image pivotalrabbitmq/rabbitmq:cmq-optimisations-otp-max-bazel?
For me it roughly doubles the publish rate during the first phase of the test and, I assume, when messages are published internally to the dlq. Thanks to this, it also seems to lower memory usage by at least half. On my system, that's the difference between an OOM kill and smooth operation, albeit still with relatively high memory usage. Is such an improvement sufficient?
Note: when using CQv2, it's also useful to increase the dist_buf_busy_limit. CQv2 is so much faster that it can easily overload the distribution channel and then we have a "hiccup" during the consumption phase.
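In case it's useful to others: one way to raise that limit is the RABBITMQ_DISTRIBUTION_BUFFER_SIZE environment variable (value in kB, maps to the Erlang +zdbbl flag, default 128000); reusing the broker invocation from the repro above, with 1024000 as an illustrative value only:
brun broker RABBITMQ_CONFIG_FILE=(realpath ../rmq2.conf) RABBITMQ_NODENAME=rmq2 RABBITMQ_DIST_PORT=25672 RABBITMQ_DISTRIBUTION_BUFFER_SIZE=1024000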
Tested with CQv1, CQv1 lazy and CQv2. Each of these hit a memory alarm/OOM kill using the main branch, while the results with this branch are better (although, again, the memory used is still high):

One more thing worth sharing: ultimately the problem is that we don't use publisher confirms when internally publishing to classic queues (which is what happens during dead-lettering). I came up with a similar scenario without mirroring: put the input queue on one node and the dlq on another node and follow the same steps, just without the mirroring policy. Since mirroring makes this even worse (it is responsible for part of the memory usage), you will need to increase the number/size of the messages, but there are combinations of values that will lead to the dead-letter queue being overloaded, even without mirroring.
I think all of this just proves what we knew: we need publisher confirms / flow control for internal publishing/dead-lettering to classic queues, just like we now do for quorum queues.
@mkuratczyk your branch shows significant improvement across all my tests. I can still see the spike, but it's much harder to cross into alarm now.
I have made one more commit to the branch to bring back some parts I removed previously, as it showed some performance regression with larger (not embedded) messages. I will continue testing next week but I'd also appreciate another round of testing on your side. Especially with workloads that are different than the reproduction steps of this issue, as the memory usage reduction in this scenario is significant and consistent, but the change affects mirroring in general, so I want to be careful with this, as lots of users may be affected.
Not too rigorous, but I've been running various sync scenarios over the last couple of days: cluster member loss, DLQ, lazy and not, various message sizes. Tests show a similar or better profile in all cases (the improvement is often insignificant); so far I see no degradation.
@mkuratczyk anything else I can help with re your performance branch?
@illotum the core team will discuss what's ahead for that branch. Thank you for the details and for not letting this slip.
@illotum What versions would you expect this to be back-ported to? While we didn't find any regressions, we know RabbitMQ is used in so many ways/configurations that it could have some negative impact in some of them, so we are wary of back-porting this/releasing in a patch release (especially for "older" versions like 3.9 where very few changes happen at this point).
I'd be happy with 3.10, and fine with 3.11. We are only rolling out 3.11, so this would be a good release to pick as first.
3.11 is what the core team had in mind.
If it's possible for 3.10, I'd be very happy :)
@illotum My suggestion would be to ship it in 3.11 for now. Start using it, keep an eye on these instances and then, if you see the benefits and don't see regressions, we could consider backporting to 3.10.
I've opened a PR and shared some results of today's tests https://github.com/rabbitmq/rabbitmq-server/pull/6467
@mkuratczyk I would like to revisit the idea of back-porting this to the 3.10 branch! Feedback from us (AWS) on regressions (and benefits) will be easier to provide if this lands in 3.10, as we currently have much larger adoption of 3.10 in our fleet of brokers.
Is there anything particular you require from us, in terms of feedback, in order to backport to 3.10 sooner rather than later?