Churn in classic mirrored DLQ

illotum opened this issue 2 years ago • 12 comments

We observe a potential bug in our brokers

Classic mirrored DLQ replica, at a certain load, flips into CPU and memory churn and cannot recover. First reported here and later confirmed on multiple RabbitMQ versions.

With a large enough data set the instance is OOM-killed, but even on smaller sets I had the broker crossing into the memory alarm (red) and back for a long time after the test.

(Screenshot, 2022-07-20: memory usage graph oscillating on the mirror after the test)

Reproduction

Environment

> cat ../rmq1.conf
listeners.tcp.default = 127.0.0.1:5671
management.tcp.port = 15671
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rmq2@3c22fb6527ee

> cat ../rmq2.conf
listeners.tcp.default = 127.0.0.1:5672
management.tcp.port = 15672
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rmq1@3c22fb6527ee

> echo $RABBITMQ_ENABLED_PLUGINS
rabbitmq_management

> alias brun "bazel run --config=local"

Two brokers

brun broker RABBITMQ_CONFIG_FILE=(realpath ../rmq1.conf) RABBITMQ_NODENAME=rmq1 RABBITMQ_DIST_PORT=25671
brun broker RABBITMQ_CONFIG_FILE=(realpath ../rmq2.conf) RABBITMQ_NODENAME=rmq2 RABBITMQ_DIST_PORT=25672
brun -- rabbitmqctl -n rmq1@3c22fb6527ee set_policy ha "." '{"ha-mode":"all","ha-sync-mode":"automatic","queue-mode":"lazy"}' --apply-to queues

Long message backlog nacked between queues

./bin/runjava com.rabbitmq.perf.PerfTest -u "input" -h amqp://localhost:5671 --auto-delete false -f persistent --queue-args x-dead-letter-exchange=,x-dead-letter-routing-key=dlq -x 4 -y 0 -s 1000 -pmessages 250000

# "dlq" created beforehand, say in web console
./bin/runjava com.rabbitmq.perf.PerfTest -u "input" -h amqp://localhost:5671 --auto-delete false -f persistent --queue-args x-dead-letter-exchange=,x-dead-letter-routing-key=dlq -x 0 -y 10 -na -re false
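
Rather than creating "dlq" in the web console, it could also be pre-declared from the CLI. A minimal sketch, assuming the rabbitmqadmin tool from the management plugin enabled above and rmq1's management port (15671) from the config:

# hypothetical alternative to creating "dlq" in the web console
rabbitmqadmin --port 15671 declare queue name=dlq durable=true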

Observations

The mirror (slave) replica's memory footprint stays abnormally high, oscillating by +/- 2 GB, for many hours (up to 4 so far) after the test. There is a significant uptick in CPU usage as well.

queue_slave_procs: 13.3358 gb (62.38%)
binary: 6.2803 gb (29.37%)
allocated_unused: 1.6457 gb (7.7%)
other_proc: 0.0529 gb (0.25%)
code: 0.0348 gb (0.16%)
other_system: 0.0195 gb (0.09%)
other_ets: 0.0037 gb (0.02%)
plugins: 0.0031 gb (0.01%)
atom: 0.0014 gb (0.01%)
metrics: 0.0014 gb (0.01%)
mgmt_db: 0.001 gb (0.0%)
connection_other: 0.0001 gb (0.0%)
mnesia: 0.0001 gb (0.0%)
msg_index: 0.0 gb (0.0%)
quorum_ets: 0.0 gb (0.0%)
queue_procs: 0.0 gb (0.0%)
quorum_queue_dlx_procs: 0.0 gb (0.0%)
stream_queue_procs: 0.0 gb (0.0%)
stream_queue_replica_reader_procs: 0.0 gb (0.0%)
connection_readers: 0.0 gb (0.0%)
connection_writers: 0.0 gb (0.0%)
connection_channels: 0.0 gb (0.0%)
quorum_queue_procs: 0.0 gb (0.0%)
stream_queue_coordinator_procs: 0.0 gb (0.0%)
reserved_unallocated: 0.0 gb (0.0%)
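
For reference, a breakdown like the one above can be collected with the diagnostics CLI; a sketch using the node name from this setup (the --unit flag is assumed to be available in modern versions):

rabbitmq-diagnostics -n rmq2@3c22fb6527ee memory_breakdown --unit gb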

From the couple hours I spent on this:

  1. Confirmed as low as 1M messages. On my laptop 3M is enough to trigger memory alarm.
  2. Confirmed for 1-20 consumers.
  3. Confirmed for messages as small as 1 byte.
  4. The broker recovers after 1-2 minutes of churn
    • at 250k messages
    • non-lazy queues
    • after deleting "dlq"

illotum avatar Jul 21 '22 19:07 illotum

And a note from @thuandb (queue depth mismatch because of a different env):

It is confirmed that IO is the bottleneck causing high memory utilization during queue sync. Using gp3 with maximum throughput (1000 MB/s) helps with large messages. Message size: 640 KB; queue depth: 115k messages; sync batch size: 256. Memory only slightly increases on mirrors.
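
For context, the sync batch size mentioned above maps to the ha-sync-batch-size policy key for classic mirrored queues. A sketch of the reproduction policy adjusted to batches of 256 (the value is illustrative, taken from the note above, not a recommendation):

brun -- rabbitmqctl -n rmq1@3c22fb6527ee set_policy ha "." '{"ha-mode":"all","ha-sync-mode":"automatic","ha-sync-batch-size":256,"queue-mode":"lazy"}' --apply-to queues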

illotum avatar Jul 21 '22 19:07 illotum

This may or may not be related; the symptoms here remind me of this message: https://groups.google.com/g/rabbitmq-users/c/-_J_K4T7yTI/

lukebakken avatar Jul 21 '22 20:07 lukebakken

What RabbitMQ version was used?

michaelklishin avatar Jul 21 '22 20:07 michaelklishin

My test was run from a somewhat stale master, at e2edb1b. In prod this was observed on 3.8.22 and 3.8.27.

illotum avatar Jul 21 '22 21:07 illotum

Updated the description to remove the mention of allocs: I haven't actually looked at allocations yet. Memory may fluctuate because of paging.

illotum avatar Jul 21 '22 21:07 illotum

I have put together a reproduction project here:

https://github.com/lukebakken/rabbitmq-server-5312

It uses RMQ 3.10.6 and the latest PerfTest.

I can reproduce the issue as described.

Workarounds that address the issue (see the declaration sketch after this list):

  • Make dlq a quorum queue (https://github.com/lukebakken/rabbitmq-server-5312/tree/dlq-as-quorum)
  • Make all queues quorum queues (https://github.com/lukebakken/rabbitmq-server-5312/tree/all-as-quorum)
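
For the first workaround, the dlq can be declared as a quorum queue via the x-queue-type argument. A sketch assuming rabbitmqadmin and rmq1's management port from the reproduction (the linked branch wires this up through PerfTest instead):

rabbitmqadmin --port 15671 declare queue name=dlq durable=true arguments='{"x-queue-type":"quorum"}'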

lukebakken avatar Jul 23 '22 23:07 lukebakken

Using CQv2 or quorum queues solves the problem => closing.
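
For reference, CQv2 can be enabled per queue with the x-queue-version argument or per policy with the queue-version key (3.10+). A sketch that folds it into the reproduction's single ha policy, since only one policy applies to a queue at a time:

brun -- rabbitmqctl -n rmq1@3c22fb6527ee set_policy ha "." '{"ha-mode":"all","ha-sync-mode":"automatic","queue-version":2}' --apply-to queues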

michaelklishin avatar Jul 29 '22 09:07 michaelklishin

We still have this issue with v2 classic queues. Will try to reproduce next week.

johanrhodin avatar Aug 26 '22 21:08 johanrhodin

Putting this here since the issue seems to be the same - https://groups.google.com/g/rabbitmq-users/c/Kr2QZx1TREA

lukebakken avatar Sep 07 '22 17:09 lukebakken

So in total, two reports about this still happening in CMQv2? @johanrhodin Looking forward to reproduction steps as well.

illotum avatar Sep 07 '22 18:09 illotum

I've also found another workaround - switching queue-mode from normal to lazy for queues that were causing problems solved the issue for me.
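
A sketch of how such a switch is typically applied via policy; the policy name and queue pattern below are illustrative, not from this report. Note that only one policy applies per queue, so in a mirrored setup the queue-mode key would instead be added to the existing ha policy:

rabbitmqctl set_policy lazy-queues "^my-problem-queue$" '{"queue-mode":"lazy"}' --apply-to queues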

danielllek avatar Sep 09 '22 06:09 danielllek

FWIW I just revisited this using the latest main code for RabbitMQ and PerfTest. It turns out I can now reproduce this using mirrored classic v2 queues:

(Screenshot: top UI showing a large gen_server2 buffer on the CMQ mirror replica)

As expected quorum queues do not show this issue (using the latest main for RabbitMQ and PerfTest as well).

cc @michaelklishin any reason to keep this open at this point?

lukebakken avatar Oct 20 '22 22:10 lukebakken

We can leave it open if some CMQ user would like to investigate and contribute a fix.

What immediately stands out on the screenshot is the gen_server2 buffer size for that CMQ mirror replica.

michaelklishin avatar Oct 21 '22 05:10 michaelklishin

I believe the problem comes from the way at-most-once dead lettering works. For quorum queues, we added an at-least-once strategy that uses publisher confirms internally, but classic queues don't support that. When a lot of messages are dead-lettered, we are effectively publishing to a mirrored queue without any flow-control mechanism, which overloads the DLQ process, which causes a memory spike.

Assuming that's true, we would need classic queues to support at-least-once dead-lettering, which is something that CQv2 should probably support at some point. But I'm not sure whether the same implementation would work for CQv1, or how much work would be needed to support it with mirroring.

Alternatively, perhaps it's possible to apply back-pressure using some other mechanism, to prevent the DLQ from getting overloaded.

Feel free to have a look at QQ at-least-once support PR for inspiration: https://github.com/rabbitmq/rabbitmq-server/issues/3100
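
For comparison, a sketch of how the quorum-queue strategy mentioned above is enabled, using queue arguments (at-least-once requires overflow set to reject-publish; queue names mirror the reproduction, otherwise the values are illustrative):

rabbitmqadmin --port 15671 declare queue name=input durable=true arguments='{"x-queue-type":"quorum","x-dead-letter-exchange":"","x-dead-letter-routing-key":"dlq","x-dead-letter-strategy":"at-least-once","x-overflow":"reject-publish"}'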

mkuratczyk avatar Oct 21 '22 16:10 mkuratczyk

@michaelklishin @mkuratczyk interestingly enough, the gen_server2 internal priority queue (gen_server2 buffer in the top UI) does eventually drain, restoring memory usage to normal levels.

lukebakken avatar Oct 21 '22 17:10 lukebakken

~Sorry, I think I jumped to a conclusion here. The process that takes so much memory is the input queue, not the dlq. Given that, my theory that we are overloading the dlq by publishing to it without any flow control is wrong. Which doesn't mean the current naive CQ dead-lettering process isn't the culprit, but it's definitely not what I thought it was.~

Scratch that. I must have run the test incorrectly (I guess I was still publishing to the input queue). Now I see the dlq queue using lots of memory, which is what I'd expect.

mkuratczyk avatar Oct 21 '22 22:10 mkuratczyk

@illotum @johanrhodin could you please test the cmq-optimisations branch, also available as the OCI image pivotalrabbitmq/rabbitmq:cmq-optimisations-otp-max-bazel?

For me it roughly doubles the publish rate during the first phase of the test and, I assume, when messages are published internally to the dlq. Thanks to this, it also seems to lower memory usage by at least half. On my system, that's the difference between an OOM-kill and smooth operation, albeit still with relatively high memory usage. Is such an improvement sufficient?

Note: when using CQv2, it's also useful to increase dist_buf_busy_limit. CQv2 is so much faster that it can easily overload the distribution channel, and then we get a "hiccup" during the consumption phase.
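
A sketch of one way to raise that limit, assuming the RABBITMQ_DISTRIBUTION_BUFFER_SIZE environment variable (value in kilobytes), which RabbitMQ maps to the Erlang +zdbbl distribution buffer flag; 192000 is an arbitrary example above the 128000 default:

# fish syntax to match the setup above; export before starting the node
set -x RABBITMQ_DISTRIBUTION_BUFFER_SIZE 192000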

Tested with CQv1, CQv1 lazy, and CQv2. Each of these hit a memory alarm/OOM-kill using the main branch, while the results with this branch are better (although, again, the memory used is still high): (Screenshot, 2022-10-25: memory usage comparison between main and cmq-optimisations)

mkuratczyk avatar Oct 25 '22 14:10 mkuratczyk

One more thing worth sharing: ultimately the problem is that we don't use publisher confirms when internally publishing to classic queues (which is what happens during dead-lettering). I came up with a similar scenario without mirroring: put the input queue on one node and the dlq on another node, then follow the same steps, just without the mirroring policy. Since mirroring makes this even worse (it is responsible for part of the memory usage), you will need to increase the number/size of the messages, but there are combinations of values that will overload the dead letter queue even without mirroring.

I think all of this just proves what we knew: we need to have publisher confirms / flow control for internal publishing/dead-lettering to classic queues, just like we now do for quorum queues.
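
A rough sketch of that non-mirrored variant, reusing the two-node setup from the reproduction (assuming the default queue leader locator keeps each queue on the node it was declared on; message counts/sizes would need increasing, as noted):

brun -- rabbitmqctl -n rmq1@3c22fb6527ee clear_policy ha
# declare "dlq" through rmq2's management port so the two queues live on different nodes
rabbitmqadmin --port 15672 declare queue name=dlq durable=true
# then rerun the same PerfTest publish and consume steps against amqp://localhost:5671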

mkuratczyk avatar Oct 26 '22 13:10 mkuratczyk

@mkuratczyk your branch shows significant improvement across all my tests. I can still see the spike, but it's much harder to cross into alarm now.

illotum avatar Oct 26 '22 21:10 illotum

I have made one more commit to the branch to bring back some parts I had removed previously, as their removal showed a performance regression with larger (non-embedded) messages. I will continue testing next week, but I'd also appreciate another round of testing on your side, especially with workloads that differ from this issue's reproduction steps: the memory usage reduction in this scenario is significant and consistent, but the change affects mirroring in general, so I want to be careful, as lots of users may be affected.

mkuratczyk avatar Oct 28 '22 11:10 mkuratczyk

Not too rigorous, but I've been running various sync scenarios over the last couple of days: cluster member loss, DLQ, lazy and not, various message sizes. Tests show a similar or better profile in all cases (the difference is often insignificant); so far I see no degradation.

illotum avatar Nov 02 '22 18:11 illotum

@mkuratczyk anything else I can help with re your performance branch?

illotum avatar Nov 22 '22 21:11 illotum

@illotum the core team will discuss what's ahead for that branch. Thank you for the details and for not letting this slip.

michaelklishin avatar Nov 23 '22 06:11 michaelklishin

@illotum What versions would you expect this to be back-ported to? While we didn't find any regressions, we know RabbitMQ is used in so many ways/configurations that it could have some negative impact in some of them, so we are wary of back-porting this/releasing in a patch release (especially for "older" versions like 3.9 where very few changes happen at this point).

mkuratczyk avatar Nov 23 '22 21:11 mkuratczyk

I'd be happy with 3.10, and fine with 3.11. We are only rolling out 3.11, so it would be a good release to pick first.

illotum avatar Nov 23 '22 23:11 illotum

3.11 is what the core team had in mind.

michaelklishin avatar Nov 24 '22 06:11 michaelklishin

If it's possible for 3.10, I'd be very happy :)

danielllek avatar Nov 24 '22 08:11 danielllek

@illotum My suggestion would be to ship it in 3.11 for now. Start using it, keep an eye on these instances and then, if you see the benefits and don't see regressions, we could consider backporting to 3.10.

mkuratczyk avatar Nov 24 '22 09:11 mkuratczyk

I've opened a PR and shared some results of today's tests https://github.com/rabbitmq/rabbitmq-server/pull/6467

mkuratczyk avatar Nov 24 '22 17:11 mkuratczyk

@illotum My suggestion would be to ship it in 3.11 for now. Start using it, keep an eye on these instances and then, if you see the benefits and don't see regressions, we could consider backporting to 3.10.

@mkuratczyk I would like to revisit the idea of back-porting this to the 3.10 branch! Feedback from us (AWS) on regressions (and benefits) will be easier to provide if this lands in the 3.10 branch, as we currently have larger adoption of 3.10 in our fleet of brokers.

Is there anything in particular you require in terms of feedback from us in order to backport to 3.10 sooner rather than later?

SimonUnge avatar Dec 15 '22 21:12 SimonUnge