
Classic mirrored queue becomes unresponsive on lazy queue policy change

lukebakken opened this issue

Discussed in https://github.com/rabbitmq/rabbitmq-server/discussions/5095

See the following project for a script to reproduce the issue (requires asdf but it could be easily adapted to use kerl):

https://github.com/lukebakken/rabbitmq-server-5086

Note that the repository also has a latest-versions branch, which demonstrates that the issue does not appear to be reproducible with Erlang 25.0.2 and RabbitMQ 3.10.5.

Using recon, it looks like the process is stuck in a loop. Use rabbitmq_top to figure out which node(s) have processes with excessive reduction counts, then connect to one of them with this command:

erl -remsh rabbit-2@shostakovich
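Once attached, recon can point to the busiest processes directly. A minimal sketch of the calls involved, assuming the suspect pid is <0.1839.0> (substitute the pid reported on your node; the prompt numbers are illustrative):

(rabbit-2@shostakovich)1> recon:proc_count(reductions, 5).   % top 5 processes by reduction count
(rabbit-2@shostakovich)2> recon:info({0,1839,0}, messages).  % pending messages of the suspect process
(rabbit-2@shostakovich)3> recon:info({0,1839,0}, location).  % initial call and current stacktrace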

Some diagnostic output:

(rabbit-2@shostakovich)14> recon:info({0,1839,0},messages).
{messages,[bump_reduce_memory_use,
           {'$gen_cast',{sync_start,#Ref<46249.850392274.3755737094.223971>,
                                    <46249.1789.0>}},
           emit_gen_server2_stats,
           {'$gen_cast',{run_backing_queue,rabbit_mirror_queue_master,
                                           #Fun<rabbit_mirror_queue_slave.9.51035377>}}]}

            {current_stacktrace,[{rabbit_queue_index,seq_id_to_seg_and_rel_seq_id,
                                                     1,
                                                     [{file,"rabbit_queue_index.erl"},{line,1005}]},
                                 {rabbit_queue_index,next_segment_boundary,1,
                                                     [{file,"rabbit_queue_index.erl"},{line,488}]},
                                 {rabbit_variable_queue,permitted_beta_count,1,
                                                        [{file,"rabbit_variable_queue.erl"},{line,2537}]},
                                 {rabbit_variable_queue,reduce_memory_use,1,
                                                        [{file,"rabbit_variable_queue.erl"},{line,2443}]},
                                 {rabbit_variable_queue,resume,1,
                                                        [{file,"rabbit_variable_queue.erl"},{line,914}]},
                                 {rabbit_variable_queue,convert_to_lazy,1,
                                                        [{file,"rabbit_variable_queue.erl"},{line,1024}]},
                                 {rabbit_variable_queue,set_queue_mode,2,
                                                        [{file,"rabbit_variable_queue.erl"},{line,991}]},
                                 {rabbit_priority_queue,set_queue_mode,2,
                                                        [{file,"rabbit_priority_queue.erl"},{line,448}]}]}]},

...and this:

            {current_stacktrace,[{lqueue,len,1,
                                         [{file,"lqueue.erl"},{line,92}]},
                                 {rabbit_variable_queue,reduce_memory_use,1,
                                                        [{file,"rabbit_variable_queue.erl"},{line,2448}]},
                                 {rabbit_variable_queue,resume,1,
                                                        [{file,"rabbit_variable_queue.erl"},{line,914}]},
                                 {rabbit_variable_queue,convert_to_lazy,1,
                                                        [{file,"rabbit_variable_queue.erl"},{line,1024}]},
                                 {rabbit_variable_queue,set_queue_mode,2,
                                                        [{file,"rabbit_variable_queue.erl"},{line,991}]},
                                 {rabbit_priority_queue,set_queue_mode,2,
                                                        [{file,"rabbit_priority_queue.erl"},{line,448}]},
                                 {rabbit_mirror_queue_slave,process_instruction,2,
                                                            [{file,"rabbit_mirror_queue_slave.erl"},{line,1042}]},
                                 {rabbit_mirror_queue_slave,handle_cast,2,
                                                            [{file,"rabbit_mirror_queue_slave.erl"},{line,299}]}]}]},

Originally posted by b-borden on June 21, 2022:

Hi there,

I am experiencing a stuck classic mirrored queue on my test RabbitMQ 3.9.20 cluster after a change to the queue's effective policy. While stuck, the queue blocks both publishing and consumption of messages and is listed as unresponsive by rabbitmqctl list_unresponsive_queues. I can still query the queue metadata from the management API, which reports the queue as being in the “running” state.

When this happens, scheduler CPU utilization increases and stays elevated. In rabbitmq_top, the queue's mirror_queue_slave process shows a very high reduction count and appears unable to complete rabbit_variable_queue:convert_to_lazy.
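From a shell, the stuck state can be confirmed with something along these lines (node name as in the reproduction below; the list_queues columns are illustrative):

rabbitmqctl -n node1@localhost list_unresponsive_queues
rabbitmqctl -n node1@localhost list_queues name state messages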

Reproduction

It can be reproduced by deleting the policy in use by a queue, causing the queue to fall back to a lower-priority policy that enables lazy queue mode:

  1. Start a cluster of two 3.9.20 nodes. Node configurations below.
  2. Apply the following HA queue policies:
    1. rabbitmqctl -n node1@localhost set_policy policy-0 ".*" '{"ha-mode":"all", "ha-sync-mode": "automatic", "queue-mode": "lazy"}' --priority 0
    2. rabbitmqctl -n node1@localhost set_policy policy-1 ".*" '{"ha-mode":"all", "ha-sync-mode": "automatic"}' --priority 1
  3. Create a durable classic mirrored queue
  4. Publish 1 million messages to the queue (less reliably reproduced with fewer messages, possibly as low as ~300k; a sample publish command is sketched after this list):
    1. delivery mode: persistent
    2. size: 800 Bytes / msg
  5. Remove the higher priority policy:
    1. rabbitmqctl -n node1@localhost clear_policy policy-1
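
Step 4 can be done with PerfTest; a rough sketch, assuming the durable queue from step 3 is named repro-queue and node1 is listening on 127.0.0.1:5671 as configured below (the queue name and flags are illustrative, not taken from the original report):

perf-test --uri amqp://127.0.0.1:5671 --predeclared --queue repro-queue \
  --flag persistent --size 800 --pmessages 1000000 --producers 1 --consumers 0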

Node1 configuration:

listeners.tcp.default = 127.0.0.1:5671

management.tcp.ip   = 127.0.0.1
management.tcp.port = 15671

vm_memory_high_watermark.absolute = 1G

cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = node1@localhost
cluster_formation.classic_config.nodes.2 = node2@localhost

Node2 configuration:

listeners.tcp.default = 127.0.0.1:5672

management.tcp.ip   = 127.0.0.1
management.tcp.port = 15672

vm_memory_high_watermark.absolute = 1G

cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = node1@localhost
cluster_formation.classic_config.nodes.2 = node2@localhost

lukebakken, Jun 28 '22

For the record, I could reproduce it as well using your branch, and I let it run long enough to be sure it is an infinite loop. It may not occur on 3.10 due to the many changes that went into that release. Also, I'll be on PTO next week.

lhoguin, Jul 22 '22

@dcarwin-pivotal @michaelklishin I confirmed just now that this issue does not apply to RabbitMQ 3.10.6 (tested with Erlang 25.0.3).

lukebakken, Jul 22 '22

Closing since RabbitMQ 3.10.x fixes this issue.

lukebakken, Oct 17 '22