
Improvement request: Faster Quorum Queue startup

Open luos opened this issue 3 years ago • 17 comments

Hi,

Currently when a quorum queue starts up it will go through all the snapshot files:

https://github.com/rabbitmq/ra/blob/94ea25111b6f9f795e4f8b30fb3018b74426a4ec/src/ra_log.erl#L940

This is very inefficient for large queues; it can take minutes to read through these files one by one. The work does not have to be done serially: it could go through a worker pool, or each queue could start a few workers of its own. If the worker-pool approach is implemented, some care would probably be needed so that long queues don't hold up short queues from starting.
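To make the worker-pool idea concrete, here is a minimal sketch (in Python, not the actual Erlang/Ra code; all names are hypothetical) of recovering snapshot files through a shared pool, scheduling the smallest files first so short queues are not starved behind long ones:

```python
import concurrent.futures
import os

def recover_queue(snapshot_path):
    """Hypothetical per-queue recovery: read the snapshot file in one pass."""
    with open(snapshot_path, "rb") as f:
        data = f.read()
    return snapshot_path, len(data)

def recover_all(snapshot_paths, pool_size=4):
    # Schedule the smallest snapshots first so short queues are not
    # held up behind long ones sharing the same worker pool.
    ordered = sorted(snapshot_paths, key=os.path.getsize)
    with concurrent.futures.ThreadPoolExecutor(max_workers=pool_size) as pool:
        return dict(pool.map(recover_queue, ordered))
```

With several files in flight at once, the disk sees concurrent sequential reads, which is what lets aggregate throughput climb well above what a single serial reader achieves.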

On my machine a single queue does around 15-20 MB/sec, while it should be possible to do 200 MB/sec (which I verified by starting more queues).

This is especially a problem after a "full" restart, where long QQs show up in a "NaN" state until they recover.

With some redesign, this segref operation could probably be made to take a few milliseconds instead of reading all the snapshots, for example if the information could be stored in the file header.

luos avatar May 05 '21 12:05 luos

Quorum queues already recover in parallel (as you noted, multiple queues make better use of the disk), so I am not sure what this request is for. When recovering, the quorum queue state machine needs to rebuild its state; it does that by reading the last available snapshot and then applying all entries since that point. So is this request about optimising single-queue recovery?

kjnilsson avatar Jun 09 '21 08:06 kjnilsson

Do you have any profiling data suggesting that recovering the segment references is what takes the time during queue recovery? If anything, it would be reading the log from start to finish that takes the time.

kjnilsson avatar Jun 09 '21 09:06 kjnilsson

Hi,

Thanks for considering this.

I know they recover in parallel, but if you have a full cluster failure, for example, RabbitMQ will be up for minutes before a long queue is actually able to accept messages.

I did benchmark this, and as mentioned, this is what I got:

On my machine a single queue does around 15-20 Megabyte / sec, while it should be possible to do 200 MB/sec (which I tried with more queues and it works).

It is the part I linked to that takes a long time: if you check ra_log_segment:open, it reads the whole file. We are probably saying the same thing. What I saw is that progress through the segment files was slow, but there may be some log reading going on as well. I tested the newest version just now and it still does around 10 MB/sec on startup.

Basically, if you have a single long queue, the disk is nowhere near fully utilised; in fact, it's barely utilised at all. If you start multiple queues, you can see the total IO jump to a multiple of what a single queue does.

It also looks somewhat broken, as the management UI just shows NaN with no indication of progress.

I can't quite remember, but I believe that when a follower comes up it also scans the segment files, which means that if you have a large queue and do a restart, that follower will be effectively out of service for minutes.

luos avatar Jun 09 '21 09:06 luos

ra_log_segment:open does not read the whole file. It reads and recovers the fixed-size segment index that resides at the start of the file. Later, during state machine recovery, it reads each entry and applies it to the state machine. That isn't super fast, and you are right that there is some inefficiency in ra_log_segment:open we can improve.
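The real on-disk layout lives in the Ra sources; as a rough illustration only (hypothetical slot format and sizes, not Ra's actual format), the idea of a fixed-size index at the head of a segment file, which can be recovered without touching the entry data behind it, looks like this:

```python
import struct

INDEX_SLOTS = 256                    # hypothetical fixed index capacity
SLOT = struct.Struct("<QI")          # hypothetical slot: (offset, length)
HEADER_SIZE = INDEX_SLOTS * SLOT.size

def read_index(f):
    """Read only the fixed-size index at the start of the segment,
    without reading the entry data that follows it."""
    f.seek(0)
    raw = f.read(HEADER_SIZE)
    slots = []
    for i in range(INDEX_SLOTS):
        off, length = SLOT.unpack_from(raw, i * SLOT.size)
        if length == 0:              # empty slot marks the end of the index
            break
        slots.append((off, length))
    return slots

def read_entry(f, off, length):
    """Fetch a single entry's bytes using its index slot."""
    f.seek(off)
    return f.read(length)
```

Opening a segment then costs one bounded read of the header, and entries are only read later, one by one, as the state machine replays them.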

kjnilsson avatar Jun 09 '21 12:06 kjnilsson

In particular, the call to filename:absname/1 is very slow and possibly not even needed.

kjnilsson avatar Jun 09 '21 12:06 kjnilsson

This might help a bit: https://github.com/rabbitmq/ra/pull/222

It won't be a night-and-day difference, but it may improve certain scenarios.

kjnilsson avatar Jun 09 '21 13:06 kjnilsson

Thank you, I will check it out when I have some time, though I think the biggest bottleneck is the single-threadedness of the recovery procedure. Hopefully next week I can give you more detail on where I see the slowness.

luos avatar Jun 09 '21 15:06 luos

@kjnilsson now has an improvement that looks so promising we find it too good to be true :) @luos would you have a chance to compare a branch of RabbitMQ server and/or Ra?

michaelklishin avatar Jun 09 '21 15:06 michaelklishin

https://github.com/rabbitmq/ra/pull/223/ is the PR, but I am still experimenting with it quite heavily. For my test case of a 1M-message QQ, the recovery time went from 1 minute 16 s to 16 s, shaving off a good minute.

The approach, however, is detrimental to later log access patterns, so I need to amend the PR to only use the "read ahead" approach during recovery and not afterwards.
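The trade-off here can be sketched as follows (a hypothetical buffer wrapper, not the Ra code): read-ahead serves sequential recovery reads from one large buffered chunk instead of one seek-and-read per entry, but for the random access patterns of normal operation the big speculative reads would be wasted work.

```python
class ReadAheadFile:
    """Serve sequential entry reads from a large buffered chunk,
    so recovery does not pay one syscall per entry."""

    def __init__(self, f, chunk_size=4 * 1024 * 1024):
        self.f = f
        self.chunk_size = chunk_size
        self.buf = b""
        self.buf_start = 0

    def read_at(self, off, length):
        end = self.buf_start + len(self.buf)
        if not (self.buf_start <= off and off + length <= end):
            # Cache miss: refill the buffer with one large sequential read.
            self.f.seek(off)
            self.buf = self.f.read(max(length, self.chunk_size))
            self.buf_start = off
        rel = off - self.buf_start
        return self.buf[rel:rel + length]
```

During recovery, entries are read in ascending offset order, so almost every read_at call is a buffer hit; for later random access one would bypass the wrapper entirely, which matches the plan of enabling read-ahead only during recovery.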

kjnilsson avatar Jun 10 '21 07:06 kjnilsson

@luos now that https://github.com/rabbitmq/ra/pull/223 has been merged, would you be interested in repeating your test against a recent RabbitMQ 3.9.0 alpha? We can provide a link to a Docker image tag or a specific package type.

michaelklishin avatar Jun 22 '21 04:06 michaelklishin

Hi @michaelklishin. Yes, please share the Docker image/tag with me; that would greatly reduce the preparation I need to do, and therefore increase the chance that I can test this. Thanks! :)

luos avatar Jun 22 '21 05:06 luos

@luos

docker pull pivotalrabbitmq/rabbitmq:master

should get you an OCI image of the latest RabbitMQ master build.

michaelklishin avatar Jun 22 '21 09:06 michaelklishin

Thanks for the tag, the commit I pulled was 0cbef999cceea632a953111c955548f20d6fdc4f.

I am testing with a single queue holding 1,000,000 messages of 1,000 bytes each.

It is definitely much better: instead of 10-15 MB/sec, it's now around 40-60 MB/sec.
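For scale, the arithmetic behind those rates (payload bytes only, ignoring per-message overhead):

```python
# 1M messages of 1,000 bytes each is roughly 1 GB of payload.
payload_bytes = 1_000_000 * 1_000

# Time to read it all back at the observed recovery rates.
for rate_mb_s in (10, 15, 40, 60):
    seconds = payload_bytes / (rate_mb_s * 1_000_000)
    print(f"{rate_mb_s} MB/s -> {seconds:.0f} s")
```

So the jump from 10-15 MB/sec to 40-60 MB/sec cuts recovery of this workload from over a minute down to tens of seconds, consistent with the 1 min 16 s to 16 s result reported above.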

[screenshot]

Comparison with 3.8.17, Erlang 24.0.2 (same version as before):

[screenshot]

Testing with 3 queues (3 million messages) I expected 3 times the disk reads:

[screenshot]

That did not happen, but it still did about twice as much disk IO, which to me still indicates a CPU/single-threadedness limit in the recovery. Hopefully tomorrow I'll be able to dig a little deeper into which part is actually slow.

The test I am doing is the following:

  1. Publish many messages into one or multiple queues
  2. Select a node and do stop_app then start_app
  3. Observe the graph on the management interface

Alternatively you can restart all nodes at the same time, but I think that just ruins the measurement.

Thanks for the fixes already, it seems like it improved a lot.

luos avatar Jun 22 '21 12:06 luos

Hey @kjnilsson, can this fix help improve startup for hundreds of quorum queues (~1000)? Most of the time the queues are empty or hold only a small number of messages.

Thanks

liorhasson avatar Jul 13 '21 08:07 liorhasson

No, this optimisation is for faster recovery when there are many Raft entries to be read from disk. Feel free to open a separate issue describing your problem with as much detail as possible.

kjnilsson avatar Jul 13 '21 09:07 kjnilsson

Hi @liorhasson, do you mind opening an issue, or at least providing more details here, so we can better understand your problem?

yaronp68 avatar Jul 22 '21 12:07 yaronp68

@luos Is there anything concrete to address as part of this issue or can it be closed?

kjnilsson avatar Mar 24 '22 16:03 kjnilsson

Currently, recovering 1 million 1 kB messages in a single queue takes 6-7 seconds for me. It's now too fast for me to even monitor the process with the management stats, so I think we can consider this done. :)

For extremely long queues there is still room for improvement, which we expect to deliver at some point via https://github.com/rabbitmq/ra/issues/141

mkuratczyk avatar Jul 12 '23 09:07 mkuratczyk

One final note: this was tested with Ra 2.6.3 (not included in any released version yet), which includes (another) significant improvement for long queue recovery: https://github.com/rabbitmq/ra/pull/385

mkuratczyk avatar Jul 12 '23 09:07 mkuratczyk