rabbitmq-dotnet-client icon indicating copy to clipboard operation
rabbitmq-dotnet-client copied to clipboard

Long delay from publish to consumer event

Open lukebakken opened this issue 2 years ago • 32 comments

Reported by @boskjoett

Discussed in https://github.com/rabbitmq/rabbitmq-dotnet-client/discussions/1249

Originally posted by boskjoett September 9, 2022 We are consistently, but with random frequency, experiencing long delays (more than a second) from the time a publisher delivers a message on the bus with BasicPublish() until a consumer (EventingBasicConsumer) is notified by a Received event. We have tried both with and without auto-ack and using multiple consumers, but without any effect. Our message bus is not overloaded, we can reproduce it with very few short messages. We don't believe it is network delays because we can also reproduce the random delays in a docker-compose stack where both the producer, the consumer and RabbitMQ is running in containers in the same stack and the same docker network.

Code to reproduce:

  • https://github.com/boskjoett/rabbitmq-dotnet-client-1060
  • Python code that does not show the issue - https://github.com/lukebakken/rabbitmq-dotnet-client-1060/tree/lukebakken/python

lukebakken avatar Sep 16 '22 19:09 lukebakken

@bording @bollhals @stebet @danielmarbach - pinging you just to see if anything immediately comes to mind. The issue is that the first call to BasicPublish appears to introduce latency that does not happen with subsequent calls. I suppose this additional latency could be on the consumer side for the first time a message is received. I'll get a packet trace to see if I can determine on "which side" the issue lies.

  • Producer - https://github.com/lukebakken/rabbitmq-dotnet-client-1060/blob/boskjoett-main/producer/Program.cs
  • Consumer - https://github.com/lukebakken/rabbitmq-dotnet-client-1060/blob/boskjoett-main/consumer/Program.cs

The time taken to establish the connection is not part of this calculation, of course.

lukebakken avatar Sep 16 '22 20:09 lukebakken

The first time all the code will have to be jitted by the runtime, this will for sure take some time.

To test this you'd have to send one message, disconnect and build up the connection again in the same process before starting the test sends.

bollhals avatar Sep 17 '22 06:09 bollhals

A workload that opens a connection per message published should not be considered typical for any RabbitMQ client. Should we spend our time on workloads that use reasonably long-lived connections and services?

michaelklishin avatar Sep 17 '22 12:09 michaelklishin

The first time all the code will have to be jitted by the runtime, this will for sure take some time.

The Native Image Generator (Ngen.exe) tool could be used to verify this assumption https://learn.microsoft.com/en-us/previous-versions/dotnet/netframework-2.0/6t9t5wcf(v=vs.80)?redirectedfrom=MSDN

boskjoett avatar Sep 19 '22 06:09 boskjoett

Thanks @bollhals and @boskjoett, I'll start investigating there.

lukebakken avatar Sep 19 '22 15:09 lukebakken

@lukebakken Since you mention that the python version of this test doesn't show a delay, have you tried mixing the python and C# versions to see which pairs also demonstrate the problem? Should be able to help narrow down where the problem is that way.

bording avatar Sep 19 '22 15:09 bording

@lukebakken could it be related to the flow control synchronization?

danielmarbach avatar Sep 19 '22 15:09 danielmarbach

have you tried mixing the python and C# versions to see which pairs also demonstrate the problem?

No but that's an excellent suggestion. Thank you.

lukebakken avatar Sep 19 '22 15:09 lukebakken

@lukebakken could it be related to the flow control synchronization?

Indeed it could. It is exactly those lines you highlighted that we suspect are causing the delays.

boskjoett avatar Sep 19 '22 19:09 boskjoett

I see the additional latency when combining Python / .NET both ways, but the latency is greater when the publisher is .NET and the consumer is Python.

Removing the wait event around flow control didn't have an effect (commit), but it's something I'm going to remove anyway since no other client tries to handle the channel.flow method in this manner.

I'm going to try using ngen next.

lukebakken avatar Sep 19 '22 22:09 lukebakken

Hm, using crossgen doesn't seem to affect the initial message latency. Here's what I did:

  • Enable PublishReadyToRun in both csproj files (https://github.com/lukebakken/rabbitmq-dotnet-client-1060/blob/boskjoett-main/consumer/consumer.csproj#L8, for instance)
  • In the producer and consumer dirs of the above repo & branch, run the following:
    git clean -xffd
    dotnet publish --configuration Release --use-current-runtime --no-self-contained --verbosity detailed
    
  • See producer-publish.txt for the output which shows crossgen was run.
  • Run consumer then producer (producer run in the same manner):
    $ cd consumer/bin/Release/net6.0/arch-x64/
    $ dotnet consumer.dll
    
  • Same initial latency 🤔
    lbakken@shostakovich ~/development/lukebakken/rabbitmq-dotnet-client-1060/consumer (boskjoett-main =)
    $ cd bin/Release/net6.0/arch-x64/
    lbakken@shostakovich ~/development/lukebakken/rabbitmq-dotnet-client-1060/consumer/bin/Release/net6.0/arch-x64 (boskjoett-main =)
    $ dotnet consumer.dll
    CONSUMER: waiting for messages...
    CONSUMER received at 09/19/2022 16:16:43.550872, sent at 09/19/2022 16:16:43.537436 - iteration: 1, delay: 00:00:00.0134368
    CONSUMER received at 09/19/2022 16:16:48.550556, sent at 09/19/2022 16:16:48.548227 - iteration: 2, delay: 00:00:00.0023299
    CONSUMER received at 09/19/2022 16:16:53.551818, sent at 09/19/2022 16:16:53.549104 - iteration: 3, delay: 00:00:00.0027149
    ^CCTRL-C pressed, exiting!
    

lukebakken avatar Sep 19 '22 23:09 lukebakken

Next I'm going to test what @bollhals suggests in this comment:

To test this you'd have to send one message, disconnect and build up the connection again in the same process before starting the test sends.

lukebakken avatar Sep 19 '22 23:09 lukebakken

This is the consumer output from this branch where the consumer and producer both do a "first iteration" where a connection/channel are created to send one message, then both closed, then the "real" test begins. I did not use crossgen in this case because I am not running dotnet publish first.

Notice that the first iteration of the real test still shows more latency, but not nearly as much as in the above test or in @boskjoett's initial report.

Seems like the "warm up" of the library has something to do with this. I'm going to port these tests to Java to see if that library and runtime environment shows the same behavior.

lbakken@shostakovich ~/development/lukebakken/rabbitmq-dotnet-client-1060/consumer (boskjoett-main *=)
$ dotnet run
CONSUMER: waiting for messages...
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration done (message: 09/19/2022 16:44:54.564770), disconnecting and re-connecting...
CONSUMER: waiting for messages...
CONSUMER received at 09/19/2022 16:45:04.595576, sent at 09/19/2022 16:45:04.592378 - iteration: 1, delay: 00:00:00.0031985
CONSUMER received at 09/19/2022 16:45:07.595046, sent at 09/19/2022 16:45:07.593387 - iteration: 2, delay: 00:00:00.0016592
CONSUMER received at 09/19/2022 16:45:10.595347, sent at 09/19/2022 16:45:10.593740 - iteration: 3, delay: 00:00:00.0016079
CONSUMER received at 09/19/2022 16:45:13.595644, sent at 09/19/2022 16:45:13.594106 - iteration: 4, delay: 00:00:00.0015383
CONSUMER received at 09/19/2022 16:45:16.595780, sent at 09/19/2022 16:45:16.594453 - iteration: 5, delay: 00:00:00.0013271
^CCTRL-C pressed, exiting!

lukebakken avatar Sep 19 '22 23:09 lukebakken

I have updated my test programs in this repo https://github.com/boskjoett/rabbitmq-dotnet-client-1060 so that the producer now sends a burst of 1 to 9 messages to the consumer at a time.

It shows two things:

  1. The delivery time of the first burst of messages is 4 to 5 times longer than the subsequent message bursts.
  2. The delivery time of the subsequent bursts varies a lot. More than I would expect. Is this due to TCP flow control?

Output from the test

CONSUMER: waiting 5 seconds to try initial connection
CONSUMER: waiting for messages...
CONSUMER received message 1 at 09-20-2022 09:23:52.473, delay: 30,5195 ms
CONSUMER received message 2 at 09-20-2022 09:23:52.510, delay: 49,3794 ms
CONSUMER received message 3 at 09-20-2022 09:23:52.510, delay: 49,8232 ms
CONSUMER received message 4 at 09-20-2022 09:23:57.469, delay: 5,5229 ms
CONSUMER received message 5 at 09-20-2022 09:23:57.470, delay: 5,2937 ms
CONSUMER received message 6 at 09-20-2022 09:24:02.497, delay: 12,1088 ms
CONSUMER received message 7 at 09-20-2022 09:24:02.502, delay: 9,989 ms
CONSUMER received message 8 at 09-20-2022 09:24:02.505, delay: 8,349 ms
CONSUMER received message 9 at 09-20-2022 09:24:02.512, delay: 15,5252 ms
CONSUMER received message 10 at 09-20-2022 09:24:02.521, delay: 23,2979 ms
CONSUMER received message 11 at 09-20-2022 09:24:02.521, delay: 22,9266 ms
CONSUMER received message 12 at 09-20-2022 09:24:02.522, delay: 22,1546 ms
CONSUMER received message 13 at 09-20-2022 09:24:02.522, delay: 20,4214 ms
CONSUMER received message 14 at 09-20-2022 09:24:07.517, delay: 3,8164 ms
CONSUMER received message 15 at 09-20-2022 09:24:07.518, delay: 2,6562 ms
CONSUMER received message 16 at 09-20-2022 09:24:07.518, delay: 2,971 ms
CONSUMER received message 17 at 09-20-2022 09:24:07.519, delay: 3,1154 ms
CONSUMER received message 18 at 09-20-2022 09:24:12.535, delay: 13,0199 ms
CONSUMER received message 19 at 09-20-2022 09:24:12.541, delay: 12,4516 ms
CONSUMER received message 20 at 09-20-2022 09:24:12.543, delay: 8,7846 ms
CONSUMER received message 21 at 09-20-2022 09:24:12.545, delay: 4,7128 ms
CONSUMER received message 22 at 09-20-2022 09:24:12.549, delay: 3,7252 ms
CONSUMER received message 23 at 09-20-2022 09:24:12.560, delay: 6,016 ms
CONSUMER received message 24 at 09-20-2022 09:24:12.564, delay: 3,1196 ms
CONSUMER received message 25 at 09-20-2022 09:24:12.567, delay: 3,7524 ms

boskjoett avatar Sep 20 '22 07:09 boskjoett

How do we define "a lot"? Publisher rate will vary, in particular with classic queues v1, because of both flow control and because they move messages to disk in batches due to certain internal metric.

Use a quorum queue or a CQv2 to reduce this variability to primarily internal flow control.

michaelklishin avatar Sep 20 '22 13:09 michaelklishin

You are right. It is probably not fair to say "a lot" when it is in the milliseconds range. Our queues are not durable and messages are transient, so I don't expect writes to disk.

boskjoett avatar Sep 20 '22 13:09 boskjoett

Thanks for the additional testing code @boskjoett. I will update my Python code to do the same to see if the same behavior exists.

lukebakken avatar Sep 20 '22 14:09 lukebakken

As I suspected, the Python code is not affected by the same sort of delays - https://github.com/lukebakken/rabbitmq-dotnet-client-1060/commit/e847411ec834de08fd4083af00630150fb3d75f5

At this time I'll try @boskjoett's latest code with the main branch of this repo. There's a chance this issue has been addressed. If not, we can discuss next steps.

lukebakken avatar Sep 20 '22 15:09 lukebakken

As I suspected, the Python code is not affected by the same sort of delays - https://github.com/lukebakken/rabbitmq-dotnet-client-1060/commit/e847411ec834de08fd4083af00630150fb3d75f5

What was the output for it?

bollhals avatar Sep 20 '22 18:09 bollhals

We have discovered that the long random delays (more than a second) we have experienced in our code using RabbitMQ were caused by lack of threads in the .NET Threadpool. Creating new threads on demand can take several hundred milliseconds. Inspired by this line of code in your test application https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/projects/TestApplications/MassPublish/Program.cs#:~:text=ThreadPool.SetMinThreads(16%20*%20Environment.ProcessorCount%2C%2016%20*%20Environment.ProcessorCount)%3B we increased the Threadpool size and now the long random delays are gone. There is still the initial delay, when sending the first message, but we can live with that.

boskjoett avatar Sep 20 '22 18:09 boskjoett

We have discovered that the long random delays (more than a second) we have experienced in our code using RabbitMQ were caused by lack of threads in the .NET Threadpool. Creating new threads on demand can take several hundred milliseconds. Inspired by this line of code in your test application

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/projects/TestApplications/MassPublish/Program.cs#:~:text=ThreadPool.SetMinThreads(16%20*%20Environment.ProcessorCount%2C%2016%20*%20Environment.ProcessorCount)%3B

we increased the Threadpool size and now the long random delays are gone.

Threads are not blocked for the basicpublish method, nor anywhere in the receiving. So if this fixes the issue for you, you either do more than just basicPublish (e.g. QueueDeclare) or they're blocked by something else in your application.

Do you know what blocks them?

bollhals avatar Sep 20 '22 19:09 bollhals

Generally changing the thread pool limits is something that I have rarely ever seen being useful in production. On the flip side I've seen the problems actually getting worse in certain scenarios because people who override it no longer benefit from the default "smarts" of the ramp up of the thread pool and the optimizations done over time in the thread pool. Of course the thread pool is general purpose and mileage may vary but that being said the defaults are battle tested against hundreds of scenarios. There is a reason why there is a caution warning on the page in the remarks section

https://learn.microsoft.com/en-us/dotnet/api/system.threading.threadpool.setminthreads?view=net-6.0#remarks

Have you tried switching to the async consumer? Have you tried stopping copying the body and see of things are related to GC pauses? Also how is that synthetic benchmark you have in the sample representative of the problem you are trying to solve? What are you trying to achieve that a few MS of delay would actually matter? In your real prod system what are you planning to do there inside the receiving code?

danielmarbach avatar Sep 20 '22 19:09 danielmarbach

The threadpool issue is not in the RabbitMQ .NET Client, but in our application code using it, where it sends and receives messages. We have code that waits on ManualResetEvents, etc. You are right that the correct solution is not to fiddle with the Threadpool size, but to use async/await properly.

boskjoett avatar Sep 20 '22 19:09 boskjoett

@bollhals this is the edited output from the first two batches of messages sent via the Python producer/consumer programs:

CONSUMER received at 2022-09-20 14:12:27.305867, sent at 2022-09-20 14:12:27.305110 - iteration 1, delay: 0:00:00.000757
CONSUMER received at 2022-09-20 14:12:27.306376, sent at 2022-09-20 14:12:27.305300 - iteration 2, delay: 0:00:00.001076
CONSUMER received at 2022-09-20 14:12:27.306736, sent at 2022-09-20 14:12:27.305443 - iteration 3, delay: 0:00:00.001293
CONSUMER received at 2022-09-20 14:12:27.306982, sent at 2022-09-20 14:12:27.305633 - iteration 4, delay: 0:00:00.001349
CONSUMER received at 2022-09-20 14:12:27.307101, sent at 2022-09-20 14:12:27.305835 - iteration 5, delay: 0:00:00.001266
CONSUMER received at 2022-09-20 14:12:27.307210, sent at 2022-09-20 14:12:27.306037 - iteration 6, delay: 0:00:00.001173
CONSUMER received at 2022-09-20 14:12:27.307314, sent at 2022-09-20 14:12:27.306245 - iteration 7, delay: 0:00:00.001069
CONSUMER received at 2022-09-20 14:12:27.307416, sent at 2022-09-20 14:12:27.306449 - iteration 8, delay: 0:00:00.000967
CONSUMER received at 2022-09-20 14:12:27.307517, sent at 2022-09-20 14:12:27.306659 - iteration 9, delay: 0:00:00.000858
CONSUMER received at 2022-09-20 14:12:32.314942, sent at 2022-09-20 14:12:32.312136 - iteration 10, delay: 0:00:00.002806
CONSUMER received at 2022-09-20 14:12:32.316449, sent at 2022-09-20 14:12:32.313078 - iteration 11, delay: 0:00:00.003371
CONSUMER received at 2022-09-20 14:12:32.317333, sent at 2022-09-20 14:12:32.313794 - iteration 12, delay: 0:00:00.003539
CONSUMER received at 2022-09-20 14:12:32.318486, sent at 2022-09-20 14:12:32.314793 - iteration 13, delay: 0:00:00.003693
CONSUMER received at 2022-09-20 14:12:32.318616, sent at 2022-09-20 14:12:32.315836 - iteration 14, delay: 0:00:00.002780
CONSUMER received at 2022-09-20 14:12:32.318728, sent at 2022-09-20 14:12:32.316799 - iteration 15, delay: 0:00:00.001929
CONSUMER received at 2022-09-20 14:12:32.318833, sent at 2022-09-20 14:12:32.317764 - iteration 16, delay: 0:00:00.001069

lukebakken avatar Sep 20 '22 21:09 lukebakken

You may close this issue. It is not a real issue, as long as it only occurs during startup.

boskjoett avatar Nov 17 '22 20:11 boskjoett

This is something to investigate for 7.0

lukebakken avatar Nov 17 '22 21:11 lukebakken

I'm re-opening this just to be sure that version 7.0 fixes this issue:

https://groups.google.com/g/rabbitmq-users/c/V_foZnLFVjs

lukebakken avatar Jun 02 '23 13:06 lukebakken

Hi everyone,

I have just tagged version 7.0.0-alpha.1, which includes async methods for the full public API -

https://github.com/rabbitmq/rabbitmq-dotnet-client/releases/tag/v7.0.0-alpha.1

Once this is pushed to NuGet by GitHub Actions, it would be really really great if you could confirm that the async API resolves this issue in your environment (provided that it's relatively easy to update your test code to the new API).

I will take some time today to re-run my test project with 7.0.0-alpha.1 - https://github.com/lukebakken/rabbitmq-dotnet-client-1060/tree/main

Thanks!

lukebakken avatar Nov 20 '23 18:11 lukebakken

It looks like the initial delay is still there using 7.0.0-alpha.1. The delay is reduced as @bollhals theorized here and as can be demonstrated in this branch.

It looks like this initial latency is due to runtime startup, so I'm going to close this issue.

lukebakken avatar Nov 20 '23 21:11 lukebakken

Have you tried playing around with the JIT settings to further investigate if it is indeed the JIT overhead?

https://learn.microsoft.com/en-us/dotnet/core/runtime-config/compilation

You can also control tiered JIT as part of the csproj

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TieredCompilation>false</TieredCompilation>
  </PropertyGroup>

</Project>

danielmarbach avatar Nov 21 '23 11:11 danielmarbach