
Add a server side buffer limit

jfheins opened this issue 1 year ago • 19 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Is your feature request related to a problem? Please describe the problem.

We have an internal web application in production that uses ASP.NET Core SignalR. Our app has dozens of users who might open 4 tabs each. However, some of them use mobile devices with little processing power. We send about 100 messages per second to every client to keep them up to date. We can see that the (JS) clients sit at 100% CPU usage and slowly fall behind real time when the message load is high.

Over 99% of the messages are state messages, i.e. as soon as a newer state is known, all older states of the same widget are irrelevant. On the backend (.NET 6 application with Kestrel in a K8s cluster, connected via WebSockets), the memory consumption grows linearly while such a slow client is connected:

(screenshot: backend memory consumption growing linearly over time)

While we are also working on the UI itself to increase efficiency, my issue is that one slow client can impact the backend in a dramatic way and degrade the experience for all other clients. I suspect the memory increases because the backend buffers many messages for the slow client. While this makes sense for many scenarios, for our high-volume messages it is useless to deliver a message that sat in the queue for more than 2 seconds.

Describe the solution you'd like

I would like to limit the server-side buffering. Ideally, this would happen on a per-group basis so that we can designate some groups as important with a bigger buffer and other groups as fire-and-forget so that old, undelivered messages get discarded server side. A limited buffer would make the memory consumption predictable, but a TTL might be easier to reason about.

So a TTL API could look like this:

await _hubContext.Clients.Groups("positions")
.SendAsync("positions", "{x: 1, y: 2}", timeToLive: TimeSpan.FromSeconds(2));

We tried to use the cancellation token in a similar way but without success.

A per-client buffer limit would be desirable as well and could be configured in the options on startup.

Additional context

No response

jfheins avatar Aug 04 '22 14:08 jfheins

  1. Have you captured a memory dump and analyzed the memory to verify some of your assumptions about message buffering? That would be the first thing I would do to investigate this issue.
  2. How big are these messages?
  3. Have you tried reducing the message rate? Maybe you can do something here where you use a bounded channel to store only the most recent values to send to the group? It's basically a second layer of buffering on top before you send the data to the client (see the sketch after this list).
  4. There's a safeguard in place to make sure that slow clients can't clog up the entire pipeline. There's a timeout you can use to control this.
  5. There are knobs at the transport level (when you call MapHub) that let you control the various buffer sizes. However, these will not cause back pressure to be applied at the transport level and will not drop data in the transport. You might be able to combine a smaller buffer size with a cancellation token to achieve the behavior you're trying to accomplish.
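Something along these lines for point 3, as a rough, untested sketch (PositionBroadcaster, MyHub and the group name are placeholders): a bounded channel sits in front of SignalR, keeps only the newest states, and silently drops the oldest when the broadcast can't keep up.

using System.Threading.Channels;
using Microsoft.AspNetCore.SignalR;
using Microsoft.Extensions.Hosting;

public class PositionBroadcaster : BackgroundService
{
    private readonly IHubContext<MyHub> _hub;

    // Second layer of buffering in front of SignalR: at most 16 pending states,
    // and the oldest one is dropped when a new one arrives while the buffer is full.
    private readonly Channel<object> _pending = Channel.CreateBounded<object>(
        new BoundedChannelOptions(16) { FullMode = BoundedChannelFullMode.DropOldest });

    public PositionBroadcaster(IHubContext<MyHub> hub) => _hub = hub;

    // Producers (e.g. the message source) call this instead of SendAsync directly.
    public void Enqueue(object state) => _pending.Writer.TryWrite(state);

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await foreach (var state in _pending.Reader.ReadAllAsync(stoppingToken))
        {
            // Awaiting the send applies the broadcast's back pressure to the channel,
            // which then discards stale states instead of growing server memory.
            await _hub.Clients.Group("positions").SendAsync("positions", state, stoppingToken);
        }
    }
}

Registered as a singleton and as a hosted service, the rest of the app would publish through Enqueue and never touch the hub directly.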

A per-client buffer limit would be desirable as well and could be configured in the options on startup.

The buffers are per client today, but they are at the transport layer, which has no understanding of the message itself, just the serialized bytes. We could introduce a layer of buffering on the hub side, but I'm not a fan of making any changes until the scenario is deeply understood (lots of tests, debugging and a detailed analysis).

davidfowl avatar Aug 04 '22 14:08 davidfowl

Hi @jfheins. We have added the "Needs: Author Feedback" label to this issue, which indicates that we have an open question for you before we can take further action. This issue will be closed automatically in 7 days if we do not hear back from you by then - please feel free to re-open it if you come back to this issue after that time.

ghost avatar Aug 04 '22 18:08 ghost

Hi :-) Thanks for having a look already. I have created a demo application that reproduces our issue. When I run it, it starts off at 100 MiB RAM usage but it can grow to 240 MiB if you give it some time.

I have added a CancellationToken with a 20 s timeout, so my expectation is that the process uses approximately 3 KiB * 200 messages/sec * 20 sec = 12 MiB of additional memory to hold the items. When I serialize them into a string, strings are the biggest users of memory (memory profiler screenshot; at this point the console logs 172 MiB used).

I have attached my simplified program to recreate the issue: Issue-Repro.zip. The goal here would be that the client receives messages with a delay below 3 seconds, regardless of how slowly it processes them. (I have exaggerated the slowness a bit.)

To answer your questions:

How big are these messages?

They are typically 1-3KiB in size.

Have you tried reducing the message rate?

We are building a real-time system where we have to update locations (and planned paths) of vehicles. We are working on migrating to MessagePack to save bandwidth, but I think that we have to keep the update frequency per vehicle. In other words, we do not want to degrade the experience for fast clients to the level where even the slowest device can keep up.

There's a timeout

I have to look deeper into this, but when I run the sample I often see messages that are delayed by more than 20 s - does the timeout apply to WebSockets as well?

control buffer sizes at the transport level

I have looked at the documentation but I was not sure if it applied to WebSockets. I will test some settings next week.

jfheins avatar Aug 05 '22 14:08 jfheins

@jfheins can you put the code in a GitHub repo?

davidfowl avatar Aug 07 '22 01:08 davidfowl

When I serialize them into a string, strings are the biggest users of memory

Why are you doing this?

The goal here would be that the client receives messages with a delay below 3 seconds, regardless of how slow it processes them. (I have exaggerated the slowness a bit)

Your original post mentioned memory and buffer sizes but I think this is the most important piece of information about what you’re trying to accomplish.

In other words, we do not want to degrade the experience for fast clients to the level where even the slowest device can keep up.

What’s the expectation for the client if it can’t keep up?

davidfowl avatar Aug 07 '22 02:08 davidfowl

Looking at the sample, it doesn't use groups, it's broadcasting:

I have added a CancellationToken with 20s timeout so my expectation is that the process uses approximately 3KiB * 200 messages/sec * 20 sec = 12 MiB additional memory to hold the items

There's overhead beyond your serialized messages that you're not accounting for; you can see that in a memory profile (e.g. https://github.com/dotnet/aspnetcore/issues/41343).

How are you measuring memory usage? Working set? What are your memory usage expectations based on?

I have to look deeper into this but when I run the sample I often see messages that are delayed by more than 20s - does the timeout apply to websockets as well?

What's the bebusy() function for on the client side? Does the delay reproduce without that? How many clients are you using for this test? How can we replicate this test on our hardware?

davidfowl avatar Aug 07 '22 05:08 davidfowl

Thanks again for keeping the ball rolling 👍

I have uploaded the code as a repo here: https://github.com/jfheins/SignalRRepro. On to your questions:

When I serialize them into a string, strings are the biggest users of memory

Why are you doing this?

I did it to make the example more like our real use case. We get messages from RabbitMQ that get deserialized into objects (so the important thing is that every message is unique). I changed it now so that I just send the static array, and the behavior is roughly the same.

What’s the expectation for the client if it can’t keep up?

My expected / desired behavior is that the client receives a subset of messages, so if the client needs 20ms to process one message and the server sends one message every 5ms, the client would receive & process roughly every fourth message.

Looking at the sample it doesn't use groups, its broadcasting:

I wanted to keep the example simple but the issue is the same with groups.

How are you measuring memory usage? Working set? What are your memory usage expectations based on?

I'm logging WorkingSet64 to the console because it works on my Mac and gives me an impression of the memory usage. My expectation was that the application would use only "a bit more than needed" memory. So let's work it out: 2 KiB of data becomes about 7.3 KiB as a JSON string; at 200 messages per second that is roughly 1.46 MiB/s of message traffic. Seeing that I pass a cancellation token that cancels after 20 s, I expect about 28.5 MiB in the queue, maybe double that (additional allocations...). So when I run the application (and click start in the UI), I expect that the example app never uses more than 57 MiB of additional memory compared to startup time.

What's the bebusy() function for on the client side?

It simulates scripting load. In our UI we have a lot of scripting going on (Angular change detection & more) which makes the UI sluggish when processing many messages.

Does the delay reproduce without that?

No, if the UI keeps up everything is fine and the backend uses a reasonable amount of memory.

How many clients are you using for this test?

Only one but you can aggravate the issue if you open multiple tabs.

How can we replicate this test on our hardware?

Clone the repo above, execute dotnet run, open the browser to https://localhost:7017/ and click the button "Start client". The used memory is logged into the console and the browser documents the delay.

Our desired behavior is that the delay in the browser remains low (skipping messages to keep up) and the backend uses little extra memory - i.e. skipped messages are discarded. After all, the messages become stale quickly, so I consider many of the buffered messages useless.

jfheins avatar Aug 08 '22 10:08 jfheins

I'm more convinced now that you should introduce your own buffer to accomplish what you want. Treat SignalR like the transport layer instead of the messaging layer. You're flooding the transport (TCP) with lots of data, and the side effect is back pressure, which you are ignoring in this case. SignalR does not drop messages, and the timeout at the transport layer (the one described in my initial reply) is a mitigation to stop hostile actors from purposely slowing down the transport layer (by not reading data).

Controlling memory usage means that, in the face of back pressure, one of two things happens:

  1. The caller is stopped from sending more data.
  2. The caller drops messages.

SignalR doesn't drop messages (unless you cancelled the call), so if this is the behavior you want, I'd suggest some changes to your code to keep a handle to only the most recent message, since the older ones aren't important. Also make the change to respect back pressure when sending messages to clients.

Once you do that, you can play around with buffer sizes to see what gives the most satisfactory memory usage. Specifically this one
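A rough sketch of what that could look like (names, interval and the hub are placeholders, not a recommendation): producers overwrite a single "latest" slot, and the send loop awaits each SendAsync so back pressure is respected instead of queuing every update.

using Microsoft.AspNetCore.SignalR;
using Microsoft.Extensions.Hosting;

public class LatestStateBroadcaster : BackgroundService
{
    private readonly IHubContext<MyHub> _hub;
    private readonly object _gate = new();
    private object? _latest; // producers overwrite this slot; superseded states are simply lost

    public LatestStateBroadcaster(IHubContext<MyHub> hub) => _hub = hub;

    public void Publish(object state)
    {
        lock (_gate) { _latest = state; }
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var timer = new PeriodicTimer(TimeSpan.FromMilliseconds(50));
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            object? state;
            lock (_gate) { state = _latest; _latest = null; }
            if (state is null) continue;

            // Awaiting the send respects back pressure: while a slow transport drains,
            // newer states keep overwriting _latest instead of piling up in a queue.
            await _hub.Clients.All.SendAsync("clock", state, stoppingToken);
        }
    }
}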

davidfowl avatar Aug 08 '22 15:08 davidfowl

Over 99% of the messages are state messages, i.e. as soon as a newer state is known, all older states of the same widget are irrelevant. On the backend (.NET 6 application with Kestrel in a K8s cluster, connected via WebSockets), the memory consumption grows linearly while such a slow client is connected:

Hi there @jfheins, what is your application's GC mode? If you are running .NET in containers, you should change your GC mode.

Proof of concept: https://www.linkedin.com/posts/arminshoeibi_alwasyabrcareabrforabryourabrdevops-activity-6919318871346692096-BU26?utm_source=linkedin_share&utm_medium=member_desktop_web
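To check which mode is actually in effect, you can log it at startup (just a diagnostic sketch):

using System.Runtime;

// Log the effective GC configuration so the container's behavior can be
// compared against expectations (server vs. workstation GC).
Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}");

Switching modes is done with the ServerGarbageCollection MSBuild property in the project file or the DOTNET_gcServer environment variable (0 for workstation, 1 for server), so it can be toggled per deployment without code changes.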

ArminShoeibi avatar Aug 09 '22 20:08 ArminShoeibi

@jfheins I sent a PR to show how you would await backpressure here https://github.com/jfheins/SignalRRepro/pull/1.

Also to @ArminShoeibi's point, you should overlay garbage collections with the memory growth.

davidfowl avatar Aug 09 '22 21:08 davidfowl

@jfheins I sent a PR to show how you would await backpressure here jfheins/SignalRRepro#1.

Also to @ArminShoeibi's point, you should overlay garbage collections with the memory growth.

Thanks David, I've rewritten the code like this, but I have a memory leak.

public class MessageSender : BackgroundService
{
    private readonly byte[] _dummyData = new byte[2048];
    private IHubContext<MyHub> _hub;

    private readonly CancellationTokenSource cts = new(TimeSpan.FromSeconds(20));
    public MessageSender(IHubContext<MyHub> hubContext)
    {
        _hub = hubContext;
        Random.Shared.NextBytes(_dummyData);
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await Task.Delay(TimeSpan.FromSeconds(2));
        PeriodicTimer periodicTimer = new(TimeSpan.FromMilliseconds(5));
        while (await periodicTimer.WaitForNextTickAsync(stoppingToken))
        {
            var message = new { time = DateTime.UtcNow.ToString("O"), data = _dummyData };
            await _hub!.Clients.All.SendAsync("clock", message, cts.Token);
        }
    }
}
(screenshot: memory usage still growing over time)

ArminShoeibi avatar Aug 09 '22 21:08 ArminShoeibi

What's the leak?

davidfowl avatar Aug 09 '22 22:08 davidfowl

Hi @jfheins. We have added the "Needs: Author Feedback" label to this issue, which indicates that we have an open question for you before we can take further action. This issue will be closed automatically in 7 days if we do not hear back from you by then - please feel free to re-open it if you come back to this issue after that time.

ghost avatar Aug 09 '22 22:08 ghost

The cancellation token shouldn't be reused and should instead be disposed (or you can reset it).
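Roughly like this, against the ExecuteAsync loop above (sketch only, everything else unchanged): one short-lived, disposed token source per send instead of a single shared 20-second source.

protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    await Task.Delay(TimeSpan.FromSeconds(2), stoppingToken);
    var periodicTimer = new PeriodicTimer(TimeSpan.FromMilliseconds(5));
    while (await periodicTimer.WaitForNextTickAsync(stoppingToken))
    {
        var message = new { time = DateTime.UtcNow.ToString("O"), data = _dummyData };

        // One token source per send, linked to shutdown and disposed afterwards,
        // so timers and registrations don't accumulate on a shared instance.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(stoppingToken);
        cts.CancelAfter(TimeSpan.FromSeconds(20));
        try
        {
            await _hub.Clients.All.SendAsync("clock", message, cts.Token);
        }
        catch (OperationCanceledException) when (!stoppingToken.IsCancellationRequested)
        {
            // If the send was cancelled by the 20 s timeout, drop the message and move on.
        }
    }
}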

davidfowl avatar Aug 09 '22 22:08 davidfowl

After running this sample a few times, I'm convinced that the memory growth is just because no GCs have happened. I've run it under an allocation profiler for an hour (mostly because I forgot it was running 😄) and there were no GCs because it allocated so little. My machine has LOTS of cores though, so I'd like to run the runtime under whatever container limit configuration you were using.

@jfheins can you share how your container limits were configured?

PS: The back pressure situation should still be fixed.

davidfowl avatar Aug 10 '22 06:08 davidfowl

The cancellation token shouldn't be reused and should instead be disposed (or you can reset it).

I got it.

ArminShoeibi avatar Aug 10 '22 07:08 ArminShoeibi

...there were no GCs because it allocated so little. My machine has LOTS of cores though, so I'd like to run the runtime under whatever container limit configuration you were using.

@jfheins can you share how your container limits were configured?

@davidfowl did you configure any GC limit during your test?

ivanthelad avatar Aug 10 '22 12:08 ivanthelad

@jfheins can you share how your container limits were configured?

Sure 👍 this is what we use now (memory limit before this started happening was 1 GiB)

replicaCount: 1
service:
  type: ClusterIP
  port: 80
resources:
  limits:
    memory: 1500Mi
    cpu: '1.0'
  requests:
    memory: 140Mi
    cpu: '0.05'

As I understand this post, .NET should see and respect that limit and do GC in time.

If I add a GC.Collect() call in the logger, the memory seems to stay rather constant btw. I will re-run my demo project with limited memory to see the effect.

It seems that ASP.NET Core uses server GC by default, and we will look into whether workstation GC is better there.

PS: The back pressure situation should still be fixed.

The tricky thing here is that I do not want to degrade all clients to the level of the slowest one. So I'd have to do my own loop over all clients and send the message with a very short cancellation token (so it is only really sent if the client is idle)?

jfheins avatar Aug 10 '22 13:08 jfheins

I have experimented with the cancellation token a bit more and saw that the message delay can reach about 5x the token time. So with a token of 1s I saw message delay of 5s. A bit unintuitive but I think we can work with that.

jfheins avatar Aug 10 '22 14:08 jfheins

OK I've been playing with this for a bit and think I've come to some conclusions:

  • Buffering is done for performance reasons (as you can imagine), so when you call Clients.All.SendAsync, it's writing into a local buffer for each client and returning as fast as it can. That data is then read by the transport and written there (which has another buffer, and so on, all the way to the TLS layer and eventually the OS TCP buffer). There needs to be back pressure in these layers for the cancellation token to do anything useful. So when SendAsync returns, it has put data into each of the clients' buffers; you can make this very fast by having a short-lived cancellation token, but that will basically drop slow clients (completely disconnect them) - it will not help with making them catch up to new messages.
  • The cancellation token will make sure that slow clients don't affect faster clients or the entire broadcast, but there's no way to make slow clients drop older messages and catch up. The token will end up dropping clients that hit the threshold, so they'll disconnect and reconnect.
  • The memory on the server side should be fine with or without slow clients. I'm happy to hear that running an induced GC (via GC.Collect) reclaimed the growing memory; it should do that.

Here's my latest change that uses a global buffer https://github.com/jfheins/SignalRRepro/pull/2/files#diff-914617f2e5e469273ca2927ab60578f11dacd836d2a02f5e2189b6e8fbe01c61

The only way to make slow clients catch up is to keep track of all clients yourself, keep a buffer per client, and send messages to each client using that buffer. I plan to write that sample next.
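Until that sample exists, the shape would presumably be something like this (all names are made up for illustration): the hub registers each connection, every connection gets its own one-slot bounded channel, and a per-connection loop drains it, so only that client's stale states are dropped.

using System.Collections.Concurrent;
using System.Threading.Channels;
using Microsoft.AspNetCore.SignalR;

public class PerClientBuffer
{
    // connection id -> a one-slot queue that always holds only the newest state
    private readonly ConcurrentDictionary<string, Channel<object>> _queues = new();
    private readonly IHubContext<MyHub> _hub;

    public PerClientBuffer(IHubContext<MyHub> hub) => _hub = hub;

    // Called from MyHub.OnConnectedAsync
    public void AddClient(string connectionId, CancellationToken appStopping)
    {
        var channel = Channel.CreateBounded<object>(
            new BoundedChannelOptions(1) { FullMode = BoundedChannelFullMode.DropOldest });

        if (!_queues.TryAdd(connectionId, channel)) return;

        // One send loop per connection: a slow client only delays (and drops) its own messages.
        _ = Task.Run(async () =>
        {
            await foreach (var state in channel.Reader.ReadAllAsync(appStopping))
            {
                await _hub.Clients.Client(connectionId).SendAsync("clock", state, appStopping);
            }
        }, appStopping);
    }

    // Called from MyHub.OnDisconnectedAsync
    public void RemoveClient(string connectionId)
    {
        if (_queues.TryRemove(connectionId, out var channel)) channel.Writer.TryComplete();
    }

    // "Broadcast": write the newest state into every client's slot; full slots lose the stale value.
    public void Broadcast(object state)
    {
        foreach (var queue in _queues.Values) queue.Writer.TryWrite(state);
    }
}

Error handling and shutdown of the per-connection loops are omitted here for brevity.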

davidfowl avatar Aug 12 '22 05:08 davidfowl

PS: I was wrong about the setting that controls back pressure on the transport side 🤦🏾‍♂️. It's actually https://docs.microsoft.com/en-us/dotnet/api/microsoft.aspnetcore.http.connections.httpconnectiondispatcheroptions.transportmaxbuffersize?view=aspnetcore-6.0#microsoft-aspnetcore-http-connections-httpconnectiondispatcheroptions-transportmaxbuffersize. Though I'm having a hard time getting it to work well in my current spike.
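For reference, those knobs live on the options passed to MapHub; a sketch only, with illustrative numbers rather than recommendations:

app.MapHub<MyHub>("/myhub", options =>
{
    // Bytes the server buffers for data sent to the client before back pressure
    // is observed by the app; smaller values surface slow clients sooner.
    options.TransportMaxBufferSize = 32 * 1024;

    // Bytes buffered for data received from the client; included for symmetry.
    options.ApplicationMaxBufferSize = 32 * 1024;
});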

davidfowl avatar Aug 12 '22 06:08 davidfowl

Thanks for the engagement, it really makes me feel that our issue is heard and we're enabled to make more cool software 😊

First off, maybe the main reason I chose that title: when the memory accumulates, we are getting OOM exceptions. Having read more about the server GC, I understand that memory accumulation itself is intended behavior, with the GC kicking in as soon as the managed memory exceeds 75% of the pod limit. Therefore my best guess is that a significant chunk of that memory is unmanaged, so the GC is not triggered, yet the overall memory is insufficient. We can tune the behavior using env vars to set the limit differently, or workstation GC 👍

Here's my latest change that uses a global buffer

Thank you a lot, I will have a deeper look into that. The blocker for prod would be that all clients lose messages when one is slow.

The only way to make slow clients catch up is to keep track of all clients yourself, keep a buffer per clients, and send messages to that client using that buffer. I plan to write that sample next.

That might well be what we have to do then. It would be awesome to have a sample there. My idea so far would be to minimize message staleness by setting the buffer to 10 KB (large enough for the largest message plus margin) and essentially discard a few messages as soon as the send task is not "IsCompletedSuccessfully" immediately. (Here a "SendNowOrNever" method would still be nice 😇)

jfheins avatar Aug 13 '22 21:08 jfheins

I’ve been working on this issue for 3 nights now (it’s completely nerd-sniped me). I haven’t run with the limits yet but I’ve been focusing on the delay client side, not the server side. I’ll share what I have and what I’ve been debugging. I’ve been playing with configuring the buffers at all layers to minimize the delay and it’s still pretty unpredictable. On my machine I can’t get it under a 7-second delay, but I’m testing client and server on the same machine.

This sorta functionality would indeed need to be a new method. It’s not unreasonable but is a large work item.

davidfowl avatar Aug 13 '22 22:08 davidfowl

One last question, can you reduce the message rate? Is that also an option here?

davidfowl avatar Aug 13 '22 22:08 davidfowl

One last question, can you reduce the message rate? Is that also an option here?

I am currently working on some batching, i.e. collecting messages over 50 ms and sending an array. (The idea is that fewer, larger messages have less client overhead than many smaller ones.) The next idea is to send information about the viewport to the server and then filter the messages server side. Then users would have less data to process if they are zoomed in. For zoomed-out users, we could reduce data fidelity. We did not do that (yet) because we feared a lot of complex (per-client) logic in the backend. (We would have to generate "this widget left your viewport" events.) And with this option, compared to polling via SignalR, I am leaning more towards polling. (With polling there would be no queue, so the message delay should be minimal and the rate is set by the client, so it ticks many boxes already.)
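Roughly what I have in mind for the batching (just a sketch, names and the interval made up): collect whatever arrived during the window and send it as one array, so the client pays its per-message overhead far less often.

using System.Threading.Channels;
using Microsoft.AspNetCore.SignalR;
using Microsoft.Extensions.Hosting;

public class BatchingSender : BackgroundService
{
    private readonly IHubContext<MyHub> _hub;
    private readonly Channel<object> _incoming = Channel.CreateUnbounded<object>();

    public BatchingSender(IHubContext<MyHub> hub) => _hub = hub;

    public void Publish(object update) => _incoming.Writer.TryWrite(update);

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var timer = new PeriodicTimer(TimeSpan.FromMilliseconds(50));
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            var batch = new List<object>();
            while (_incoming.Reader.TryRead(out var update)) batch.Add(update);
            if (batch.Count == 0) continue;

            // One larger message every 50 ms instead of many small ones, so the
            // JS client runs deserialization and change detection once per batch.
            await _hub.Clients.All.SendAsync("positions", batch, stoppingToken);
        }
    }
}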

Also: Our users love to see all vehicles moving and sampling messages would reduce the awesomeness. They love it so much that they want to add many more vehicles over the next months. Every vehicle sends one message per second.

jfheins avatar Aug 14 '22 08:08 jfheins

As promised https://github.com/jfheins/SignalRRepro/commit/698f9536e413bd89936e84c4e790c5db28cb07ca

davidfowl avatar Aug 14 '22 14:08 davidfowl

As promised

I gave it a test run 👍 There is still plenty of delay building up in the client, but the memory server side looks very good (=constant) 💯

https://user-images.githubusercontent.com/5293502/184552957-50954eb5-fcfa-4918-91f4-1fa84198a1c7.mp4

Btw. when I open 2 tabs, I have the impression that only one of them gets messages at any given instant. (I'm running it now on a desktop machine, not on my work laptop.)

jfheins avatar Aug 14 '22 20:08 jfheins

I see this behavior too and can’t explain it yet. I want to verify that behavior with the clients and servers on different compute. It suggests there's a buffer somewhere, but I'm not sure where (there's another buffer in the networking stack, but I fiddled with that as well and saw similar results).

davidfowl avatar Aug 14 '22 20:08 davidfowl

I think the only things that could be done in this situation would be to:

  • Reduce the message rate server side; the client cannot keep up, so the server needs to send less often to the clients.
  • Reduce the delay on the client by having the client send an ack back to the server with the last ID it saw before the server sends the next message.

davidfowl avatar Aug 15 '22 14:08 davidfowl

OK, it's solved! I had to update to .NET 7 to make this easier, since we just added a new feature that makes this work (client results). That basically waits for the client to ack that it processed a message before moving to the next one. Doing it with .NET 6 is doable; it just needs more work to handle sending the ack back to the server manually.

See https://github.com/jfheins/SignalRRepro/commit/1d10044da9f601da53a20618b25f3d56f090a73b
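The server-side core of that feature looks roughly like this (a sketch; hub and method names are placeholders, the linked commit is the real version). The await only completes after the client's handler has run and returned a value.

using Microsoft.AspNetCore.SignalR;

public class MyHub : Hub
{
    // .NET 7 client results: InvokeAsync completes only after the client's "clock"
    // handler has returned a value, so the send rate is gated by how fast the
    // client actually processes messages.
    public async Task<bool> SendWithAck(object message)
    {
        return await Clients.Client(Context.ConnectionId)
            .InvokeAsync<bool>("clock", message, Context.ConnectionAborted);
    }
}

On the JS side (client library 7.0 or later), the handler acknowledges by returning a value (or a Promise) from the connection.on callback.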

https://user-images.githubusercontent.com/95136/184661346-28e95ea5-260d-4447-91a9-39077e672326.mp4

davidfowl avatar Aug 15 '22 15:08 davidfowl