h2
Library does not scale with multiple cores
As demonstrated by the benchmarks in this reddit post, the rust_tonic_mt benchmark falls behind in performance as the number of threads increases.
The likely cause is that a big portion of the shared state is behind this Mutex.
A couple of things to try:
- Replace the `std::sync::Mutex` with one from `parking_lot` (a minimal sketch of this swap follows below)
- Update hyper to keep all streams from a connection on a single thread
A bigger effort would be to replace the single massive lock with per-stream locks. There might still need to be a large lock, but the goal would be to avoid keeping it locked for long or for most operations.
It's not exactly the same here, but grpc-go made a similar change a couple of years ago to reduce contention: https://github.com/grpc/grpc-go/pull/1962
Some initial measurements from replacing the lock in `streams.rs` with one from parking_lot:
https://gist.github.com/bIgBV/4d6d76773a948734ebef1367ef5221d5
@bIgBV It seems that the results with parking_lot and the original implementation are similar?
The libstd Mutex was recently replaced with a new implementation that is both much smaller and significantly faster. There is much less to lose now with per-stream locking.
Resurrecting this old issue, but I think I'm hitting this bottleneck fairly acutely. I'm experimenting with using tonic to build something like a load-balancing proxy between grpc streams. I have X clients connecting over Y connections, each with Z streams. I then load balance the requests (mostly 1-1 request-response type requests) across I connections, each with J streams, to K downstream servers.
I was seeing fairly disappointing performance. If I have the external clients hit the backends directly, requests take ~200μs at a certain load level. With the proxy in play it's closer to 1ms. I started digging into this bottleneck and found this GitHub issue.
To isolate the problem further, I removed the server component and built a little client implementation (named pummel) that hammers the backend with requests across I connections, each with J streams. With any appreciable amount of concurrency, the performance shows similar characteristics to the proxy when compared to our external clients (which happen to be written in Elixir).
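The load shape pummel generates is roughly the sketch below, written against the h2 client API directly. The address, path, and counts are placeholders, and a real gRPC call would also need the gRPC framing and headers that tonic adds; the point is just that every stream on a connection is ultimately driven through that connection's shared state:

```rust
// Sketch of the pummel-style load: `I` connections, each driving `J`
// concurrent streams, using the h2 client API directly.
use std::error::Error;

use http::Request;
use tokio::net::TcpStream;

const CONNECTIONS: usize = 4;       // "I" connections
const STREAMS_PER_CONN: usize = 32; // "J" streams per connection
const REQUESTS_PER_STREAM: usize = 100;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    let mut conn_tasks = Vec::new();

    for _ in 0..CONNECTIONS {
        conn_tasks.push(tokio::spawn(async move {
            let tcp = TcpStream::connect("127.0.0.1:50051").await?;
            let (send_request, connection) = h2::client::handshake(tcp).await?;

            // The connection future drives all streams on this connection;
            // every stream synchronizes through the connection's shared state.
            tokio::spawn(async move {
                let _ = connection.await;
            });

            let mut stream_tasks = Vec::new();
            for _ in 0..STREAMS_PER_CONN {
                let mut send_request = send_request.clone();
                stream_tasks.push(tokio::spawn(async move {
                    for _ in 0..REQUESTS_PER_STREAM {
                        // Wait for capacity, send a bodyless request, and
                        // await the response head.
                        send_request = send_request.ready().await?;
                        let request = Request::builder()
                            .uri("http://127.0.0.1:50051/ping")
                            .body(())?;
                        let (response, _send_stream) =
                            send_request.send_request(request, true)?;
                        let _response = response.await?;
                    }
                    Ok::<_, Box<dyn Error + Send + Sync>>(())
                }));
            }
            for t in stream_tasks {
                t.await??;
            }
            Ok::<_, Box<dyn Error + Send + Sync>>(())
        }));
    }

    for t in conn_tasks {
        t.await??;
    }
    Ok(())
}
```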
In profiling pummel I see this lock using a significant amount of CPU time:
If I'm reading this correctly, over 11% of the CPU time is dedicated to this mutex.
Currently, this is all running in a single Tokio runtime. I can configure the number of grpc connections and streams used, so I may play with ideas like starting a separate Tokio runtime per core or having more connections with fewer streams in hopes of reducing contention on this lock.
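The runtime-per-core experiment would look roughly like this (with placeholder work per core); each current-thread runtime would own its own connections, so no h2 connection state is shared across cores:

```rust
// Illustrative only: one single-threaded Tokio runtime per core, each owning
// its own connections. The per-core work here is a placeholder.
use std::thread;

fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    let handles: Vec<_> = (0..cores)
        .map(|core| {
            thread::spawn(move || {
                // A current-thread runtime never migrates tasks, so all
                // streams created on it stay on this core.
                let rt = tokio::runtime::Builder::new_current_thread()
                    .enable_all()
                    .build()
                    .expect("failed to build runtime");

                rt.block_on(async move {
                    // Placeholder: open this core's connections and drive its
                    // share of the streams here.
                    println!("runtime on core slot {core} running");
                });
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```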
I don't really have any suggestions on how to improve this at the moment. Just wanted to share my findings. I'm glad to do any further testing if anyone has any ideas on how to improve this.
@jeffutter thanks for the excellent write-up! A way forward would be to do what I suggested: make per-stream locks so we only need to lock the stream store infrequently, when adding or removing a stream.
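To make the shape of that concrete (purely illustrative, not h2's actual types): the store-level lock would only be taken when a stream is added or removed, and the per-frame hot path would lock just that one stream's state:

```rust
// Hypothetical illustration of the suggested split: a store-level lock that is
// held only briefly on add/remove, plus a per-stream lock for per-frame work.
use std::collections::{HashMap, VecDeque};
use std::sync::{Arc, Mutex};

type StreamId = u32;

#[derive(Default)]
struct StreamState {
    // Frames queued for this stream, waiting to be written.
    pending_send: VecDeque<Vec<u8>>,
}

#[derive(Default)]
struct Store {
    // Locked only when streams are created or dropped.
    streams: Mutex<HashMap<StreamId, Arc<Mutex<StreamState>>>>,
}

impl Store {
    fn insert(&self, id: StreamId) -> Arc<Mutex<StreamState>> {
        let handle = Arc::new(Mutex::new(StreamState::default()));
        self.streams.lock().unwrap().insert(id, handle.clone());
        handle
    }

    fn remove(&self, id: StreamId) {
        self.streams.lock().unwrap().remove(&id);
    }
}

fn main() {
    let store = Store::default();
    let stream = store.insert(1);

    // Hot path: queue a frame by locking only this stream's state, leaving
    // other streams (and the store itself) untouched.
    stream.lock().unwrap().pending_send.push_back(b"frame".to_vec());

    store.remove(1);
}
```

With that split, a workload like yours that creates all of its streams up-front would only touch the store lock at startup.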
@seanmonstar Yeah, I think that would help my specific use case greatly, since I create all of the streams up-front and re-use them for many requests, so the global locks wouldn't occur mid-work. I might try to take a stab at making that change in my free time, although it'll probably take me a while to get up to speed on h2 internals. In the meantime, if anyone gives that a try or has any other ideas, I'd be glad to test them out.
@seanmonstar I’ve been reading through the h2 source code, that grpc-go issue, and the HTTP/2 spec. I’d like to take a stab at this. I’ll admit I’m new to h2 and to HTTP/2 in any capacity beyond being a user, so it’ll probably take me a bit to ramp up.
My understanding is that ultimately only one Frame can be written to the underlying IO at a time. So there needs to be either a single buffer of Frames to send, or a set of buffers plus some mechanism to choose which one to take the next frame from (roughly the idea sketched below). Currently all of the Frames get put in the SendBuffer on the Streams. It looks like each stream has its own pending_send Deque for its own frames. So, architecturally, do you see those components remaining the same, with the idea here being to break up some of the state in the Store, and maybe some of the Actions, so that they can be tracked on the stream itself?
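For the "set of buffers plus a picker" part, a toy version of what I have in mind (again hypothetical, not h2's actual scheduler) would be: each stream keeps its own pending_send queue, and a single writer pulls one frame at a time, round-robin across streams:

```rust
// Hypothetical sketch: per-stream queues feeding a single writer that emits
// one frame per turn, mirroring the constraint that only one frame can hit
// the underlying IO at a time.
use std::collections::VecDeque;

struct PendingStream {
    id: u32,
    pending_send: VecDeque<String>, // stand-in for queued frames
}

/// Round-robin over streams, taking at most one frame from each per pass.
fn drain_round_robin(streams: &mut Vec<PendingStream>) -> Vec<(u32, String)> {
    let mut written = Vec::new();
    loop {
        let mut progressed = false;
        for stream in streams.iter_mut() {
            if let Some(frame) = stream.pending_send.pop_front() {
                written.push((stream.id, frame));
                progressed = true;
            }
        }
        if !progressed {
            break;
        }
    }
    written
}

fn main() {
    let mut streams = vec![
        PendingStream { id: 1, pending_send: VecDeque::from(["a1".into(), "a2".into()]) },
        PendingStream { id: 3, pending_send: VecDeque::from(["b1".into()]) },
    ];
    // Interleaves frames across streams: (1, "a1"), (3, "b1"), (1, "a2").
    for (id, frame) in drain_round_robin(&mut streams) {
        println!("stream {id}: wrote frame {frame}");
    }
}
```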
Let me know if that’s making any sense 🙃 or if you have any other suggestions as to how you’d go about implementing this.
Also, if you have any general resources for understanding HTTP/2 streams and flow control beyond the spec I’d love to read up more there too.
Thanks again for any help here. Hopefully with a bit of guidance I can help find a solution.