h2
Library does not scale with multiple cores
As demonstrated by the benchmarks in this reddit post, the rust_tonic_mt benchmark falls behind in performance as the number of threads increases.
The likely cause is that a big portion of the shared state is behind this Mutex.
A couple of things to try:
- Replace the `std::sync::Mutex` with one from `parking_lot` (a minimal sketch of this swap follows below)
- Update hyper to keep all streams from a connection on a single thread
A bigger effort would be to replace the single massive lock with per-stream locks. There might still need to be a large lock, but the goal would be to avoid keeping it locked for long or for most operations.
It's not exactly the same here, but grpc-go made a similar change a couple of years ago to reduce contention: https://github.com/grpc/grpc-go/pull/1962
Some initial measurements from replacing the lock in `streams.rs` with one from parking_lot:
https://gist.github.com/bIgBV/4d6d76773a948734ebef1367ef5221d5
@bIgBV It seems that the results with parking_lot and the original implementation are similar?
The libstd Mutex was recently replaced with a new implementation that is both much smaller and significantly faster. There is much less to lose now with per-stream locking.
Resurrecting this old issue, but I think I'm hitting this bottleneck fairly acutely. I'm experimenting with using tonic to build something like a load-balancing proxy between grpc streams. I have X clients connecting over Y connections, each with Z streams. I then load balance the requests (mostly 1-1 request-response type requests) across I connections, each with J streams, to K downstream servers.
I was seeing fairly disappointing performance. If I have the external clients hit the backends directly, requests take ~200μs at a certain load level. With the proxy in play it's closer to 1ms. I started digging into this bottleneck and found this GitHub issue.
To isolate the problem further, I removed the server component and built a little client implementation (named pummel) that hammers the backend with requests across I connections, each with J streams. With any appreciable amount of concurrency, the performance shows similar characteristics to the proxy when compared to our external clients (which happen to be written in Elixir).
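The load shape pummel generates is roughly the sketch below, written against the h2 client API directly. The address, path, and counts are placeholders, and a real gRPC call would also need the gRPC framing and headers that tonic adds; the point is just that every stream on a connection is ultimately driven through that connection's shared state:

```rust
// Sketch of the pummel-style load: `I` connections, each driving `J`
// concurrent streams, using the h2 client API directly.
use std::error::Error;

use http::Request;
use tokio::net::TcpStream;

const CONNECTIONS: usize = 4;       // "I" connections
const STREAMS_PER_CONN: usize = 32; // "J" streams per connection
const REQUESTS_PER_STREAM: usize = 100;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    let mut conn_tasks = Vec::new();

    for _ in 0..CONNECTIONS {
        conn_tasks.push(tokio::spawn(async move {
            let tcp = TcpStream::connect("127.0.0.1:50051").await?;
            let (send_request, connection) = h2::client::handshake(tcp).await?;

            // The connection future drives all streams on this connection;
            // every stream synchronizes through the connection's shared state.
            tokio::spawn(async move {
                let _ = connection.await;
            });

            let mut stream_tasks = Vec::new();
            for _ in 0..STREAMS_PER_CONN {
                let mut send_request = send_request.clone();
                stream_tasks.push(tokio::spawn(async move {
                    for _ in 0..REQUESTS_PER_STREAM {
                        // Wait for capacity, send a bodyless request, and
                        // await the response head.
                        send_request = send_request.ready().await?;
                        let request = Request::builder()
                            .uri("http://127.0.0.1:50051/ping")
                            .body(())?;
                        let (response, _send_stream) =
                            send_request.send_request(request, true)?;
                        let _response = response.await?;
                    }
                    Ok::<_, Box<dyn Error + Send + Sync>>(())
                }));
            }
            for t in stream_tasks {
                t.await??;
            }
            Ok::<_, Box<dyn Error + Send + Sync>>(())
        }));
    }

    for t in conn_tasks {
        t.await??;
    }
    Ok(())
}
```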
In profiling pummel I see this lock using a significant amount of CPU time:
If I'm reading this correctly, over 11% of the CPU time is dedicated to this mutex.
Currently, this is all running in a single Tokio runtime. I can configure the number of grpc connections and streams used, so I may play with ideas like starting a separate Tokio runtime per core or having more connections with fewer streams in hopes of reducing contention on this lock.
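The runtime-per-core experiment would look roughly like this (with placeholder work per core); each current-thread runtime would own its own connections, so no h2 connection state is shared across cores:

```rust
// Illustrative only: one single-threaded Tokio runtime per core, each owning
// its own connections. The per-core work here is a placeholder.
use std::thread;

fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    let handles: Vec<_> = (0..cores)
        .map(|core| {
            thread::spawn(move || {
                // A current-thread runtime never migrates tasks, so all
                // streams created on it stay on this core.
                let rt = tokio::runtime::Builder::new_current_thread()
                    .enable_all()
                    .build()
                    .expect("failed to build runtime");

                rt.block_on(async move {
                    // Placeholder: open this core's connections and drive its
                    // share of the streams here.
                    println!("runtime on core slot {core} running");
                });
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```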
I don't really have any suggestions on how to improve this at the moment. Just wanted to share my findings. I'm glad to do any further testing if anyone has any ideas on how to improve this.
@jeffutter thanks for the excellent write-up! A way forward would be to do what I suggested: make per-stream locks so we only need to lock the stream store infrequently, when adding or removing a stream.
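To make the shape of that concrete (purely illustrative, not h2's actual types): the store-level lock would only be taken when a stream is added or removed, and the per-frame hot path would lock just that one stream's state:

```rust
// Hypothetical illustration of the suggested split: a store-level lock that is
// held only briefly on add/remove, plus a per-stream lock for per-frame work.
use std::collections::{HashMap, VecDeque};
use std::sync::{Arc, Mutex};

type StreamId = u32;

#[derive(Default)]
struct StreamState {
    // Frames queued for this stream, waiting to be written.
    pending_send: VecDeque<Vec<u8>>,
}

#[derive(Default)]
struct Store {
    // Locked only when streams are created or dropped.
    streams: Mutex<HashMap<StreamId, Arc<Mutex<StreamState>>>>,
}

impl Store {
    fn insert(&self, id: StreamId) -> Arc<Mutex<StreamState>> {
        let handle = Arc::new(Mutex::new(StreamState::default()));
        self.streams.lock().unwrap().insert(id, handle.clone());
        handle
    }

    fn remove(&self, id: StreamId) {
        self.streams.lock().unwrap().remove(&id);
    }
}

fn main() {
    let store = Store::default();
    let stream = store.insert(1);

    // Hot path: queue a frame by locking only this stream's state, leaving
    // other streams (and the store itself) untouched.
    stream.lock().unwrap().pending_send.push_back(b"frame".to_vec());

    store.remove(1);
}
```

With that split, a workload like yours that creates all of its streams up-front would only touch the store lock at startup.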
@seanmonstar Yeah, I think that would help my specific use case greatly, since I create all of the streams up-front and re-use them for many requests, so the global locks wouldn't occur mid-work. I might try to take a stab at making that change in my free time, although it'll probably take me a while to get up to speed on h2 internals. In the meantime, if anyone gives that a try or has any other ideas, I'd be glad to test them out.
@seanmonstar I’ve been reading through the h2 source code, that grpc-go issue, and the HTTP/2 spec. I’d like to take a stab at this. I’ll admit I’m new to h2 and to HTTP/2 in any capacity beyond being a user, so it’ll probably take me a bit to ramp up.
My understanding is that ultimately only one Frame can be written to the underlying IO at a time. So there needs to be either a single buffer of Frames to send, or a set of buffers plus some mechanism to choose which one to take the next frame from (roughly the idea sketched below). Currently all of the Frames get put in the SendBuffer on the Streams. It looks like each stream has its own pending_send Deque for its own frames. So, architecturally, do you see those components remaining the same, with the idea here being to break up some of the state in the Store, and maybe some of the Actions, so that they can be tracked on the stream itself?
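For the "set of buffers plus a picker" part, a toy version of what I have in mind (again hypothetical, not h2's actual scheduler) would be: each stream keeps its own pending_send queue, and a single writer pulls one frame at a time, round-robin across streams:

```rust
// Hypothetical sketch: per-stream queues feeding a single writer that emits
// one frame per turn, mirroring the constraint that only one frame can hit
// the underlying IO at a time.
use std::collections::VecDeque;

struct PendingStream {
    id: u32,
    pending_send: VecDeque<String>, // stand-in for queued frames
}

/// Round-robin over streams, taking at most one frame from each per pass.
fn drain_round_robin(streams: &mut Vec<PendingStream>) -> Vec<(u32, String)> {
    let mut written = Vec::new();
    loop {
        let mut progressed = false;
        for stream in streams.iter_mut() {
            if let Some(frame) = stream.pending_send.pop_front() {
                written.push((stream.id, frame));
                progressed = true;
            }
        }
        if !progressed {
            break;
        }
    }
    written
}

fn main() {
    let mut streams = vec![
        PendingStream { id: 1, pending_send: VecDeque::from(["a1".into(), "a2".into()]) },
        PendingStream { id: 3, pending_send: VecDeque::from(["b1".into()]) },
    ];
    // Interleaves frames across streams: (1, "a1"), (3, "b1"), (1, "a2").
    for (id, frame) in drain_round_robin(&mut streams) {
        println!("stream {id}: wrote frame {frame}");
    }
}
```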
Let me know if that’s making any sense 🙃 or if you have any other suggestions as to how you’d go about implementing this.
Also, if you have any general resources for understanding HTTP/2 streams and flow control beyond the spec I’d love to read up more there too.
Thanks again for any help here. Hopefully with a bit of guidance I can help find a solution.