grpc-go
Improve observability of hitting maxConcurrentStream limit
Use case(s) - what problem will this feature solve?
In our setup we are seeing latency spikes caused by hitting the maxConcurrentStreams limit enforced by the server. The problem is that we had to change the grpc code and rebuild the binary with custom logging to understand that this was the case. I propose improving observability here, either by exposing a metric or by emitting a log entry.
Proposed Solution
The naive solution here would be:
- Measure the latency of this for loop: https://github.com/grpc/grpc-go/blob/12b8fb52a18c8a1667dde7a4f8087ecdd2abbeaf/internal/transport/http2_client.go#L818-L842
- If the latency exceeds a hardcoded threshold (say, 50ms), log a message at a higher verbosity level (see the sketch after this list).
- A similar pattern is used in kubernetes/client-go for a different waiting time: https://github.com/kubernetes/client-go/blob/e7cd4ba474b5efc2882e377362c9aa8b407428d9/rest/request.go#L613-L615
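A minimal sketch of what this could look like, assuming a hypothetical helper wrapped around the existing wait loop; names such as `waitForHeaderWrite` and `slowStreamCreationThreshold` are illustrative and do not exist in grpc-go today:

```go
// Hypothetical sketch (not actual grpc-go code): time the header-write wait
// loop and log at a higher verbosity level when it exceeds a threshold,
// similar to the client-go throttle logging referenced above.
package transport

import (
	"time"

	"google.golang.org/grpc/grpclog"
)

// Assumed threshold; 50ms is the value suggested in this proposal.
const slowStreamCreationThreshold = 50 * time.Millisecond

var logger = grpclog.Component("transport")

// waitForHeaderWrite stands in for the existing loop in http2_client.go that
// blocks until the HEADERS frame is written, i.e. until a concurrent-stream
// slot becomes available.
func waitForHeaderWrite(doWait func()) {
	start := time.Now()
	doWait() // the existing for-loop would live here

	if elapsed := time.Since(start); elapsed > slowStreamCreationThreshold {
		// Only emitted at verbosity >= 2, so it stays quiet by default.
		if logger.V(2) {
			logger.Infof("creating stream: waited %v for controlbuf to write HEADERS (possibly blocked on MAX_CONCURRENT_STREAMS)", elapsed)
		}
	}
}
```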
This will make debugging easier in testing environments.
We can also expose a metric with a histogram of the waiting time, so that this can be tracked in production environments where we do not enable higher verbosity by default.
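A rough sketch of the metric side, assuming a Prometheus-style histogram; grpc-go does not expose such a metric today, and the metric name, buckets, and `observeStreamCreationWait` helper below are made up for illustration:

```go
// Hypothetical sketch: record the controlbuf wait time in a histogram so
// throttling can be tracked in production without raising log verbosity.
package transport

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var streamCreationWait = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "grpc_client_stream_creation_wait_seconds", // assumed metric name
	Help:    "Time spent waiting for the transport to write HEADERS for a new stream.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 12), // 1ms .. ~4s
})

func init() {
	prometheus.MustRegister(streamCreationWait)
}

// observeStreamCreationWait would be called with the elapsed time measured
// around the existing wait loop.
func observeStreamCreationWait(elapsed time.Duration) {
	streamCreationWait.Observe(elapsed.Seconds())
}
```

A histogram (rather than a counter or gauge) makes it possible to distinguish rare long stalls from a uniform shift in wait time, which is exactly the signal needed to attribute latency spikes to stream-creation throttling.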
Alternatives Considered
Additional Context
Hi @mborsz,
Thanks for bringing this to our attention. Seems like this is one way of going about it.
Measuring latency in that loop would let us know how long it took for the controlbuf to write the message on the wire. I do see the benefit of measuring the latency, but a latency spike there could translate to a lot of things, not only throttling by the maxConcurrentStreams limit. And latency in this loop would translate directly into a latency spike in the entire application.
thoughts?
I think utilizing channelz for something like this when we detect that we are blocked from creating streams might be the way to go. We've been talking about this recently as we are also taking a holistic look at our metrics in general.
Thanks for all the responses. My concern with using channelz alone is that, IIUC, it only lets us debug while the problem is happening live.
@arvindbr8 I think the controlbuf write latency is valuable. Seeing a log like 'controlbuf write took Xs' would already give a hint about where to look further.
@dfawley In my use case a log-based solution will work better than channelz (which, IIUC, only shows live problems), as we often debug tests where we observe latency spikes. Rerunning a test with increased log verbosity is doable, but watching the test live is much harder.
Thanks @mborsz. Seems like adding logs will give more insight into the controlbuf latency, and perhaps we should add a WARNING log when the latency threshold has been reached. @dfawley let me know if this is right.
I take that back. We discussed using channelz for this cross-language, but we were also considering metrics callouts instead, and we have not made a concrete decision here yet.
cc @yashykt - Yash, can you add this to the list of things you're tracking for metrics if it's not already covered?