grpc-go
Improve observability of hitting maxConcurrentStream limit
Use case(s) - what problem will this feature solve?
In our setup we are seeing latency spikes caused by hitting the maxConcurrentStreams limit enforced by the server. The problem is that we had to change the grpc code and rebuild the binary with custom logging to understand that this was the case. I propose improving observability here, either by exposing a metric or by emitting a log entry.
Proposed Solution
The naive solution here would be:
- Measure the latency of this for loop: https://github.com/grpc/grpc-go/blob/12b8fb52a18c8a1667dde7a4f8087ecdd2abbeaf/internal/transport/http2_client.go#L818-L842
- If the latency exceeds a hardcoded threshold (say, 50ms), log a message at a higher verbosity level (see the sketch after this list).
- A similar pattern is used in kubernetes/client-go for a different waiting time: https://github.com/kubernetes/client-go/blob/e7cd4ba474b5efc2882e377362c9aa8b407428d9/rest/request.go#L613-L615
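A minimal sketch of what this could look like, assuming a hypothetical helper wrapped around the existing wait loop; names such as `waitForHeaderWrite` and `slowStreamCreationThreshold` are illustrative and do not exist in grpc-go today:

```go
// Hypothetical sketch (not actual grpc-go code): time the header-write wait
// loop and log at a higher verbosity level when it exceeds a threshold,
// similar to the client-go throttle logging referenced above.
package transport

import (
	"time"

	"google.golang.org/grpc/grpclog"
)

// Assumed threshold; 50ms is the value suggested in this proposal.
const slowStreamCreationThreshold = 50 * time.Millisecond

var logger = grpclog.Component("transport")

// waitForHeaderWrite stands in for the existing loop in http2_client.go that
// blocks until the HEADERS frame is written, i.e. until a concurrent-stream
// slot becomes available.
func waitForHeaderWrite(doWait func()) {
	start := time.Now()
	doWait() // the existing for-loop would live here

	if elapsed := time.Since(start); elapsed > slowStreamCreationThreshold {
		// Only emitted at verbosity >= 2, so it stays quiet by default.
		if logger.V(2) {
			logger.Infof("creating stream: waited %v for controlbuf to write HEADERS (possibly blocked on MAX_CONCURRENT_STREAMS)", elapsed)
		}
	}
}
```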
This will make debugging easier in testing environments.
We can also expose a metric with a histogram of the waiting time, so that this can be tracked in production environments where we do not enable higher verbosity by default.
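A rough sketch of the metric side, assuming a Prometheus-style histogram; grpc-go does not expose such a metric today, and the metric name, buckets, and `observeStreamCreationWait` helper below are made up for illustration:

```go
// Hypothetical sketch: record the controlbuf wait time in a histogram so
// throttling can be tracked in production without raising log verbosity.
package transport

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var streamCreationWait = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "grpc_client_stream_creation_wait_seconds", // assumed metric name
	Help:    "Time spent waiting for the transport to write HEADERS for a new stream.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 12), // 1ms .. ~4s
})

func init() {
	prometheus.MustRegister(streamCreationWait)
}

// observeStreamCreationWait would be called with the elapsed time measured
// around the existing wait loop.
func observeStreamCreationWait(elapsed time.Duration) {
	streamCreationWait.Observe(elapsed.Seconds())
}
```

A histogram (rather than a counter or gauge) makes it possible to distinguish rare long stalls from a uniform shift in wait time, which is exactly the signal needed to attribute latency spikes to stream-creation throttling.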
Alternatives Considered
Additional Context
Hi @mborsz,
Thanks for bringing this to our attention. Seems like this is one way of going about it.
Measuring latency in that loop would let us know how long it took for the controlbuf to write the message on the wire. I do see the benefit of measuring the latency, but a latency spike there could translate to a lot of things, not only throttling by the maxConcurrentStreams limit. And latency in this loop would translate directly into a latency spike in the entire application.
thoughts?
I think utilizing channelz for something like this when we detect that we are blocked from creating streams might be the way to go. We've been talking about this recently as we are also taking a holistic look at our metrics in general.
Thanks for all the responses. My concern with using channelz alone is that, IIUC, it only lets us debug while the problem is happening live.
@arvindbr8 I think the controlbuf write latency is valuable. Seeing a log like 'controlbuf write took Xs' would already give a hint about where to look further.
@dfawley In my use case a log-based solution will work better than channelz (which, IIUC, only shows live problems), as we often debug tests where we observe latency spikes. Rerunning a test with increased log verbosity is doable, but watching the test live is much harder.
Thanks @mborsz. Seems like adding logs will give more insight into the controlbuf latency, and perhaps we should add a WARNING log when the latency threshold has been reached. @dfawley let me know if this is right.
I take that back. We discussed using channelz for this cross-language, but we were also considering metrics callouts instead, and we have not made a concrete decision here yet.
cc @yashykt - Yash, can you add this to the list of things you're tracking for metrics if it's not already covered?