grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

Retry Attempts are not available in `client-stats`

Open davinci26 opened this issue 6 months ago • 12 comments

Clients can configure non transparent retries either via XDS or via ServiceConfig however the stats handler have no good way to retrieve if a particular RPC method is a retry or not.

This type of observability is quite useful to determine if retry-storms(due to poor configurations) cause incidents and/or to see if retries are actually effective.

I thought of looking at the retry metadata header "grpc-previous-rpc-attempts" but this one is added by the http2_client as a header after the HandleRPC method is called so that didnt yield any results.

My first question is: Is this actually true or is there a way to get o11y on the non-transparent retries from the client?

My second question is: If it is not available, the numRetries is available when we construct the stats.Begin struct so if you are open to extending the stats.Begin struct, we can add it there.

Looking forward to your thoughts and also happy to implement the feature if it is not available

davinci26 avatar May 07 '25 19:05 davinci26

My first question is: Is this actually true or is there a way to get o11y on the non-transparent retries from the client?

We don't have retry metrics yet. There is a proposal in progress https://github.com/grpc/proposal/pull/488/files. A metric label grpc.client.call.retries is proposed which has a label grpc.retry_type that can take one of three values - "retry", "hedge", or "transparent". Please take a look at the proposal to see if it has everything what you need. Tagging @yashykt in case you have more questions regarding the proposal.

In tracing though we have a way to identify if an rpc attempt was a transparent retry or not. When span is created, an attribute with key transparent-retry is set with a boolean value indicating whether the stream is undergoing a transparent retry. Tracing is still in experimental state. You can enable tracing through dial options https://github.com/grpc/grpc-go/blob/master/examples/features/opentelemetry/client/main.go#L65

purnesh42H avatar May 08 '25 12:05 purnesh42H

In tracing though we have a way to identify if an rpc attempt was a transparent retry or not.

Seems reasonable, have to check our implementation of tracing to see if we include this or not but thanks for the pointer.

We don't have retry metrics yet. There is a proposal in progress https://github.com/grpc/proposal/pull/488/files. A metric label grpc.client.call.retries is proposed which has a label grpc.retry_type that can take one of three values - "retry", "hedge", or "transparent".

Cool proposal, you guys are one step ahead :D. Everything in the proposal makes sense.

Havent thought this fully so take this with a bit of a grain of salt but given the proposal above and the fact that the otel implementation in grpc-go is a wrapper around StatsHandler would it make sense to extend stats.Begin with:

  1. numRetries
  2. If an RPC is a hedge or not

That way consumers who have an implementation of the stats handler can implement similar metrics as well.

davinci26 avatar May 08 '25 12:05 davinci26

@davinci26 The OpenTelemetry (OTel) plugin provides trace attributes for RPC attempts that help in tracking retries. As you can see in https://github.com/grpc/grpc-go/blob/master/stats/opentelemetry/trace.go#L50-L51 (in the populateSpan function for *stats.Begin), these include:

  • previous-rpc-attempts: This attribute indeed indicates the number of prior attempts for the RPC. The OTel stats.Handler currently calculates this value internally (using ai.previousRPCAttempts).
  • transparent-retry: This boolean flag is directly sourced from stats.Begin.IsTransparentRetryAttempt and clearly signals if the attempt is a gRPC-handled transparent retry.

And yes, for retry-specific "metrics", there's a proposal in progress.

stats.Handler is experimental and for common observability requirements, it's not recommended as the primary solution for most users. The preferred path is to identify what's missing in the official plugins (like the OTel or OpenCensus ones) and work to add those features if they benefit a broad range of users. For users who need very specific, custom metrics that aren't covered by standard tooling, the stats.Handler interface is precisely the mechanism gRPC-Go provides for such advanced customization. Information like numRetries and the type of retry (e.g., standard vs. hedged) are common requirements. The goal is to make such essential data easily accessible, ideally out-of-the-box via the OTel plugin.

purnesh42H avatar May 12 '25 18:05 purnesh42H

previous-rpc-attempts: This attribute indeed indicates the number of prior attempts for the RPC. The OTel stats.Handler currently calculates this value internally (using ai.previousRPCAttempts).

Does this include service level retries? I might be missing something super obvious here but I see that:

ai := &attemptInfo{
              ....
}

is added to the ctx when the RPC is tagged so retries would have a different ctx. Basically I cant understand how https://github.com/grpc/grpc-go/blob/709023de87a25ae63b000139295af10589edffee/stats/opentelemetry/trace.go#L54 can be called twice for the same RPC for non-transparent retries.

stats.Handler is experimental and for common observability requirements, it's not recommended as the primary solution for most users.

100% agree, stats.Handler is a pretty low-level interface that only makes sense for a limited subset of folks and 99% of the users would be better suited with the standard OTel plugin. But I am not sure if we are agreeing or not on the next bit. Should we add the numRetries in the stats.Begin struct? I can see this being helpful both for the OTel plugin as it would make the code easier and also easier for us few that work with the stats.Handler directly.

davinci26 avatar May 12 '25 19:05 davinci26

@davinci26 you are right about the problem with attemptInfo. Each attempt of rpc gets a new attempt info so its wrong to maintain the counter of previous rpc attempts within that. This is a bug. I think the right way to do this is to funnel the clientStream.numRetries all the way to HandleRPC and record the value in "previous-rpc-attempts". clientStream.numRetries exclusively counts non-transparent retries.

Once we do that, we don't have to make it separately available in stats.Begin. Although i guess its fine even if we add numRetries to stats.Begin as an exported field? @dfawley do you see any issue with exposing numRetries in stats.Begin?

purnesh42H avatar May 16 '25 19:05 purnesh42H

Retry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in TagRPC.

dfawley avatar May 16 '25 19:05 dfawley

Retry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in TagRPC.

What is the issue with funneling cs.numRetries? We need to also make sure to exclude transparent retries https://github.com/grpc/proposal/blob/master/A72-open-telemetry-tracing.md#tracing-information. clientStream.numRetries already have what we want.

purnesh42H avatar May 16 '25 19:05 purnesh42H

Sorry, I missed the bit about excluding transparent retries. You can wait to increment when stats.Begin happens, then, and check that struct's IsTransparentRetryAttempt field.

What is the issue with funneling cs.numRetries?

Generally, adding things requires proof that it is either necessary or sufficiently convenient. So the question should be the other way around: what is the issue with doing it the way that is already possible?

Also...we don't support hedging yet, but when/if we do, those likely wouldn't be included in the clientStream.numRetries, and it would be wrong to not count hedging attempts in our tracing instrumentation.

dfawley avatar May 16 '25 20:05 dfawley

Generally, adding things requires proof that it is either necessary or sufficiently convenient. So the question should be the other way around: what is the issue with doing it the way that is already possible?

Let me code up the suggestion above so take this statement with 75% confidence but where my head is at is that this fairly complicated/non-ergonomic. I appreciate the aversion towards extending the api surface but Retry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in TagRPC. and on top You can wait to increment when stats.Begin happens, then, and check that struct's IsTransparentRetryAttempt field. seems fairly easy to get it wrong for something that is pretty useful o11y for service owners. Am I retrying too much

Edit: I think where I am coming from is that I already got this wrong once and I couldnt think of a way to implement the retries hence my question and then we realized that even the grpc-go implementation didn't get this 100% right so maybe we can do better and make it easier for folks to implement this.

davinci26 avatar May 16 '25 20:05 davinci26

The stats package's API is mainly intended for internal usage or for power users. Optimal usability is not really a goal (it already has a ton of rough edges).

Our goal is that 99% of users should be able to use the OpenCensus or OpenTelemetry plugins, instead, and that those should be easy to configure and use.

dfawley avatar May 16 '25 21:05 dfawley

Make sense @dfawley, thanks for the feedback and the pointer!

I still have my reservations:

  • Recommending to power users this approach via extending the context is going to create an implicit api for grpc-go that is much harder to test and modify on the OSS side. But I think this tradeoff is yours to decide, I am happy and I can support either outcome.
  • I think it would simplify the implementation of both OpenCensus/OpenTelemetry plugins + power users.

I think the main advantage on my side is that this method is available right now with the current release, so I ended up implementing it. It works well so far.

(issue is closed on my side, but I think it is used to track the bug on the OTEL plugin so leaving it open)

davinci26 avatar May 18 '25 22:05 davinci26

@purnesh42H I'm looking into this. Please assign the issue to me.

vinothkumarr227 avatar May 19 '25 05:05 vinothkumarr227

The PR seems to have introduced a flaky test: https://github.com/grpc/grpc-go/actions/runs/17614152882/job/50042942932?pr=8547

It's being rolled back.

arjan-bal avatar Sep 10 '25 12:09 arjan-bal

Part of the discussion here is about exposing retry metrics, which is being tracked as part of gRFC A96 and is a stretch goal for this team for Q4'25.

Part of the discussion is about a bug in the oTel implementation about where the number of retries is being tracked. I've created a new issue for that, to keep the discussion pertinent to the issue on hand.

easwars avatar Nov 11 '25 20:11 easwars