opentelemetry-proto icon indicating copy to clipboard operation
opentelemetry-proto copied to clipboard

Minor spec change around UNAVAILABLE / RetryInfo

Open DylanRussell opened this issue 10 months ago • 10 comments

It doesn't make sense that we recommend a RetryInfo with time 0, none of the implementations i've seen special case this value, and it goes against the rest of the spec which says to set it to a real integer.

It also doesn't make since to say the server "MUST" use UNAVAILABLE for retryable errors, and then go on to mention other retryable error status codes..

DylanRussell avatar Jun 03 '25 20:06 DylanRussell

rest of the spec which says to set it to a real integer

Where does it say that?

dashpole avatar Jun 03 '25 20:06 dashpole

https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#otlpgrpc-throttling - this is the section that talks about how to use RetryInfo.

To signal backpressure when using gRPC transport, the server MUST return an error with code Unavailable and MAY supply additional details via status using RetryInfo. Here is a snippet of sample Go code to illustrate:

The only other mention of RetryInfo in the spec is:

The client SHOULD interpret RESOURCE_EXHAUSTED code as retryable only if the server signals that the recovery from resource exhaustion is possible. This is signaled by the server by returning a status containing RetryInfo. In this case the behavior of the server and the client is exactly as described in OTLP/gRPC Throttling section. If no such status is returned, then the RESOURCE_EXHAUSTED code SHOULD be treated as non-retryable.

This part of the spec I'm removing doesn't make sense.. Why must retryable errors return a RetryInfo of 0 and UNAVAILABLE ?

DylanRussell avatar Jun 03 '25 21:06 DylanRussell

i think this can get merged now

DylanRussell avatar Jun 06 '25 18:06 DylanRussell

@tigrannajaryan, this "MUST requirement" does not make sense and is conflicting with other parts of the specification.

This

The server MUST indicate retryable errors using code Unavailable

conficts with

The server MAY use other gRPC codes to indicate retryable [...] errors.

This

MAY supply additional details via status using RetryInfo containing 0 value of RetryDelay

seems to have only sense for RESOURCE_EXHAUSTED but it is is already described in a dedicated sentence

https://github.com/open-telemetry/opentelemetry-proto/blob/68f9c6329ca91f96333ee5b240e305945b122b70/docs/specification.md#L294-L301

it is very strange to call out 0 especially given there is also an example like this:

https://github.com/open-telemetry/opentelemetry-proto/blob/68f9c6329ca91f96333ee5b240e305945b122b70/docs/specification.md#L311-L341

pellared avatar Jun 10 '25 09:06 pellared

@DylanRussell, shouldn't we also change

https://github.com/open-telemetry/opentelemetry-proto/blob/68f9c6329ca91f96333ee5b240e305945b122b70/docs/specification.md?plain=1#L311

to

To signal backpressure when using gRPC transport, the server SHOULD return an

pellared avatar Jun 10 '25 09:06 pellared

Yeah those suggestions SG. I've updated the PR

DylanRussell avatar Jun 10 '25 13:06 DylanRussell

This fixes a bug in the spec and should be allowed.

tigrannajaryan avatar Jun 11 '25 13:06 tigrannajaryan

@open-telemetry/technical-committee, can someone merge it? :wink:

pellared avatar Jun 17 '25 15:06 pellared

Can't merge, the checks are stuck.

tigrannajaryan avatar Jun 17 '25 17:06 tigrannajaryan

@open-telemetry/technical-committee can we have a few more eyes on this? I want to make sure we didn't miss anything.

tigrannajaryan avatar Jun 17 '25 17:06 tigrannajaryan