opentelemetry-specification HTTP retry spec clarification needed

What are you trying to achieve?

I started working on implementing the HTTP retry spec in the Java HTTP client instrumentations, and I ran into several questions that I'd like to clarify.

According to this example redirects don't add the http.retry_count attribute, is that right?
What about authorization scenarios? E.g. the server returns 401, so the HTTP client automatically adds a new Authorization header and retries the request. Should this scenario follow the same behavior as redirects?

Does the retry counter reset if e.g. redirect requests are retried? E.g.

GET / - 500 (CLIENT, trace=t1, span=s1)
 |
 --- server (SERVER, trace=t1, span=s2)

GET / - 302 (CLIENT, trace=t2, span=s1, http.retry_count=1)
 |
 --- server (SERVER, trace=t2, span=s2)

GET /hello - 500 (CLIENT, trace=t3, span=s1)
 |
 --- server (SERVER, trace=t3, span=s2)

GET /hello - 200 (CLIENT, trace=t4, span=s1, http.retry_count=1)
 |
 --- server (SERVER, trace=t4, span=s2)

I've skimmed through several HTTP clients sources, and some of them use the same code for all kind of retries (redirects, auth, actual retries). Would it be correct to assume that status codes >=400 (or 500? depending on the auth scenario I guess) should be treated as retries? Request finished with 3xx (and lower) would not get the retry count attribute.

Additional context.

https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/5722

Aug 17 '22 15:08 mateuszrzeszutek

I feel like retry & redirect should be counted with the same counter (they are both "re-sends", see https://www.rfc-editor.org/rfc/rfc9110#name-redirection-3xx). One could add something like a "resend_reason" that, apart from distinguishing redirect and other retries, could also give more detailed information on the "retry" (e.g. retry with authorization, retry after network error, retry after HTTP error)

Aug 17 '22 16:08 Oberon00

We need an assignee for this.

Aug 17 '22 16:08 tigrannajaryan

I feel like retry & redirect should be counted with the same counter (they are both "re-sends", see https://www.rfc-editor.org/rfc/rfc9110#name-redirection-3xx).

I totally agree with that, after thinking about this for a while I too came to a conclusion that we should count all retries, no matter the reason. That's mostly because:

if we were to only count >=500 responses as retries, then this behavior would be in contradiction with how the HTTP client span status is calculated: some errors (40x), even if they cause the request to be re-sent, are not treated as retries.
if we were to count >=400 responses as retries, then there would be a weird discrepancy between how redirects and auth errors are treated. Both are more or less standard scenarios for re-sending the request in the HTTP spec ("The user agent MAY repeat the request with a new or replaced Authorization header field", and, in fact, probably most of the HTTP clients are able to do that automatically), but for some reason we're treating one of them in a different way.
so, to sum up, to me the most sensible solution is to bump the retry counter every time the HTTP request gets re-sent, no matter what the cause is.

One could add something like a "resend_reason" that, apart from distinguishing redirect and other retries, could also give more detailed information on the "retry" (e.g. retry with authorization, retry after network error, retry after HTTP error)

This could be quite easily derived from the HTTP status (or the exception event, in case of some network/IO error), so I think I'd initially skip this.

We need an assignee for this.

If it's about actually implementing the changes and making a PR, you can assign this to me.

Aug 18 '22 09:08 mateuszrzeszutek

opentelemetry-specification opentelemetry-specification copied to clipboard

HTTP retry spec clarification needed

opentelemetry-specification
opentelemetry-specification copied to clipboard