reqwest
reqwest copied to clipboard
resolve tracing span sometimes lingers even after response received
We have noticed a weird situation in our production traces for our axum server where sometimes an endpoint, even if the endpoint makes a response very quickly, that the traces say that the endpoint took a very long time. This doesn't happen all the time, but when it happens, it is always related to reqwest calls (i.e. the reqwest crate).
A trace could look something like this (this is how it looks in our prod metrics platform in azure, you will see below that another platform shows the endpoint as shorter)
endpoint: -------------
something: --
reqwest_call: -----------
stuff_after: --
response: --
that is, the response is done quite quickly, but because the reqwest call seems to hang, the whole endpoint shows as taking a long time. Now, the reqwest calls actually all finish very quickly. You can see that the "stuff_after" continues shortly after, and we have done other approaches to show that they actually finish fast. So something is hanging in the reqwest client that makes this show as taking a long time.
We even added a field to the root request span (using tower tracing) that shows that the endpoint is done quickly. So from the users point of view, nothing is taking a long time, but for us, we can't trust our tracing.
I made a reproducible case here: https://github.com/SorenHolstHansen/reqwest_tracing_bug
To reproduce, I started a local open telemetry platform (signoz or jaeger), started the server in the repo i linked, and then ran seq 1 5000 | xargs -P0 -I{} curl http://127.0.0.1:8008
In signoz, you should then be able to see cases like these:
You can see in the bottom right, that the endpoint took 0 seconds, but resolve took 35 seconds.
I created a similar issue earlier in the tracing crate, which led me to believe that reqwest has long lived spans.