tracing icon indicating copy to clipboard operation
tracing copied to clipboard

reqwest seems to hang around after instrumented function is done

Open SorenHolstHansen opened this issue 5 months ago • 2 comments

Bug Report

Version

reqwest_tracing_bug v0.1.0
│   │   └── tracing v0.1.40
│   │       ├── tracing-attributes v0.1.27 (proc-macro)
│   │       └── tracing-core v0.1.32
│   │   │   └── tracing v0.1.40 (*)
│   │   └── tracing v0.1.40 (*)
│   │   └── tracing v0.1.40 (*)
│   └── tracing v0.1.40 (*)
│   │       │   └── tracing v0.1.40 (*)
│   │       └── tracing v0.1.40 (*)
│   └── tracing v0.1.40 (*)
├── tracing v0.1.40 (*)
├── tracing-opentelemetry v0.26.0
│   ├── tracing v0.1.40 (*)
│   ├── tracing-core v0.1.32 (*)
│   ├── tracing-log v0.2.0
│   │   └── tracing-core v0.1.32 (*)
│   └── tracing-subscriber v0.3.18
│       ├── tracing v0.1.40 (*)
│       ├── tracing-core v0.1.32 (*)
│       └── tracing-log v0.2.0 (*)
└── tracing-subscriber v0.3.18 (*)

Platform

Darwin Srens-MacBook-Air.local 23.6.0 Darwin Kernel Version 23.6.0: Wed Jul 31 20:53:05 PDT 2024; root:xnu-10063.141.1.700.5~1/RELEASE_ARM64_T8112 arm64

Description

We have noticed a weird situation in our production traces for our axum server where sometimes an endpoint, even if the endpoint makes a response very quickly, that the traces say that the endpoint took a very long time. This doesn't happen all the time, but when it happens, it is always related to reqwest calls (i.e. the reqwest crate).

A trace could look something like this (this is how it looks in our prod metrics platform in azure, you will see below that another platform shows the endpoint as shorter)

endpoint:     -------------
something:    --
reqwest_call:   -----------
stuff_after:       --
response:           --

that is, the response is done quite quickly, but because the reqwest call seems to hang, the whole endpoint shows as taking a long time. Now, the reqwest calls actually all finish very quickly. You can see that the "stuff_after" continues shortly after, and we have done other approaches to show that they actually finish fast. So something is hanging in the reqwest client that makes this show as taking a long time.

We even added a field to the root request span (using tower tracing) that shows that the endpoint is done quickly. So from the users point of view, nothing is taking a long time, but for us, we can't trust our tracing.

I made a reproducible case here: https://github.com/SorenHolstHansen/reqwest_tracing_bug

To reproduce, I started a local open telemetry platform (signoz or jaeger), started the server in the repo i linked, and then ran seq 1 5000 | xargs -P0 -I{} curl http://127.0.0.1:8008

In signoz, you should then be able to see cases like these: Screenshot 2024-09-26 at 12 18 23

You can see in the bottom right, that the endpoint took 0 seconds, but resolve took 35 seconds.

I am not sure if this is strictly the fault of tracing, or something weird happens internally in reqwest or somewhere else.

SorenHolstHansen avatar Sep 26 '24 10:09 SorenHolstHansen