Legacy Client pooling messes tower::TraceLayer tracing
Version hyper 1.6.0 hyper-util 0.1.12 tower 0.5.2 tower-http 0.6.4
Platform
Linux 6.10.14-linuxkit #1 SMP Tue Apr 15 16:00:54 UTC 2025 aarch64 GNU/Linux
Description When using tower_http::TraceLayer with hyper_util::client::legacy::Client, with HTTP/1 requests, the request spans are kept alive all the way until the underlying idle connection is disposed (90s by default). This prevents us from effectively monitoring kubernetes controller written with kube-rs (which uses this combination).
After looking deeper into it, here's what's happening:
- TraceLayer correctly creates a new span.
- Client creates a new TCP connection via send_request -> try_send_request -> connection_for -> one_connection_for -> connect_to.
- After the handshake, connect_to creates a background dispatcher task to manage the connection.
- TokioExecutor spawns that task after wrapping it with
.in_current_span(), which copies the span. - Now original request finishes, response is processed and original copy of span is dropped. This is where the span should ideally be dropped, reporting the end of the request.
- Instead, the underlying connection is moved to the pool as "idle" and the dispatcher task keeps a reference to the span. Only after the connection is disposed (90s by default), do we see the span dropped and span data reported to the OTEL server.
- We see a lot of requests with duration around 90s, which makes it hard to work with the reported metrics.
Here's the full stack trace of when the span is getting cloned
3: tracing_core::dispatcher::Dispatch::clone_span
at <index.crates.io>/tracing-core-0.1.32/src/dispatcher.rs:659:13
4: tracing::span::Span::current::{{closure}}
at <index.crates.io>/tracing-0.1.40/src/span.rs:552:26
5: tracing_core::dispatcher::get_default
at <index.crates.io>/tracing-core-0.1.32/src/dispatcher.rs:391:16
6: tracing::span::Span::current
at <index.crates.io>/tracing-0.1.40/src/span.rs:550:9
7: tracing::instrument::Instrument::in_current_span
at <index.crates.io>/tracing-0.1.40/src/instrument.rs:129:25
8: <hyper_util::rt::tokio::TokioExecutor as hyper::rt::Executor<Fut>>::execute
at <index.crates.io>/hyper-util-0.1.12/src/rt/tokio.rs:112:22
9: hyper_util::common::exec::Exec::execute
at <index.crates.io>/hyper-util-0.1.12/src/common/exec.rs:34:17
10: hyper_util::client::legacy::client::Client<C,B>::connect_to::{{closure}}::{{closure}}::{{closure}}
at <index.crates.io>/hyper-util-0.1.12/src/client/legacy/client.rs:585:37
11: <core::pin::Pin<P> as core::future::future::Future>::poll
at <rust-lib>/core/src/future/future.rs:124:9
12: <futures_util::future::either::Either<A,B> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/future/either.rs:109:32
13: <F as futures_core::future::TryFuture>::try_poll
at <index.crates.io>/futures-core-0.3.30/src/future.rs:82:9
14: <futures_util::future::try_future::try_flatten::TryFlatten<Fut,<Fut as futures_core::future::TryFuture>::Ok> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/future/try_future/try_flatten.rs:57:41
15: <futures_util::future::try_future::TryFlatten<Fut1,Fut2> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/lib.rs:91:13
16: <futures_util::future::try_future::AndThen<Fut1,Fut2,F> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/lib.rs:91:13
17: <futures_util::future::either::Either<A,B> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/future/either.rs:109:32
18: <hyper_util::common::lazy::Lazy<F,R> as core::future::future::Future>::poll
at <index.crates.io>/hyper-util-0.1.12/src/common/lazy.rs:64:20
19: futures_util::future::future::FutureExt::poll_unpin
at <index.crates.io>/futures-util-0.3.30/src/future/future/mod.rs:558:9
20: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/future/select.rs:118:35
21: hyper_util::client::legacy::client::Client<C,B>::one_connection_for::{{closure}}
at <index.crates.io>/hyper-util-0.1.12/src/client/legacy/client.rs:435:49
22: hyper_util::client::legacy::client::Client<C,B>::connection_for::{{closure}}
at <index.crates.io>/hyper-util-0.1.12/src/client/legacy/client.rs:386:61
23: hyper_util::client::legacy::client::Client<C,B>::try_send_request::{{closure}}
at <index.crates.io>/hyper-util-0.1.12/src/client/legacy/client.rs:281:14
24: hyper_util::client::legacy::client::Client<C,B>::send_request::{{closure}}
at <index.crates.io>/hyper-util-0.1.12/src/client/legacy/client.rs:249:70
25: <hyper_util::client::legacy::client::ResponseFuture as core::future::future::Future>::poll
at <index.crates.io>/hyper-util-0.1.12/src/client/legacy/client.rs:704:9
26: <F as futures_core::future::TryFuture>::try_poll
at <index.crates.io>/futures-core-0.3.30/src/future.rs:82:9
27: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/future/try_future/into_future.rs:34:9
28: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/future/future/map.rs:55:37
29: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/lib.rs:91:13
30: <futures_util::future::try_future::MapErr<Fut,F> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/lib.rs:91:13
31: <tower::util::map_err::MapErrFuture<F,N> as core::future::future::Future>::poll
at <index.crates.io>/tower-0.5.2/src/macros.rs:38:17
32: <tower_http::trace::future::ResponseFuture<Fut,C,OnResponseT,OnBodyChunkT,OnEosT,OnFailureT> as core::future::future::Future>::poll
at <index.crates.io>/tower-http-0.6.4/src/trace/future.rs:64:29
33: <tower::filter::future::AsyncResponseFuture<P,S,Request> as core::future::future::Future>::poll
at <index.crates.io>/tower-0.5.2/src/filter/future.rs:93:28
34: <tower::util::either::EitherResponseFuture<A,B> as core::future::future::Future>::poll
at <index.crates.io>/tower-0.5.2/src/util/either.rs:85:42
35: <tower::util::either::EitherResponseFuture<A,B> as core::future::future::Future>::poll
at <index.crates.io>/tower-0.5.2/src/util/either.rs:84:41
36: <tower_http::map_response_body::ResponseFuture<Fut,F> as core::future::future::Future>::poll
at <index.crates.io>/tower-http-0.6.4/src/map_response_body.rs:182:26
37: <core::pin::Pin<P> as core::future::future::Future>::poll
at <rust-lib>/core/src/future/future.rs:124:9
38: <tower_http::map_response_body::ResponseFuture<Fut,F> as core::future::future::Future>::poll
at <index.crates.io>/tower-http-0.6.4/src/map_response_body.rs:182:26
39: <F as futures_core::future::TryFuture>::try_poll
at <index.crates.io>/futures-core-0.3.30/src/future.rs:82:9
40: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/future/try_future/into_future.rs:34:9
41: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/future/future/map.rs:55:37
42: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/lib.rs:91:13
43: <futures_util::future::try_future::MapErr<Fut,F> as core::future::future::Future>::poll
at <index.crates.io>/futures-util-0.3.30/src/lib.rs:91:13
44: <tower::util::map_err::MapErrFuture<F,N> as core::future::future::Future>::poll
at <index.crates.io>/tower-0.5.2/src/macros.rs:38:17
45: <core::pin::Pin<P> as core::future::future::Future>::poll
at <rust-lib>/core/src/future/future.rs:124:9
46: <tower::buffer::future::ResponseFuture<F> as core::future::future::Future>::poll
at <index.crates.io>/tower-0.5.2/src/buffer/future.rs:75:59
47: kube_client::client::Client::send::{{closure}}
at <index.crates.io>/kube-client-1.0.0/src/client/mod.rs:213:14
48: kube_client::client::Client::request_text::{{closure}}
at <index.crates.io>/kube-client-1.0.0/src/client/mod.rs:285:54
49: kube_client::client::Client::request::{{closure}}
at <index.crates.io>/kube-client-1.0.0/src/client/mod.rs:274:47
50: kube_client::client::Client::apiserver_version::{{closure}}
at <index.crates.io>/kube-client-1.0.0/src/client/mod.rs:410:10
One option to resolve this is to switch off the tracing feature, but it's a bit hard to do in our case because we depend on hyper_utils via kube-client - kube crates. Plus it looks like the opposite direction of what we want to do (we want more tracing, not less).
The other option is to switch off the pooling. Not great as well.
I wonder if you have any ideas of how to maybe make it smarter than just .in_current_span()? Conceptually it seems to be mixing request lifecycle and connection lifecycle, which will always be different when pooling is involved.
Thank you!
Yea, I've seen that happen before (such as in tokio-console). I don't have an easy solution from a tracing perspective.
We could make calling in_current_span configurable on the TokioExecutor type, as one possible solution.
Thanks for the response! I see a bit more context for adding tracing in https://github.com/hyperium/hyper-util/pull/166.
In the discussion there, there was exactly this concern:
I think an issue when doing this automatically is that some tasks that are spawned live longer than the initial context. For example, if a new connection is created when trying to serve a request, and then that connection is put in a pool and used for later requests, the context is no longer the same.
Amusingly, kube-rs was mentioned as one of the crates that would benefit from this :)
I'm not 100% sure how to fix it, but adding in_current_span to connection-level futures does not seem to be the right way. Maybe we can store the span more explicitly in request-level structs/closures and activate it manually?
I wonder if @alpeb or @olix0r as original authors of https://github.com/hyperium/hyper-util/pull/166 have any ideas of how can we make it respect the request lifecycle?
Also for reference and more context, here's an issue in kube-rs: https://github.com/kube-rs/kube/issues/1775
We could make calling in_current_span configurable on the TokioExecutor type, as one possible solution.
Isn't this already the case via the tracing feature?
@ashtuchkin My understanding is that the only real solution for this would be for Hyper to integrate directly with tracing to ensure the task-context accounting is correct. [edit: i don't even think that's necessarily right -- there are probably bigger lifecycle issues to think through, see below]
Additionally: the connection itself probably should not have the request context associated with it, but instead only the connection-level metadata. There's a fundamental mismatch between the lifecycle of TraceLayer and the way these connections/pools are instantiated. It is also probably the case that this could be addressed in kube-rs by structuring the stack differently. (I'm not saying this is easy, just that it's in the solution-space).
the connection itself probably should not have the request context associated with it, but instead only the connection-level metadata.
Totally agree here. Alas the way it's written right now, connections implicitly clone & store the "current" span that was active at request time due to in_current_span construction.
There's a fundamental mismatch between the lifecycle of TraceLayer and the way these connections/pools are instantiated.
The way I see it is that TraceLayer works correctly here. It creates a tracing span at the beginning of request and makes sure it's dropped at the end of response. The fact that hyper Client unexpectedly holds on to that span and attaches it to a connection-level task which outlives request seems unexpected.
In fact, I struggle to see a use case where this arrangement is expected. Do we ever want to attach a span to a connection that can be shared between requests? Does it work for your use case? Is it because with HTTP/2 it's a different codepath (I haven't looked deep into it)?
It is also probably the case that this could be addressed in kube-rs by structuring the stack differently. (I'm not saying this is easy, just that it's in the solution-space).
I'm curious if you have any ideas how to do that!
I have been debugging an issue with our telemetry in a complicated internal microservice stack to eventually arrive at this issue. We have kube as a dependency for part of our cargo workspace and it enabled the hyper-util/tracing feature for the whole stack as a result. The end result is that various spans in our tracing inconsistently have their duration extended by multiple minutes if they happen to be the span that spawns a connection; it really messes with telemetry and application processing.
We could make calling
in_current_spanconfigurable on theTokioExecutortype, as one possible solution.
Note that this was / is already possible before https://github.com/hyperium/hyper-util/pull/166 was merged; the caller spawning the future onto the TokioExecutor could opt-in with .in_current_span() themselves.
IMO the only question is whether opt-in or opt-out is the better behaviour here. Previously it was opt-in, after https://github.com/hyperium/hyper-util/pull/166 it's opt-out for all callers and all callers need updating for the changed default. Probably any caller that spawns a long lived background task should be opting-out in case a downstream or sibling depdendency enables the hyper-util/tracing feature. (The best way to opt-out is to create a new span before calling TokioExecutor::execute with no parent, optionally using Span::follows_from to create a link to the current span without extending it.)
I started trying to update hyper-util to opt-out its own background tasks in https://github.com/hyperium/hyper-util/commit/0ccd65805f5b523b2e6afe19e912447c8d0e0769 but this was insufficient to fix my traces; I assume that any callers downstream of hyper-util might also be creating the same problem.
In the end the only thing I could practically do was apply a hyper-util patch to revert https://github.com/hyperium/hyper-util/pull/166
https://github.com/hyperium/hyper-util/commit/ffb1d0d9d0072566f806da312436062d07632540
It would be great to get this solved 🙏
My two cents - this feature seems like it's likely to cause more harm than good, perhaps it's worth considering reverting it and going back to leaving it to callers to opt-in where it's needed?