chore(dist/features): ship `tracing` and friends by default
Part of #3790.
Rationale
Currently, helping out the Rustup team by enabling local tracing is quite a tedious process (esp. for community contributors), requiring rebuilding Rustup from the exact commit with an extra feature, otel:
https://github.com/rust-lang/rustup/blob/54dd3d00fd20e64dc522517a0d7be4285570a2f1/doc/dev-guide/src/tracing.md?plain=1#L13-L36
After some experiment, it turned out that we actually can ship the tracing features by default without forcing the user to face OTEL connection errors on a daily basis.
To clarify, this does not mean Rustup is setting up a central (a.k.a. phone-home-style) telemetry mechanism, and we will keep the tracing disabled by default unless RUST_LOG has been explicitly set.
Concerns
- [x] Should we eliminate
RUSTUP_DEBUGin favor ofRUST_LOG=trace? (Yes.) - [x] ~~Should we remove
opentelemetrywhile keepingtracing(https://github.com/rust-lang/rustup/pull/3788#issuecomment-2079165721)? If we should, theotelfeature should be renamed.~~ (Not yet, see https://github.com/rust-lang/rustup/pull/3803#issuecomment-2092349333.)
I think the artifact size changes are acceptable even if we keep opentelemetry:
BEFORE
AFTER
I think we should use a basic console-based tracing-subscriber setup:
pub(super) fn subscribe() -> tracing::subscriber::DefaultGuard {
let sub = tracing_subscriber::FmtSubscriber::builder()
.with_max_level(tracing::Level::TRACE)
.with_writer(|| TestWriter)
.finish();
tracing::subscriber::set_default(sub)
}
struct TestWriter;
impl Write for TestWriter {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
print!(
"{}",
str::from_utf8(buf).expect("tried to log invalid UTF-8")
);
Ok(buf.len())
}
fn flush(&mut self) -> io::Result<()> {
io::stdout().flush()
}
}
This is what I've been using in Quinn for many years. Every test just starts by calling let _guard = subscribe();, which has been a highly effective tool for test observability.
If we do this we can massively simplify the scaffolding for traces/logging:
- Get rid of all the opentelemetry dependencies, which I don't think we need
- Remove the use of the
otelfeature - Remove the custom
testmacro
Currently, helping out the Rustup team by enabling local tracing is quite a tedious process (esp. for community contributors), requiring rebuilding Rustup from the exact commit with an extra feature, otel:
I don't think this is true. We've not asked people to build with otel enabled that I'm remembering. The OS level traces we use to debug fundamental problems are from strace / truss and other similar tools. Otel / tracing! is not deployed widely enough within rustup to be a replacement for such things.
Get rid of all the opentelemetry dependencies, which I don't think we need Remove the use of the otel feature Remove the custom test macro
Please don't - while the OS level debugging is vital, for doing investigations on performance, having a nice report with spans and the detailed call tree is very useful, and since we configure it off by default it has very little overhead to maintain or build with. Really only the all-features-build test matters.
Get rid of all the opentelemetry dependencies, which I don't think we need Remove the use of the otel feature Remove the custom test macro
Please don't - while the OS level debugging is vital, for doing investigations on performance, having a nice report with spans and the detailed call tree is very useful, and since we configure it off by default it has very little overhead to maintain or build with. Really only the all-features-build test matters.
How many times have you used it in the past year? IMO while custom test macro + opentelemetry dependencies may not impose run-time overhead for downstream users, it does impose significant maintenance overhead that may not be warranted for the additional insight compared to just tracing-subscriber built-in (and maybe tokio-console level output).
Currently, helping out the Rustup team by enabling local tracing is quite a tedious process (esp. for community contributors), requiring rebuilding Rustup from the exact commit with an extra feature, otel:
I don't think this is true. We've not asked people to build with otel enabled that I'm remembering. The OS level traces we use to debug fundamental problems are from strace / truss and other similar tools. Otel / tracing! is not deployed widely enough within rustup to be a replacement for such things.
IMO the important point is that we should have a user-facing solution like "enable RUST_LOG=trace and give us the output of that", which seems like a decent method of getting better insight into problems that happen only in specific environments, which seems to be an important source of issues for rustup.
I just checked and it looks like tokio-console doesn’t currently have a timeline view (https://github.com/tokio-rs/console/issues/129), so I imagine opentelemetry and jaeger are here to stay for longer...
For now I plan to:
- Ship
tracingby default with a console-based subscriber. - If possible, reimplement our current logging system using that subscriber (with a
fmt::layer()that mimics the original output style). - Keep the
opentelemetryrelated stuff behind theotelfeature.
More specifically, I imagine having multiple subscribers (https://github.com/tokio-rs/tracing/issues/971) based on env vars and features:
- A "classic" subscriber that targets
process().stderr(), has a classic output format, will only printrustuplog lines up to a certain level (info), and will be disabled ifRUST_LOGis set. - A "tracing" subscriber that also targets
process().stderr()and is not limited torustup(so we could havetoniclog lines as well, for example). Its activation is mutually exclusive with the "classic" subscriber, and the precise logging level will be controlled byRUST_LOG. - An OpenTelemetry subscriber (could be replaced by
tokio-consolein the future, but not just yet) available behind theotelfeature.
Finally:
- [x] We need to make sure this subscriber's use of CLI colors is coherent with the current system (incl. env variable controls via
RUSTUP_TERM_COLOR, etc). - [x] With the consistent use of
RUST_LOG,RUSTUP_DEBUGshould be retired accordingly. - [x] Make
tracing::instrumentalways run, i.e. no longer behind#[cfg_attr(feature = "otel", ..)](split into #3873).
~~Waiting for #3367 might be worthwhile, since this will change the startup process.~~ Merged.
#3367 has been merged, would be good to rebase this!
The CI now works, but I have no idea how to properly inject the otel subscriber into tests, because it's a global state, but the new stderr subscriber is local (depends on process()), and there's seemingly no way for them to live together properly according to my knowledge of the tracing APIs.
@rbtcollins @djc Am I missing something, or does the current architecture need a greater change (e.g. even more asynchronization) to allow this to happen? I'm a bit confused...
PS: It should be noted that everything works perfectly fine if we only need the stderr logger but not the otel logger for tests.
Details
The status quo is that the only subscriber config we have (i.e. the otel one) remains the same across all tests, so we're using a static item, and we register it with tracing::subscriber::set_global_default():
https://github.com/rust-lang/rustup/blob/3ba08da98221b2941aeb858e54cebc059859ffb7/src/test.rs#L235-L237 https://github.com/rust-lang/rustup/blob/3ba08da98221b2941aeb858e54cebc059859ffb7/src/test.rs#L275-L276 https://github.com/rust-lang/rustup/blob/3ba08da98221b2941aeb858e54cebc059859ffb7/src/test.rs#L280-L308
However now the new stderr logger (and thus the entire subscriber layer) relies on process() in order to attach to stderr properly, so the old approach won't just work.
To make things worse, in before_test() and before_test_async() the process() context is not yet available, so one possibility would be re-injecting subscriber(s) locally (instead of globally like before) within functions like with() and with_runtime(). This unfortunately will make the previously-decoupled currentprocess and test modules couple:
https://github.com/rust-lang/rustup/blob/6d54875d2c8acc7c52727c18feea16f396a1f1b7/src/currentprocess.rs#L147-L150 https://github.com/rust-lang/rustup/blob/6d54875d2c8acc7c52727c18feea16f396a1f1b7/src/currentprocess.rs#L252
This is probably not ideal for two reasons:
with*()look like universal functions that are used not only for tests (are they?), meaning I cannot depend onstatic TRACERin this new implementation. However, we must still be able to flush theTRACERat the end of the function when we are actually in the tests. https://github.com/rust-lang/rustup/blob/3ba08da98221b2941aeb858e54cebc059859ffb7/src/test.rs#L294-L308 However, if we do intend to usewith*for tests only, then exposing and usingTRACERwithin them don't look like that much of an issue.- 3ba08da98221b2941aeb858e54cebc059859ffb7 has experimented with the
stderrlogger and everything works. However once the localotellogger comes into play I start to get errors withinwith()that indicates the lack of a Tokio context:thread 'suite::cli_exact::cross_install_indicates_target' panicked at /Users/rami3l/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tonic-0.11.0/src/transport/service/executor.rs:15:9: there is no reactor running, must be called from the context of a Tokio 1.x runtime note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I wish I could have a way of appending to rather than overriding the current subscriber.
How many times have you used it in the past year? IMO while custom test macro + opentelemetry dependencies may not impose run-time overhead for downstream users, it does impose significant maintenance overhead that may not be warranted for the additional insight compared to just tracing-subscriber built-in (and maybe tokio-console level output).
I have been largely absent for the last year, so perhaps a better thing to ask is 'of the times you've been investigating performance, how often have you used detailed call graph with opentelemetry' - and the answer there is every single time. I depend on it.
IMO the important point is that we should have a user-facing solution like "enable RUST_LOG=trace and give us the output of that", which seems like a decent method of getting better insight into problems that happen only in specific environments, which seems to be an important source of issues for rustup.
I think that that is important too, for sure.
Regarding https://github.com/rust-lang/rustup/pull/3803#issuecomment-2143703392, I think a reasonable solution would be making all unit tests run under #[tokio::test], so that we can stick to a new local subscriber all the time.
This prevents https://docs.rs/opentelemetry-otlp/latest/opentelemetry_otlp/struct.OtlpTracePipeline.html#method.install_batch from complaining about not being executed in a tokio context, and allows the removal of static TRACER and TRACE_RUNTIME altogether. The perf penalty here should be minimal.
I'll split that change into a preparatory PR (#3868).
@djc The new approach proposed in https://github.com/rust-lang/rustup/pull/3871#pullrequestreview-2115593544 has worked! This PR has been thus unblocked from test refactoring and I'll just need to do some rather simple tweaks (see https://github.com/rust-lang/rustup/pull/3803#issuecomment-2092349333 for the list of remaining tasks) to get it ready 🎉
PS: I've also changed the signature of tracing_subscriber() to accept a Process in order to align with #3871 better.
@rbtcollins @djc I've just finished all 3 cleanup tasks, unfortunately however the 3rd (the final) commit (https://github.com/rust-lang/rustup/pull/3803/commits/0cf0c6bdd35f9bf8076d78627eab0ee6f01e3a88) has made our CI red. To make sure that only this commit was bad though, I've reverted it temporarily in https://github.com/rust-lang/rustup/pull/3803/commits/1b7662e9890524ab3a795696aee226eea4a499d2, and indeed the CI is green again.
I don't see how a mechanical change like https://github.com/rust-lang/rustup/pull/3803/commits/0cf0c6bdd35f9bf8076d78627eab0ee6f01e3a88 can cause Rustup to stack overflow on Windows (https://github.com/rust-lang/rustup/actions/runs/9515433133/job/26229657328)... Am I missing something?
Maybe try a branch where you apply https://github.com/rust-lang/rustup/commit/0cf0c6bdd35f9bf8076d78627eab0ee6f01e3a88 to 1.27.1 directly (or some other commit maybe from before #3868) to see if that works? It might be some "interesting" interaction that causes recursion in an unobvious way?
Maybe try a branch where you apply 0cf0c6b to 1.27.1 directly (or some other commit maybe from before #3868) to see if that works? It might be some "interesting" interaction that causes recursion in an unobvious way?
@djc That's a good idea! I'll give it a try tomorrow.
That being said, the overall tracing thing is working pretty well, so it might make some sense to leave https://github.com/rust-lang/rustup/commit/0cf0c6bdd35f9bf8076d78627eab0ee6f01e3a88 for another PR. We'll have to test it against another branch via a PR anyways.
That being said, the overall
tracingthing is working pretty well, so it might make some sense to leave 0cf0c6b for another PR.
Oh yes, I'm in favor of merging without it!