rustup icon indicating copy to clipboard operation
rustup copied to clipboard

chore(dist/features): ship `tracing` and friends by default

Open rami3l opened this issue 1 year ago • 7 comments

Part of #3790.

Rationale

Currently, helping out the Rustup team by enabling local tracing is quite a tedious process (esp. for community contributors), requiring rebuilding Rustup from the exact commit with an extra feature, otel:

https://github.com/rust-lang/rustup/blob/54dd3d00fd20e64dc522517a0d7be4285570a2f1/doc/dev-guide/src/tracing.md?plain=1#L13-L36

After some experiment, it turned out that we actually can ship the tracing features by default without forcing the user to face OTEL connection errors on a daily basis.

To clarify, this does not mean Rustup is setting up a central (a.k.a. phone-home-style) telemetry mechanism, and we will keep the tracing disabled by default unless RUST_LOG has been explicitly set.

Concerns

  • [x] Should we eliminate RUSTUP_DEBUG in favor of RUST_LOG=trace? (Yes.)
  • [x] ~~Should we remove opentelemetry while keeping tracing (https://github.com/rust-lang/rustup/pull/3788#issuecomment-2079165721)? If we should, the otel feature should be renamed.~~ (Not yet, see https://github.com/rust-lang/rustup/pull/3803#issuecomment-2092349333.)

rami3l avatar May 02 '24 11:05 rami3l

I think the artifact size changes are acceptable even if we keep opentelemetry:

BEFORE

image

AFTER

image

rami3l avatar May 02 '24 12:05 rami3l

I think we should use a basic console-based tracing-subscriber setup:

pub(super) fn subscribe() -> tracing::subscriber::DefaultGuard {
    let sub = tracing_subscriber::FmtSubscriber::builder()
        .with_max_level(tracing::Level::TRACE)
        .with_writer(|| TestWriter)
        .finish();
    tracing::subscriber::set_default(sub)
}

struct TestWriter;

impl Write for TestWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        print!(
            "{}",
            str::from_utf8(buf).expect("tried to log invalid UTF-8")
        );
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        io::stdout().flush()
    }
}

This is what I've been using in Quinn for many years. Every test just starts by calling let _guard = subscribe();, which has been a highly effective tool for test observability.

If we do this we can massively simplify the scaffolding for traces/logging:

  • Get rid of all the opentelemetry dependencies, which I don't think we need
  • Remove the use of the otel feature
  • Remove the custom test macro

djc avatar May 02 '24 12:05 djc

Currently, helping out the Rustup team by enabling local tracing is quite a tedious process (esp. for community contributors), requiring rebuilding Rustup from the exact commit with an extra feature, otel:

I don't think this is true. We've not asked people to build with otel enabled that I'm remembering. The OS level traces we use to debug fundamental problems are from strace / truss and other similar tools. Otel / tracing! is not deployed widely enough within rustup to be a replacement for such things.

rbtcollins avatar May 02 '24 19:05 rbtcollins

Get rid of all the opentelemetry dependencies, which I don't think we need Remove the use of the otel feature Remove the custom test macro

Please don't - while the OS level debugging is vital, for doing investigations on performance, having a nice report with spans and the detailed call tree is very useful, and since we configure it off by default it has very little overhead to maintain or build with. Really only the all-features-build test matters.

rbtcollins avatar May 02 '24 19:05 rbtcollins

Get rid of all the opentelemetry dependencies, which I don't think we need Remove the use of the otel feature Remove the custom test macro

Please don't - while the OS level debugging is vital, for doing investigations on performance, having a nice report with spans and the detailed call tree is very useful, and since we configure it off by default it has very little overhead to maintain or build with. Really only the all-features-build test matters.

How many times have you used it in the past year? IMO while custom test macro + opentelemetry dependencies may not impose run-time overhead for downstream users, it does impose significant maintenance overhead that may not be warranted for the additional insight compared to just tracing-subscriber built-in (and maybe tokio-console level output).

djc avatar May 02 '24 19:05 djc

Currently, helping out the Rustup team by enabling local tracing is quite a tedious process (esp. for community contributors), requiring rebuilding Rustup from the exact commit with an extra feature, otel:

I don't think this is true. We've not asked people to build with otel enabled that I'm remembering. The OS level traces we use to debug fundamental problems are from strace / truss and other similar tools. Otel / tracing! is not deployed widely enough within rustup to be a replacement for such things.

IMO the important point is that we should have a user-facing solution like "enable RUST_LOG=trace and give us the output of that", which seems like a decent method of getting better insight into problems that happen only in specific environments, which seems to be an important source of issues for rustup.

djc avatar May 02 '24 19:05 djc

I just checked and it looks like tokio-console doesn’t currently have a timeline view (https://github.com/tokio-rs/console/issues/129), so I imagine opentelemetry and jaeger are here to stay for longer...

For now I plan to:

  • Ship tracing by default with a console-based subscriber.
  • If possible, reimplement our current logging system using that subscriber (with a fmt::layer() that mimics the original output style).
  • Keep the opentelemetry related stuff behind the otel feature.

More specifically, I imagine having multiple subscribers (https://github.com/tokio-rs/tracing/issues/971) based on env vars and features:

  • A "classic" subscriber that targets process().stderr(), has a classic output format, will only print rustup log lines up to a certain level (info), and will be disabled if RUST_LOG is set.
  • A "tracing" subscriber that also targets process().stderr() and is not limited to rustup (so we could have tonic log lines as well, for example). Its activation is mutually exclusive with the "classic" subscriber, and the precise logging level will be controlled by RUST_LOG.
  • An OpenTelemetry subscriber (could be replaced by tokio-console in the future, but not just yet) available behind the otel feature.

Finally:

  • [x] We need to make sure this subscriber's use of CLI colors is coherent with the current system (incl. env variable controls via RUSTUP_TERM_COLOR, etc).
  • [x] With the consistent use of RUST_LOG, RUSTUP_DEBUG should be retired accordingly.
  • [x] Make tracing::instrument always run, i.e. no longer behind #[cfg_attr(feature = "otel", ..)] (split into #3873).

~~Waiting for #3367 might be worthwhile, since this will change the startup process.~~ Merged.

rami3l avatar May 03 '24 05:05 rami3l

#3367 has been merged, would be good to rebase this!

djc avatar May 30 '24 13:05 djc

The CI now works, but I have no idea how to properly inject the otel subscriber into tests, because it's a global state, but the new stderr subscriber is local (depends on process()), and there's seemingly no way for them to live together properly according to my knowledge of the tracing APIs.

@rbtcollins @djc Am I missing something, or does the current architecture need a greater change (e.g. even more asynchronization) to allow this to happen? I'm a bit confused...

PS: It should be noted that everything works perfectly fine if we only need the stderr logger but not the otel logger for tests.

Details

The status quo is that the only subscriber config we have (i.e. the otel one) remains the same across all tests, so we're using a static item, and we register it with tracing::subscriber::set_global_default():

https://github.com/rust-lang/rustup/blob/3ba08da98221b2941aeb858e54cebc059859ffb7/src/test.rs#L235-L237 https://github.com/rust-lang/rustup/blob/3ba08da98221b2941aeb858e54cebc059859ffb7/src/test.rs#L275-L276 https://github.com/rust-lang/rustup/blob/3ba08da98221b2941aeb858e54cebc059859ffb7/src/test.rs#L280-L308

However now the new stderr logger (and thus the entire subscriber layer) relies on process() in order to attach to stderr properly, so the old approach won't just work.

To make things worse, in before_test() and before_test_async() the process() context is not yet available, so one possibility would be re-injecting subscriber(s) locally (instead of globally like before) within functions like with() and with_runtime(). This unfortunately will make the previously-decoupled currentprocess and test modules couple:

https://github.com/rust-lang/rustup/blob/6d54875d2c8acc7c52727c18feea16f396a1f1b7/src/currentprocess.rs#L147-L150 https://github.com/rust-lang/rustup/blob/6d54875d2c8acc7c52727c18feea16f396a1f1b7/src/currentprocess.rs#L252

This is probably not ideal for two reasons:

  • with*() look like universal functions that are used not only for tests (are they?), meaning I cannot depend on static TRACER in this new implementation. However, we must still be able to flush the TRACER at the end of the function when we are actually in the tests. https://github.com/rust-lang/rustup/blob/3ba08da98221b2941aeb858e54cebc059859ffb7/src/test.rs#L294-L308 However, if we do intend to use with* for tests only, then exposing and using TRACER within them don't look like that much of an issue.
  • 3ba08da98221b2941aeb858e54cebc059859ffb7 has experimented with the stderr logger and everything works. However once the local otel logger comes into play I start to get errors within with() that indicates the lack of a Tokio context:
    thread 'suite::cli_exact::cross_install_indicates_target' panicked at /Users/rami3l/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tonic-0.11.0/src/transport/service/executor.rs:15:9:                                                                                                             
    there is no reactor running, must be called from the context of a Tokio 1.x runtime                                                                            
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    

I wish I could have a way of appending to rather than overriding the current subscriber.

rami3l avatar Jun 02 '24 05:06 rami3l

How many times have you used it in the past year? IMO while custom test macro + opentelemetry dependencies may not impose run-time overhead for downstream users, it does impose significant maintenance overhead that may not be warranted for the additional insight compared to just tracing-subscriber built-in (and maybe tokio-console level output).

I have been largely absent for the last year, so perhaps a better thing to ask is 'of the times you've been investigating performance, how often have you used detailed call graph with opentelemetry' - and the answer there is every single time. I depend on it.

IMO the important point is that we should have a user-facing solution like "enable RUST_LOG=trace and give us the output of that", which seems like a decent method of getting better insight into problems that happen only in specific environments, which seems to be an important source of issues for rustup.

I think that that is important too, for sure.

rbtcollins avatar Jun 03 '24 09:06 rbtcollins

Regarding https://github.com/rust-lang/rustup/pull/3803#issuecomment-2143703392, I think a reasonable solution would be making all unit tests run under #[tokio::test], so that we can stick to a new local subscriber all the time.

This prevents https://docs.rs/opentelemetry-otlp/latest/opentelemetry_otlp/struct.OtlpTracePipeline.html#method.install_batch from complaining about not being executed in a tokio context, and allows the removal of static TRACER and TRACE_RUNTIME altogether. The perf penalty here should be minimal.

I'll split that change into a preparatory PR (#3868).

rami3l avatar Jun 11 '24 14:06 rami3l

@djc The new approach proposed in https://github.com/rust-lang/rustup/pull/3871#pullrequestreview-2115593544 has worked! This PR has been thus unblocked from test refactoring and I'll just need to do some rather simple tweaks (see https://github.com/rust-lang/rustup/pull/3803#issuecomment-2092349333 for the list of remaining tasks) to get it ready 🎉

PS: I've also changed the signature of tracing_subscriber() to accept a Process in order to align with #3871 better.

rami3l avatar Jun 14 '24 06:06 rami3l

@rbtcollins @djc I've just finished all 3 cleanup tasks, unfortunately however the 3rd (the final) commit (https://github.com/rust-lang/rustup/pull/3803/commits/0cf0c6bdd35f9bf8076d78627eab0ee6f01e3a88) has made our CI red. To make sure that only this commit was bad though, I've reverted it temporarily in https://github.com/rust-lang/rustup/pull/3803/commits/1b7662e9890524ab3a795696aee226eea4a499d2, and indeed the CI is green again.

I don't see how a mechanical change like https://github.com/rust-lang/rustup/pull/3803/commits/0cf0c6bdd35f9bf8076d78627eab0ee6f01e3a88 can cause Rustup to stack overflow on Windows (https://github.com/rust-lang/rustup/actions/runs/9515433133/job/26229657328)... Am I missing something?

rami3l avatar Jun 14 '24 11:06 rami3l

Maybe try a branch where you apply https://github.com/rust-lang/rustup/commit/0cf0c6bdd35f9bf8076d78627eab0ee6f01e3a88 to 1.27.1 directly (or some other commit maybe from before #3868) to see if that works? It might be some "interesting" interaction that causes recursion in an unobvious way?

djc avatar Jun 14 '24 13:06 djc

Maybe try a branch where you apply 0cf0c6b to 1.27.1 directly (or some other commit maybe from before #3868) to see if that works? It might be some "interesting" interaction that causes recursion in an unobvious way?

@djc That's a good idea! I'll give it a try tomorrow.

That being said, the overall tracing thing is working pretty well, so it might make some sense to leave https://github.com/rust-lang/rustup/commit/0cf0c6bdd35f9bf8076d78627eab0ee6f01e3a88 for another PR. We'll have to test it against another branch via a PR anyways.

rami3l avatar Jun 14 '24 14:06 rami3l

That being said, the overall tracing thing is working pretty well, so it might make some sense to leave 0cf0c6b for another PR.

Oh yes, I'm in favor of merging without it!

djc avatar Jun 14 '24 14:06 djc