
Shutting down the global tracer hangs

Open dnaka91 opened this issue 3 years ago • 14 comments

I need to be able to swap out the global tracer instance at runtime, because tracing should be possible to enable/disable (or point at a different endpoint URL) without restarting the application.

I'm using opentelemetry-otlp for trace exporting, as the opentelemetry-jaeger crate is just too broken. When I try to shut down the global tracer or swap it out, the call just blocks forever on the internal Mutex.

I've seen several closed issues about this topic, but it seems this problem is not fully resolved yet, as it still hangs for me.

dnaka91 avatar Aug 28 '22 03:08 dnaka91

Could you provide more information on

  • which runtime are you using?
  • are you using BatchSpanProcessor?

Also, could you expand on what you are looking to achieve? If you want to adjust how many spans are sent to the backend, a Sampler may be a better choice than swapping the global tracer instance.
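For context, by Sampler I mean something along these lines (just a rough sketch of the SDK trace config; the resulting Config would then be passed to the pipeline via with_trace_config):

use opentelemetry::sdk::trace::{self, Sampler};

// Sketch: keep roughly 10% of traces instead of swapping the global tracer.
// Sampler::AlwaysOff would drop all of them.
fn sampled_config() -> trace::Config {
    trace::config().with_sampler(Sampler::TraceIdRatioBased(0.1))
}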

TommyCpp avatar Aug 28 '22 04:08 TommyCpp

Thanks for the quick reply. And sorry if I struck a bit of a spicy tone in the ticket description; I've sadly run into several pain points while using the opentelemetry crates recently.

which runtime are you using?

Tokio runtime.

are you using BatchSpanProcessor?

Yes.

The sampler wouldn't really help me here. The application is a rather simple web server that works out of the box with zero configuration. Several global options can later be activated and customized, and that will soon include a tracing collector. That means the admin can change the endpoint at runtime, and in that case the URL must either be updated or the whole tracing client must be disabled (when the URL is empty).

If I just left the global tracer set and, say, filtered out all traces instead, it would still keep connecting to the tracing collector, right? Therefore, I'd like to fully shut it down when the tracing URL is removed.

dnaka91 avatar Aug 28 '22 17:08 dnaka91

have run into several pain points recently while using the opentelemetry crates, sadly.

Sorry to hear that. I will try to help as much as I can.

Regarding your problem, I think the Tracer is not the right thing to replace. Note that a Tracer is just a "span creator": it doesn't hold any information about where or how to export. Instead, a Tracer holds a weak pointer to its TracerProvider. We do have a global TracerProvider instance, and you can replace it using the set_tracer_provider method.

I will see if I can write an example over the weekend but feel free to try it yourself.
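Roughly, I have something like this in mind (just a sketch, not tested; the builder details and how you construct the exporter may differ between versions):

use opentelemetry::global;
use opentelemetry::sdk::export::trace::SpanExporter;
use opentelemetry::sdk::trace::TracerProvider;

// Build a fresh provider around whichever exporter you need and swap it in
// for the global one. set_tracer_provider returns the previous provider, so
// it can be flushed/shut down separately.
fn replace_global_provider<E: SpanExporter + 'static>(exporter: E) {
    let provider = TracerProvider::builder()
        .with_simple_exporter(exporter)
        .build();

    let _previous = global::set_tracer_provider(provider);
}

Tracers obtained afterwards via global::tracer(...) will then use the new provider.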

TommyCpp avatar Sep 02 '22 07:09 TommyCpp

That's what I'm using, I think. I'm not using opentelemetry directly but through the tracing-opentelemetry crate, so there are a few more layers of indirection.

For initialization I use the following code, which returns a tracer wrapped in a tracing Layer for use in the tracing stack. Internally, this calls set_tracer_provider:


fn init_tracing<S>(otlp_endpoint: String) -> Result<impl Layer<S>>
where
    for<'span> S: Subscriber + LookupSpan<'span>,
{
    opentelemetry::global::set_error_handler(|error| {
        error!(target: "opentelemetry", ?error);
    })?;

    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        .with_trace_config(trace::config().with_resource(Resource::new([
            resource::SERVICE_NAME.string(env!("CARGO_CRATE_NAME")),
            resource::SERVICE_VERSION.string(env!("CARGO_PKG_VERSION")),
        ])))
        .install_batch(runtime::Tokio)?;

    Ok(tracing_opentelemetry::layer().with_tracer(tracer))
}

Then, to disable it again, I use the following code, which hangs in shutdown_tracer_provider:

fn disable_tracing() -> Result<()> {
    opentelemetry::global::set_error_handler(|_| {})?;
    opentelemetry::global::shutdown_tracer_provider();
    Ok(())
}

This is all pulled together in the whole tracing setup, where I use a reload layer to dynamically enable/disable the telemetry endpoint. The TracingToggle instance is later wrapped in an Arc and shared as web app state:


pub struct TracingToggle {
    enable: Box<dyn Fn(String) -> Result<()> + Send + Sync + 'static>,
    disable: Box<dyn Fn() -> Result<()> + Send + Sync + 'static>,
}

#[allow(clippy::missing_errors_doc)]
impl TracingToggle {
    pub fn enable(&self, otlp_endpoint: String) -> Result<()> {
        (self.enable)(otlp_endpoint)
    }

    pub fn disable(&self) -> Result<()> {
        (self.disable)()
    }
}

fn init_logging(otlp_endpoint: Option<String>) -> Result<TracingToggle> {
    let opentelemetry = otlp_endpoint.map(init_tracing).transpose()?;
    let (opentelemetry, handle) = reload::Layer::new(opentelemetry);
    let handle2 = handle.clone();

    let enable = move |endpoint: String| {
        let layer = init_tracing(endpoint)?;
        handle2.reload(Some(layer))?;
        anyhow::Ok(())
    };

    let disable = move || {
        disable_tracing()?;
        handle.reload(None)?;
        anyhow::Ok(())
    };

    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .with(opentelemetry)
        .with(
            Targets::new()
                .with_target(env!("CARGO_PKG_NAME"), Level::TRACE)
                .with_target("tower_http", Level::TRACE)
                .with_default(Level::INFO),
        )
        .init();

    Ok(TracingToggle {
        enable: Box::new(enable),
        disable: Box::new(disable),
    })
}

dnaka91 avatar Sep 02 '22 08:09 dnaka91

Sorry for the late response. Based on your comment, I imagine you are changing the tracer provider from a spawned task. That can cause issues, because shutdown_tracer_provider/set_tracer_provider involve blocking operations, so they may need to run on a thread that allows blocking, i.e. via spawn_blocking.

The reason it blocks is that, per the spec, when a tracer provider shuts down it must call the shutdown function on all of its span processors, which in turn invoke forceFlush to export any remaining spans.
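With the setup from the earlier comment, that would mean wrapping disable_tracing roughly like this (a sketch, assuming the Tokio runtime that is already in use):

// Run the blocking shutdown on Tokio's blocking thread pool instead of on an
// async worker thread.
async fn disable_tracing_async() -> anyhow::Result<()> {
    tokio::task::spawn_blocking(|| {
        // Blocks while the span processors flush any remaining spans.
        opentelemetry::global::shutdown_tracer_provider();
    })
    .await?;
    Ok(())
}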

TommyCpp avatar Sep 18 '22 21:09 TommyCpp

I have a similar issue, where I'm calling global::shutdown_tracer_provider() from a panic hook, though I don't know why it would hang there.
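Roughly like this (a minimal sketch of my setup; install_panic_hook is just an illustrative name):

fn install_panic_hook() {
    let default_hook = std::panic::take_hook();
    std::panic::set_hook(Box::new(move |info| {
        default_hook(info);
        // Try to flush traces before the process dies; this is where it hangs.
        opentelemetry::global::shutdown_tracer_provider();
    }));
}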

blueforesticarus avatar Mar 17 '23 00:03 blueforesticarus

So this is still an issue for me. For my own small projects, I moved away from OpenTelemetry and built my own OTLP collector that doesn't need as much memory (OpenTelemetry seems to cache a lot, and overall it caused a multiple of the memory usage compared to before I added it).

But in larger projects this is still an issue. Even a simple test case that sets up the tracer and immediately shuts it down hangs.
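Something along these lines is enough to reproduce it for me (a rough sketch from memory; no endpoint configured, so the exporter's default is used):

// Set up the tracer, then shut it down immediately.
#[tokio::test]
async fn shutdown_hangs() {
    let _tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic())
        .install_batch(opentelemetry::runtime::Tokio)
        .unwrap();

    // Hangs here.
    opentelemetry::global::shutdown_tracer_provider();
}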

If it needs to be run on a separate thread, IMO that should be done internally; such an essential operation should not fail or hang by default.

dnaka91 avatar May 23 '23 04:05 dnaka91

I think this is the same issue I'm having.

I'm experiencing a problem when using a crate similar to test-log. Logging/telemetry gets initialized for a test, the test is run, then the logging guard is dropped. If the test fails with a panic, it can cause the span's export thread to hang because it blocks on a future that never progresses. That means the processor's shutdown can't complete, which causes the global shutdown to hang when the processor's Drop impl gets triggered.

I'm not sure what the best solution is, but replacing the block_on call here with a spawn_local() followed by run_until_stalled() on a LocalPool prevents the thread from locking up.
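For reference, this is the executor pattern I mean (an illustration using the futures crate, not an actual patch; drive_until_stalled is just an illustrative name):

use futures::executor::LocalPool;
use futures::task::LocalSpawnExt;

// Spawn the export future locally and drive it only as far as it can make
// progress, instead of blocking the thread on it indefinitely.
fn drive_until_stalled<F: std::future::Future<Output = ()> + 'static>(fut: F) {
    let mut pool = LocalPool::new();
    pool.spawner()
        .spawn_local(fut)
        .expect("failed to spawn export future");
    // Returns as soon as no spawned future can make further progress, so a
    // stuck future no longer wedges the thread.
    pool.run_until_stalled();
}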

jsoverson avatar Aug 04 '23 20:08 jsoverson

I'm seeing that the GLOBAL_TRACER_PROVIDER is using a std::sync::RwLock to manage access:

https://github.com/open-telemetry/opentelemetry-rust/blob/dd4c13bd69ca4b24d5a8f21024a466fbb35cdd14/opentelemetry/src/global/trace.rs#L356-L358

I was a bit surprised to see an RwLock here, as outstanding readers can block a request for the write lock. The RwLock::write() docs say:

This function will not return while other writers or other readers currently have access to the lock.

I'm wondering if RwLock might be a less-than-ideal lock type here, at least at shutdown time (I assume it's very likely that there are outstanding RwLockReadGuards in the system)?
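A toy illustration of the concern (plain std, nothing OpenTelemetry-specific):

use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(0u32));

    let reader = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let _guard = lock.read().unwrap();
            // Stand-in for an outstanding reader holding its guard.
            thread::sleep(Duration::from_secs(5));
        })
    };

    thread::sleep(Duration::from_millis(100));
    // Cannot return until the reader above drops its guard.
    let mut value = lock.write().unwrap();
    *value += 1;
    drop(value);
    reader.join().unwrap();
}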

fnichol avatar Feb 02 '24 21:02 fnichol

Yeah, no matter what I do I can't seem to get a call to global::shutdown_tracer_provider to complete.

ShaneMurphy2 avatar Feb 14 '24 18:02 ShaneMurphy2

Seems related https://github.com/open-telemetry/opentelemetry-rust/issues/1143

hdost avatar Feb 21 '24 10:02 hdost