
opentelemetry-otlp async pipeline stuck after FramedWrite

Open · clux opened this issue 3 years ago · 9 comments

The following minimal otlp example, using the latest tracing, tracing-subscriber, tracing-opentelemetry, opentelemetry, and opentelemetry-otlp, gets stuck inside tonic or h2 when pushing spans in an async context:

use tracing_subscriber::layer::SubscriberExt;

type Error = Box<dyn std::error::Error + Send + Sync + 'static>;
type Result<T> = std::result::Result<T, Error>;

#[tracing::instrument]
async fn trace_test() ->  Result<()> {
    tracing::warn!("start");
    tracing::warn!("end! span: {:?}", tracing::Span::current());
    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    // k port-forward -n monitoring service/grafana-agent-traces 55680:55680
    let otlp_endpoint = std::env::var("OPENTELEMETRY_ENDPOINT_URL").unwrap_or("http://0.0.0.0:55680".to_string());
    let (tracer, _uninstall) = opentelemetry_otlp::new_pipeline()
        .with_endpoint(&otlp_endpoint)
        .install()?;

    // Register all subscribers
    let collector = tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .with(tracing_subscriber::fmt::layer());

    tracing::subscriber::set_global_default(collector)?;
    let r = trace_test().await?; // THIS HANGS INDEFINITELY
    tracing::info!("trace test returned: {:?}", r);

    Ok(())
}

which prints the following and hangs:

Mar 08 13:05:07.540  WARN trace_test: blah: start
Mar 08 13:05:07.540  WARN trace_test: blah: end! span: Span { name: "trace_test", level: Level(Info), target: "blah", id: Id(1), module_path: "blah", line: 6, file: "blah.rs" }
Mar 08 13:05:07.541 TRACE tower::buffer::service: sending request to buffer worker
Mar 08 13:05:07.541 TRACE tower::buffer::worker: worker polling for next message
Mar 08 13:05:07.541 TRACE tower::buffer::worker: processing new request
Mar 08 13:05:07.541 TRACE tower::buffer::worker: resumed=false worker received request; waiting for service readiness
Mar 08 13:05:07.541 TRACE tonic::transport::service::reconnect: poll_ready; idle
Mar 08 13:05:07.541 TRACE tonic::transport::service::reconnect: poll_ready; connecting
Mar 08 13:05:07.541 TRACE hyper::client::connect::http: Http::connect; scheme=Some("http"), host=Some("0.0.0.0"), port=Some(Port(55680))
Mar 08 13:05:07.541 DEBUG hyper::client::connect::http: connecting to 0.0.0.0:55680
Mar 08 13:05:07.541 TRACE tonic::transport::service::reconnect: poll_ready; not ready
Mar 08 13:05:07.541 TRACE tower::buffer::worker: service.ready=false delay
Mar 08 13:05:07.541 TRACE tower::buffer::worker: worker polling for next message
Mar 08 13:05:07.541 TRACE tower::buffer::worker: resuming buffered request
Mar 08 13:05:07.541 TRACE tower::buffer::worker: resumed=true worker received request; waiting for service readiness
Mar 08 13:05:07.541 TRACE tonic::transport::service::reconnect: poll_ready; connecting
Mar 08 13:05:07.541 DEBUG hyper::client::connect::http: connected to 0.0.0.0:55680
Mar 08 13:05:07.541 TRACE hyper::client::conn: client handshake Http2
Mar 08 13:05:07.541 DEBUG h2::client: binding client connection
Mar 08 13:05:07.542 DEBUG h2::client: client connection bound
Mar 08 13:05:07.542 DEBUG FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::codec::framed_write: send frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }
Mar 08 13:05:07.542 TRACE FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::frame::settings: encoding SETTINGS; len=18
Mar 08 13:05:07.542 TRACE FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::frame::settings: encoding setting; val=EnablePush(0)
Mar 08 13:05:07.542 TRACE FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::frame::settings: encoding setting; val=InitialWindowSize(2097152)
Mar 08 13:05:07.542 TRACE FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::frame::settings: encoding setting; val=MaxFrameSize(16384)
Mar 08 13:05:07.542 TRACE FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::codec::framed_write: encoded settings rem=27

deps:

tokio = { version = "1.2.0", features = ["full"] }
tracing = "0.1.25"
tracing-futures = "0.2.5"
tracing-subscriber = "0.2.16"
tracing-opentelemetry = "0.11.0"
opentelemetry = { version = "0.12.0", features = ["trace"] }
opentelemetry-otlp = { version = "0.5.0", features = ["async"] }

Not sure what exactly is going on here, but I have done the following debugging to pinpoint a few configurations that do work:

Bypassing opentelemetry_otlp

The entire pipeline works perfectly without opentelemetry_otlp by replacing it with an opentelemetry stdout pipeline:

diff --git blah.rs blah.rs
index dfb2cd1..529646b 100644
--- blah.rs
+++ blah.rs
@@ -14,9 +14,11 @@ async fn trace_test() ->  Result<()> {
 async fn main() -> Result<()> {
     // k port-forward -n monitoring service/grafana-agent-traces 55680:55680
     let otlp_endpoint = std::env::var("OPENTELEMETRY_ENDPOINT_URL").unwrap_or("http://0.0.0.0:55680".to_string());
-    let (tracer, _uninstall) = opentelemetry_otlp::new_pipeline()
-        .with_endpoint(&otlp_endpoint)
-        .install()?;
+    use opentelemetry::{sdk::export::trace::stdout, trace::Tracer};
+    let (tracer, _uninstall) = stdout::new_pipeline().install();
 
     // Register all subscribers
     let collector = tracing_subscriber::registry()

Spans print fine (albeit through two logging layers), and the program exits.

Bypassing a global subscriber

The following works end-to-end using opentelemetry-otlp, and shows that my endpoint (a port-forwarded grafana-agent-traces service, i.e. tempo) is able to receive the data normally (the spans show up in its interface after the program exits):

diff --git blah.rs blah.rs
index dfb2cd1..85ce7b3 100644
--- blah.rs
+++ blah.rs
@@ -4,7 +4,7 @@ type Error = Box<dyn std::error::Error + Send + Sync + 'static>;
 type Result<T> = std::result::Result<T, Error>;
 
 #[tracing::instrument]
-async fn trace_test() ->  Result<()> {
+fn trace_test() ->  Result<()> {
     tracing::warn!("start");
     tracing::warn!("end! span: {:?}", tracing::Span::current());
     Ok(())
@@ -24,15 +24,10 @@ async fn main() -> Result<()> {
         .with(tracing_subscriber::fmt::layer());
 
-    tracing::subscriber::set_global_default(collector)?;
-    let r = trace_test().await?;
-    tracing::info!("trace test returned: {:?}", r);
+    tracing::subscriber::with_default(collector, || {
+        let r = trace_test();
+        tracing::info!("trace test returned: {:?}", r);
+    });
 
     Ok(())
 }

but this only works in non-async contexts afaict, and looks awkward to integrate into a larger application.
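For what it's worth, the async analogue of this scoped approach would presumably go through the WithSubscriber extension trait from tracing-futures (already in the deps above), attaching the collector to a single future instead of installing it globally. A rough sketch, reusing the collector and the async trace_test from the original code; I have not verified whether it avoids the hang:

use tracing_futures::WithSubscriber;

// Scope the subscriber to this one future rather than setting it globally.
// Note: the exporter's own tonic/h2 spans may still be recorded inside this
// scope, so this could well hit the same cycle.
let r = trace_test().with_subscriber(collector).await?;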

So: am I doing something obviously wrong here? Why is opentelemetry_otlp getting stuck in an async context with a global subscriber, yet working perfectly fine in a non-async context using a scoped subscriber?

clux · Mar 08 '21 13:03

Btw, I have pushed my testing code for this issue to clux/tracing-otlp-test.

It has a few extra things in there relating to fetching trace ids, but they're not relevant to the issue at hand (they're just there to help me debug against tempo; the outcome is the same).

clux · Mar 08 '21 13:03

Ok, more testing. This does not happen with grpc-sys:

-opentelemetry-otlp = { version = "0.5.0", features = ["async", "tokio"] }
+opentelemetry-otlp = { version = "0.5.0", features = ["tokio", "grpc-sys", "openssl"], default-features=false }

This works: it sends spans to tempo, they are visible there, and the app doesn't hang. The only other modification needed was to the endpoint URL passed in:

-http://0.0.0.0:55680
+0.0.0.0:55680

clux · Mar 08 '21 16:03

Thanks for the detailed report and example! The example also seems to work if the span is not reported via tracing. @jtescher any thoughts?

TommyCpp · Mar 09 '21 02:03

Hm I can't actually reproduce the behavior. Both the code in the description and cloning and running https://github.com/clux/tracing-otlp-test/blob/master/blah.rs seem to run and exit normally. @clux anything specific needed to show the hanging behavior?

jtescher · Mar 09 '21 04:03

@jtescher I was able to reproduce the issue after I spun up an otel-collector container and used the example to connect to it.

TommyCpp · Mar 09 '21 04:03

Oh, I see the issue now (my problem was that I hadn't bound the ports properly when starting the otlp docker container).

So it looks like the issue is that this configuration is actually trying to trace itself. The cycle runs from the top-level tracing span -> otlp exporter -> tonic -> tonic's dependencies starting their own tracing spans, which are fed back into the exporter, so it recurses that way and hangs.

The internal tracing spans are at TRACE level, so a simple way to get this working again is to add a tracing subscriber EnvFilter so your subscriber becomes:

let collector = tracing_subscriber::registry()
    .with(tracing_opentelemetry::layer().with_tracer(tracer))
    .with(tracing_subscriber::EnvFilter::from("DEBUG"))
    .with(tracing_subscriber::fmt::layer());

We can likely come up with a better solution for detecting this case in tracing-opentelemetry, but for now that should get you up and running again.
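If DEBUG is too coarse for your application, a more targeted filter that only silences the crates involved in the export path (the ones visible in the log above) might look something like the following; untested, but the idea is the same:

let collector = tracing_subscriber::registry()
    .with(tracing_opentelemetry::layer().with_tracer(tracer))
    // Keep TRACE for application code, but turn off the crates the exporter
    // drives internally so their spans cannot feed back into the exporter.
    .with(tracing_subscriber::EnvFilter::new(
        "trace,h2=off,hyper=off,tonic=off,tower=off",
    ))
    .with(tracing_subscriber::fmt::layer());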

jtescher · Mar 09 '21 05:03

Ah, a tracing cycle. Interesting. The EnvFilter does indeed fix the minimal example, but it leaves my more complex setup broken.

I am thinking this might be a bit more insidious in a bigger example where dependencies are shared. In particular, my agent (manager) uses tonic deps like hyper (which could be useful to trace), along with a web server. With both of those running, the EnvFilter solution doesn't seem to work no matter what I do (and the app doesn't even respond to SIGTERM when using tonic).

If you are interested in the bigger example, keep in mind it requires a k8s cluster to talk to, plus installing a CRD before running. Otherwise I'm not sure how much I can really strip away, as it's already an example controller, but I could try to grow the original tracing-otlp-test repo instead so it doesn't require quite as much setup as controller-rs.

clux · Mar 09 '21 07:03

Just a small update, but no success this time: I tried to make the bug repo re-trigger the bug with an updated EnvFilter, because it is still present in controller-rs. I tried a few things, like including a hyper .get inside the traced async fn and spawning a long-running task outside it, but neither triggered the bug, so I haven't pushed anything new there. I guess there's something funky going on with the combination of a long-running task and actix in controller-rs.

Ultimately had to give up on it for now, and am stuck on grpcio for the time being.

clux · Mar 10 '21 00:03

Ok, I have finally managed to make it work with tonic now.

Still not sure why it's hanging in my more advanced app, but I have at least found out that it's related to the simple exporter (which I probably shouldn't use anyway), and I'm able to work around it without switching out tonic.

When I enabled the "tokio-support" feature on opentelemetry (0.12), it stopped hanging with tonic. From what I could gather from the older docs, this seems to force the batch exporter.
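For reference, the change amounts to something like this in the deps listed above (the opentelemetry-otlp line stays as it was):

-opentelemetry = { version = "0.12.0", features = ["trace"] }
+opentelemetry = { version = "0.12.0", features = ["trace", "tokio-support"] }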

clux avatar Mar 27 '21 21:03 clux