opentelemetry-rust
Closing and exporting all spans on panic (when panic=abort)?
I'm using tracing-opentelemetry. From what I understand, spans rely on Drop implementations to close, and until the span is closed it won't be shipped off. I'm currently using panic=abort for various reasons. I'd love to have a global method that I can call from a panic handler that will close and export all spans before exiting, to make sure enough tracing information has been sent out to debug the panic.
This could be challenging. The `Span` holds a weak reference to `TracerProvider` via `Tracer`, but not the other way around, so it's going to be hard to find all spans and export them.
Besides the spans that are still in flight, you also need to export the ended spans cached in `BatchSpanProcessor`. So it may make sense to call `force_flush` on the `TracerProvider` in the panic handler.
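A minimal sketch of the "force flush in the panic handler" idea, using only the standard library so it runs on its own. `Provider`, `record_ended_span`, `global_provider` and `install_flush_on_panic` are hypothetical stand-ins: in a real program the global would hold the `opentelemetry_sdk` tracer provider and the hook would call its `force_flush` (whose return type differs between SDK versions, and which may block depending on the exporter and runtime). With `panic=abort` the panic hook still runs before the process aborts, so it is a reasonable place to flush; note that this only covers ended spans waiting in the batch buffer, not spans that are still open.

```rust
use std::panic;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};

/// Hypothetical stand-in for the SDK tracer provider: it only buffers the
/// names of ended spans, like a batch processor waiting to export them.
pub struct Provider {
    buffered: Mutex<Vec<String>>,
    exported: AtomicUsize,
}

impl Provider {
    pub fn new() -> Self {
        Provider { buffered: Mutex::new(Vec::new()), exported: AtomicUsize::new(0) }
    }

    /// Called when a span ends; the span waits in the buffer until a flush.
    pub fn record_ended_span(&self, name: &str) {
        self.buffered.lock().unwrap_or_else(|p| p.into_inner()).push(name.to_string());
    }

    /// Drain the buffer. A real exporter would send the spans out;
    /// this sketch only counts them.
    pub fn force_flush(&self) -> usize {
        // Recover from a poisoned lock: we may be running inside a panic hook.
        let mut buffered = self.buffered.lock().unwrap_or_else(|p| p.into_inner());
        let n = buffered.len();
        buffered.clear();
        self.exported.fetch_add(n, Ordering::SeqCst);
        n
    }

    pub fn exported_count(&self) -> usize {
        self.exported.load(Ordering::SeqCst)
    }
}

static PROVIDER: OnceLock<Provider> = OnceLock::new();

/// Hypothetical global accessor, initialised on first use.
pub fn global_provider() -> &'static Provider {
    PROVIDER.get_or_init(Provider::new)
}

/// Flush before running the previously installed hook (by default the one
/// that prints the panic message). With panic=abort the hook still runs and
/// the process aborts right after it returns.
pub fn install_flush_on_panic() {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        global_provider().force_flush();
        previous(info);
    }));
}
```

Call `install_flush_on_panic()` once at startup, after the provider exists. Chaining to the previous hook keeps the default panic message on stderr.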
Logging active spans on crash, and logging spans that start but never end, are pretty important requirements for a tracing/logging system.
> This could be challenging. The `Span` holds a weak reference to `TracerProvider` via `Tracer`. But not the other way around.
This seems like there is a design issue somewhere.
> Logging active spans on crash, and logging spans that start but never end, are pretty important requirements for a tracing/logging system.
I am not able to find any other OpenTelemetry implementation that does this automatically. For example, in OpenTelemetry .NET, this doc describes one possible way for users to achieve this, but the SDK does not do it automatically: https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/trace/reporting-exceptions#unhandled-exception Even this suggestion only finishes/exports the span in that context; spans in other threads/contexts remain unstopped and unexported.
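A rough Rust analogue of that .NET suggestion, sketched with the standard library only and with hypothetical names (`OPEN`, `ENDED`, `start_span`, `install_end_current_context_on_panic`): open spans are tracked per thread, and the panic hook can end only the panicking thread's spans. It illustrates the limitation just described, since a span opened on another thread is not reachable from the hook.

```rust
use std::cell::RefCell;
use std::panic;
use std::sync::Mutex;

/// Names of spans that were ended, standing in for "exported".
pub static ENDED: Mutex<Vec<String>> = Mutex::new(Vec::new());

thread_local! {
    /// Hypothetical per-thread stack of open spans: the "current context".
    static OPEN: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

/// Open a span on the calling thread (normal span ending is omitted here).
pub fn start_span(name: &str) {
    OPEN.with(|stack| stack.borrow_mut().push(name.to_string()));
}

/// Number of spans still open on the calling thread.
pub fn open_span_count() -> usize {
    OPEN.with(|stack| stack.borrow().len())
}

/// On panic, end the panicking thread's open spans (innermost first), then
/// run the previous hook. Other threads' stacks are not visible from here.
pub fn install_end_current_context_on_panic() {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        OPEN.with(|stack| {
            // try_borrow_mut avoids a second panic if the stack was mid-update.
            if let Ok(mut stack) = stack.try_borrow_mut() {
                let mut ended = ENDED.lock().unwrap_or_else(|p| p.into_inner());
                ended.extend(stack.drain(..).rev());
            }
        });
        previous(info);
    }));
}
```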
Writing things to a file doesn't have the second problem, and if you do things right it doesn't have the first.
I'm using tracing to output to OpenTelemetry. Tracing has hooks for when spans start, and it continues to function in a panic hook. I get hundreds of log lines in stdout, but NOTHING ends up sent to Jaeger/Zipkin via OpenTelemetry, due to an error 30s into startup.
I don't know whether it is a design or an implementation problem, but all the monitoring "solutions" (Jaeger, Zipkin, OpenTelemetry, etc.) that I have come across do not deal with unfinished spans (which seems to be because span start doesn't emit anything). This seems to me a basic viability issue (for OpenTelemetry as a whole).
Not trying to bash the maintainers here (and it seems like a wider issue anyway), I just want to point out that:
- This is a BIG problem
- I don't think I'm the only person who will lose a few hours banging their head against a wall thinking "surely there is a way to do this"
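A minimal sketch of the file-style approach mentioned above, addressing the point that span start doesn't emit anything: write a record when a span starts as well as when it ends, flushing each one, so that after a crash any span with a start record but no end record can be identified as unfinished. `StartEndLog`, `unfinished_spans` and the line format are hypothetical, made up for illustration; with `tracing`, the `tracing-subscriber` `Layer` callbacks `on_new_span` and `on_close` are one place such writes could hang off.

```rust
use std::collections::HashSet;
use std::io::{self, Write};

/// Writes one line per span start and per span end, flushing after each,
/// so a crash leaves evidence of spans that started but never ended.
pub struct StartEndLog<W: Write> {
    out: W,
}

impl<W: Write> StartEndLog<W> {
    pub fn new(out: W) -> Self {
        StartEndLog { out }
    }

    pub fn span_started(&mut self, id: u64, name: &str) -> io::Result<()> {
        writeln!(self.out, "start {id} {name}")?;
        self.out.flush()
    }

    pub fn span_ended(&mut self, id: u64) -> io::Result<()> {
        writeln!(self.out, "end {id}")?;
        self.out.flush()
    }

    pub fn into_inner(self) -> W {
        self.out
    }
}

/// Names of spans that have a `start` line but no matching `end` line,
/// i.e. spans that were still open when the log stopped.
pub fn unfinished_spans(log: &str) -> Vec<String> {
    let mut started: Vec<(u64, String)> = Vec::new();
    let mut ended = HashSet::new();
    for line in log.lines() {
        let mut parts = line.splitn(3, ' ');
        let kind = parts.next();
        let id = parts.next().and_then(|s| s.parse::<u64>().ok());
        match (kind, id) {
            (Some("start"), Some(id)) => {
                started.push((id, parts.next().unwrap_or("").to_string()))
            }
            (Some("end"), Some(id)) => {
                ended.insert(id);
            }
            _ => {} // ignore malformed or truncated lines
        }
    }
    started
        .into_iter()
        .filter(|(id, _)| !ended.contains(id))
        .map(|(_, name)| name)
        .collect()
}
```

Passing a `File` instead of a `Vec<u8>` gives the on-disk variant; the explicit `flush` after every record is what lets the data survive an abort.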
> The `Span` holds a weak reference to `TracerProvider` via `Tracer`. But not the other way around.
Conceptually, why is this? Could we do it the other way around?
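One way to picture "the other way around" is a provider that also keeps weak references to every live span, so a panic hook could walk that list and end whatever is still open. This is a hypothetical, std-only sketch (`Registry`, `SpanHandle`, `end_all_live` are made-up names, not opentelemetry-rust API); it keeps the span's link to the provider weak, as described in the quote. The trade-off it makes visible is that every span start now takes a shared lock on the registry.

```rust
use std::sync::{Arc, Mutex, Weak};

/// Span state shared between the handle and the registry.
struct SpanState {
    name: String,
    ended: bool,
}

/// Hypothetical provider that holds weak references to live spans.
pub struct Registry {
    live: Mutex<Vec<Weak<Mutex<SpanState>>>>,
    exported: Mutex<Vec<String>>,
}

/// Hypothetical span handle; as in the Drop-based design, dropping it ends the span.
pub struct SpanHandle {
    state: Arc<Mutex<SpanState>>,
    registry: Weak<Registry>, // span -> provider link stays weak
}

/// End a span at most once and record it as exported.
fn end_span(state: &Mutex<SpanState>, registry: &Registry) {
    let mut s = state.lock().unwrap_or_else(|p| p.into_inner());
    if !s.ended {
        s.ended = true;
        registry.exported.lock().unwrap_or_else(|p| p.into_inner()).push(s.name.clone());
    }
}

impl Registry {
    pub fn new() -> Arc<Self> {
        Arc::new(Registry { live: Mutex::new(Vec::new()), exported: Mutex::new(Vec::new()) })
    }

    pub fn start_span(self: &Arc<Self>, name: &str) -> SpanHandle {
        let state = Arc::new(Mutex::new(SpanState { name: name.to_string(), ended: false }));
        let mut live = self.live.lock().unwrap_or_else(|p| p.into_inner());
        // Prune entries whose spans have already been dropped.
        live.retain(|w| w.strong_count() > 0);
        live.push(Arc::downgrade(&state));
        SpanHandle { state, registry: Arc::downgrade(self) }
    }

    /// What a panic hook could call: end and export every span still alive.
    pub fn end_all_live(&self) {
        let live = self.live.lock().unwrap_or_else(|p| p.into_inner());
        for weak in live.iter() {
            if let Some(state) = weak.upgrade() {
                end_span(&state, self);
            }
        }
    }

    pub fn exported(&self) -> Vec<String> {
        self.exported.lock().unwrap_or_else(|p| p.into_inner()).clone()
    }
}

impl Drop for SpanHandle {
    fn drop(&mut self) {
        if let Some(registry) = self.registry.upgrade() {
            end_span(&self.state, &registry);
        }
    }
}
```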
Related to https://github.com/open-telemetry/opentelemetry-rust/issues/1209