Library crates should propagate errors instead of silently logging them
Description
Problem
The OpenTelemetry Rust SDK currently logs errors on behalf of applications, which is inappropriate for library crates. While ADR-001 provides error handling guidance, it includes a problematic allowance:
Failures during regular operation should not panic, instead returning errors to the caller where appropriate, or logging an error if not appropriate.
This guidance needs to be updated. For library crates, it is never appropriate to log errors. Per CONTRIBUTING.md, the SDK should either return errors to callers or delegate to a global error handler registered by the application. However, many codepaths are logging directly instead, leaving applications unable to respond to failures.
Example from span_processor.rs:
```rust
fn on_end(&self, span: SpanData) {
    let result = self.exporter.lock().map(|mut exporter| {
        exporter.export(vec![span])
    });
    if let Err(err) = result {
        otel_error!(
            name: "BatchSpanProcessor.Export.Error",
            error = format!("{:?}", err)
        );
    }
}
```
Why this is problematic:
Library crates logging on behalf of applications creates several problems:
- No error visibility: Applications cannot detect, count, or respond to failures
- No integration: Errors cannot be integrated with the application's monitoring, alerting, or metrics systems
- Inconsistent formatting: Library logs don't match the application's logging format, style, or context (request IDs, etc.), causing confusion for operators and breaking log ingestion pipelines
- Policy violations: The library makes policy decisions (what to log, when, how) that belong to the application
Standard Rust library crates (std, tokio, serde, etc.) return errors and let applications decide how to handle them. OpenTelemetry Rust should follow the same pattern.
Proposed Solution
For synchronous operations with a direct caller:
- Return `OTelSdkResult` or appropriate error types defined in `opentelemetry-sdk::error`
- Let callers decide whether to log, retry, or propagate errors
- Aligns with the existing `SpanExporter`, `LogExporter`, and `PushMetricExporter` traits, which already return `OTelSdkResult`
For background/asynchronous operations without a direct caller:
- Implement an error callback mechanism via `with_error_handler()` on processor builders (see the sketch below)
- The callback is invoked when background tasks fail
- Users can then log, emit metrics, trigger alerts, or implement custom strategies
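A rough sketch of how this could look from the application's side. `with_error_handler()` is the proposed (not yet existing) builder method; the other names follow recent SDK versions, and the construction is abbreviated:

```rust
// Hypothetical usage sketch; `with_error_handler` does not exist in the SDK yet.
let processor = BatchSpanProcessor::builder(exporter)
    .with_error_handler(|err| {
        // The application decides what to do: log, count a metric, alert, retry...
        tracing::warn!("OTel span export failed: {err}");
    })
    .build();

let provider = SdkTracerProvider::builder()
    .with_span_processor(processor)
    .build();

// Synchronous operations return `OTelSdkResult`, so the caller can react directly.
if let Err(err) = provider.shutdown() {
    eprintln!("failed to shut down tracer provider: {err}");
}
```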
Affected Areas
Traces (High Priority)
- [x] `opentelemetry-sdk/src/trace/span_processor.rs` - Remove error logging in batch/simple processors
- [x] `opentelemetry-sdk/src/trace/span_processor_with_async_runtime.rs` - Add error callback for background exports
- [x] `opentelemetry-sdk/src/trace/provider.rs` - Remove redundant error logging in shutdown
Metrics (High Priority)
- [ ] `opentelemetry/src/metrics/instruments.rs` - `InstrumentProvider` trait methods should return `Result`
- [ ] `opentelemetry-sdk/src/metrics/meter.rs` - Return errors instead of logging and creating no-op instruments
- [ ] `opentelemetry-sdk/src/metrics/meter_provider.rs` - Propagate shutdown errors per ADR-001 patterns
- [ ] Periodic reader implementations - Expose background export errors via error callback
Logs (High Priority)
- [ ] `opentelemetry-sdk/src/logs/log_processor.rs` - Make `LogProcessor::emit()` fallible
- [ ] `opentelemetry-sdk/src/logs/simple_log_processor.rs` - Return errors from emit operations
- [ ] `opentelemetry-sdk/src/logs/log_processor_with_async_runtime.rs` - Add error callback for background processing
Other
- [ ] `opentelemetry-zipkin/src/exporter/env.rs` - Replace `eprintln!` with proper error returns
- [ ] Update examples to demonstrate proper error handling
- [ ] Update tests to verify error propagation
Implementation Strategy
- Phase 1: Traces
  - Remove all `otel_error!`, `otel_warn!`, `otel_debug!` calls that mask export failures
  - Add `with_error_handler()` to `BatchSpanProcessorBuilder` (see the sketch after this list)
  - Background export errors invoke the user-provided callback
  - Synchronous operations return `OTelSdkResult`
- Phase 2: Metrics
  - Update trait definitions to return `Result` types per ADR-001 guidance
  - Implement error callbacks for periodic readers
  - Update the meter implementation to propagate errors from instrument creation
- Phase 3: Logs
  - Make `LogProcessor::emit()` fallible where appropriate
  - Add error callbacks for async log processors
  - Update log appenders to propagate errors
- Phase 4: Documentation & Examples
  - Update ADR-001 to remove the allowance for logging errors in library crates
  - Update all examples to demonstrate proper error handling
  - Add a migration guide documenting breaking changes
  - Document error callback patterns and best practices
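A minimal sketch of how the Phase 1 callback could be wired inside a processor's background task, assuming an opt-in handler stored on the worker. The type names here are placeholders, not the final API:

```rust
use std::sync::Arc;

// Placeholder for the SDK's error type (e.g. `OTelSdkError`); illustrative only.
#[derive(Debug)]
struct ExportError(String);

// Hypothetical user-registered callback, set via `with_error_handler()`.
type ErrorHandler = Arc<dyn Fn(&ExportError) + Send + Sync>;

struct BatchWorker {
    // `None` means the application did not opt in; a default handler could keep
    // the current internal logging behavior for less advanced use cases.
    error_handler: Option<ErrorHandler>,
}

impl BatchWorker {
    fn on_export_result(&self, result: Result<(), ExportError>) {
        if let Err(err) = result {
            // Instead of `otel_error!(...)`, hand the failure to the application.
            if let Some(handler) = &self.error_handler {
                handler(&err);
            }
            // The background task keeps running; the failure never panics or
            // blocks the hot path.
        }
    }
}

fn main() {
    let worker = BatchWorker {
        error_handler: Some(Arc::new(|err| eprintln!("export failed: {err:?}"))),
    };
    worker.on_export_result(Err(ExportError("endpoint unreachable".into())));
}
```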
Backward Compatibility
This is a breaking change that will require:
- Minor version bump (0.x -> 0.y, as the crate is pre-1.0)
- Migration guide for users updating from previous versions
- Updated examples and documentation
- Update to ADR-001 clarifying that library crates must never log errors on behalf of applications
However, the benefits justify the breaking change:
- Proper library design following Rust best practices and standard library conventions
- Better error visibility and control for applications
- Enables custom error handling strategies (retry, metrics, alerting)
- Improved debuggability and observability in production
Additional Context
Why ADR-001 allows logging:
The allowance for logging "where errors cannot be returned" likely stems from background operations where there's no direct caller. However, the solution is not to log, but rather to:
- Use error callbacks that applications can register
- Delegate to a global error handler if one is registered
- Return errors wherever possible
Technical feasibility:
- The OpenTelemetry specification requires operations like `on_end()` to be fast and non-blocking, but does not mandate void return types
- Returning an error does not violate the non-blocking requirement
- Error callbacks provide a way to handle background failures without blocking the hot path
- This change aligns with the existing exporter traits (`SpanExporter`, `LogExporter`, `PushMetricExporter`), which already return `OTelSdkResult` from their methods
Related: This aligns with Rust's error handling best practices, the Rust standard library's patterns, and the principle that libraries should be "honest" about failures, letting applications make all policy decisions about logging, retrying, and error handling.
To help mature the project, I propose we shift from internal logging to error propagation. I've prepared PR #3211 as a proof-of-concept. I'm looking for alignment on this approach before I apply the same pattern to all other signals.
I'm completely open to feedback and happy to iterate to reach a consensus first.
I believe internal logging is the general approach used across most OpenTelemetry implementations and aligns with the OpenTelemetry error-handling guidelines, which recommend logging internal errors instead of returning them to the user.
The proper approach would be to configure the error handler at the SDK level (via TracerProvider or LoggerProvider) and use it for error logging. In the case of otel-rust, since internal logs already use the tracing macros, a separate error handler isn’t necessary — the configured tracing subscriber itself can handle and route errors appropriately.
Respectfully, I disagree with your interpretation of OpenTelemetry error-handling guidelines. For the rest of this comment, I will categorize the analysis into two categories: "Error suppression," meaning what this library does now (logging errors only and not returning them), and "Error propagation," meaning when errors occur, they are always returned to the caller (or, if returning to a caller is not possible, they are returned via a callback method).
Going through the relevant error-handling guidelines:
OpenTelemetry implementations MUST NOT throw unhandled exceptions at run time.
- Error suppression: pass
- Error propagation: pass
- API methods MUST NOT throw unhandled exceptions when used incorrectly by end users. The API and SDK SHOULD provide safe defaults for missing or invalid arguments. For instance, a name like `empty` may be used if the user passes in `null` as the span name argument during `Span` construction.
- Error suppression: pass
- Error propagation: pass
- The API or SDK MAY fail fast and cause the application to fail on initialization, e.g. because of a bad user config or environment, but MUST NOT cause the application to fail later at run time, e.g. due to dynamic config settings received from the Collector.
- Error suppression: fail — does not allow an application that uses the SDK to fail fast, because it cannot react to those errors without parsing logs in the tracing subscriber
- Error propagation: pass
- The SDK MUST NOT throw unhandled exceptions for errors in their own operations. For example, an exporter should not throw an exception when it cannot reach the endpoint to which it sends telemetry data.
- Error suppression: pass
- Error propagation: pass
- Background tasks (e.g. threads, asynchronous tasks, and spawned processes) should run in the context of a global error handler to ensure that exceptions do not affect the end user application.
- Error suppression: fail — background tasks do not have a global error handler (maybe it's a pass if a logging facade is considered a global error handler, but that is a stretch for reasons I'll get to at the end of this comment)
- Error propagation: pass
- Long-running background tasks should not fail permanently in response to internal errors. In general, internal exceptions should only affect the execution context of the request that caused the exception.
- Error suppression: pass
- Error propagation: pass
- Internal error handling should follow language-specific conventions. In general, developers should minimize the scope of error handlers and add special processing for expected exceptions.
- Error suppression: pass
- Error propagation: pass
Whenever the library suppresses an error that would otherwise have been exposed to the user, the library SHOULD log the error using language-specific conventions. SDKs MAY expose callbacks to allow end users to handle self-diagnostics separately from application code.
- Error suppression: pass — though this is precisely my issue: the SDK has decided to suppress errors rather than propagate them.
- Error propagation: pass — we would not suppress any errors, so there would be no need to log on behalf of the user.
SDK implementations MUST allow end users to change the library's default error handling behavior for relevant errors. Application developers may want to run with strict error handling in a staging environment to catch invalid uses of the API, or malformed config. Note that configuring a custom error handler in this way is the only exception to the basic error handling principles outlined above. The mechanism by which end users set or register a custom error handler should follow language-specific conventions.
- Error suppression: fail — it's nearly impossible to change the default error handling behavior.
- Error propagation: pass
Please correct me if I'm misinterpreting one of these guidelines.
I want to give you a real-world example that caused me to open this issue. I'm currently developing an application that uses opentelemetry-sdk to export traces, meaning it's a client to an OTLP server. Let's say the client-server connection is secured with mTLS. The server rotates certs. Independently, the client re-establishes its connection to the server without the operator having a chance to update its mTLS settings. The result is data loss, unless the application itself is able to respond to the connection error and perform some type of retry buffering on the exports.
Right now, achieving this is very difficult because we do not propagate ALL errors out of opentelemetry-sdk. The only way this would be possible is to set up a tracing subscriber layer that parses the error logs and decides what to do when errors occur. While doable, this approach is very error-prone because it relies on specific strings in the SDK logs never changing. It is also computationally expensive, since we now have to perform string pattern matching.
As an alternative to my original suggestion, what if instead of removing those error logs, we can still log them, but also propagate them? My only concern here is that the application would then need to decide which errors to log and which not to, because otherwise it might log errors twice. This increases the complexity for what the application needs to handle. The guidance we should give people is to not log errors from the opentelemetry libraries because they will be logged by the tracing subscriber, and that propagated errors can be used for further application requirements.
I want to give you a real-world example that caused me to open this issue. I'm currently developing an application that uses opentelemetry-sdk to export traces, meaning it's a client to an OTLP server. Let's say the client-server connection is secured with mTLS. The server rotates certs. Independently, the client re-establishes its connection to the server without the operator having a chance to update its mTLS settings. The result is data loss, unless the application itself is able to respond to the connection error and perform some type of retry buffering on the exports.
That’s a valid example, but I’d still consider it an operational reliability concern rather than something that should be addressed through SDK-level error propagation.
When exporters fail due to certificate rotation, network issues, or endpoint downtime, the recovery logic should live inside the processor or exporter - using retries, exponential backoff, and buffering - not in application code. The SDK’s job is to surface failures through diagnostics (metrics, structured logs, or a configurable error handler), so applications can observe and respond when issues persist, without coupling their runtime logic to telemetry internals.
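For illustration, a generic retry-with-exponential-backoff wrapper of the kind that could live inside a processor/exporter. This is a sketch only; `try_export` is a stand-in for the real export call and a tokio runtime is assumed:

```rust
use std::time::Duration;

// Stand-in for the actual exporter call (e.g. an OTLP export).
async fn try_export(_batch: &[String]) -> Result<(), String> {
    Err("endpoint unreachable".into())
}

// Bounded retries with exponential backoff, performed inside the exporter so
// the application never has to react to transient failures itself.
async fn export_with_backoff(batch: Vec<String>, max_attempts: u32) -> Result<(), String> {
    let mut delay = Duration::from_millis(100);
    let mut last_err = String::from("no attempts made");
    for _ in 0..max_attempts {
        match try_export(&batch).await {
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = e;
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(30));
            }
        }
    }
    Err(last_err)
}

#[tokio::main]
async fn main() {
    if let Err(e) = export_with_backoff(vec!["span".into()], 3).await {
        eprintln!("giving up after retries: {e}");
    }
}
```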
I do agree that SDK initialization failures are different - returning errors at startup makes sense, since misconfiguration or bad credentials should fail fast before telemetry starts flowing. But once the application is running, background exporter failures shouldn’t interrupt or bubble up through user code.
That’s a valid example, but I’d still consider it an operational reliability concern rather than something that should be addressed through SDK-level error propagation.
When exporters fail due to certificate rotation, network issues, or endpoint downtime, the recovery logic should live inside the processor or exporter - using retries, exponential backoff, and buffering - not in application code.
I concede my point, particularly for situations that can be handled in a general way: responding to failures using retries, exponential backoff, and buffering is acceptable. However, when the application needs to persist buffers beyond its current runtime so that it can recover from restarts, there will need to be ways to hook into the buffered state. Again, to your point, that might not be solved through error propagation.
The key insight from your feedback is that I was conflating error propagation (which implies the caller must handle it) with diagnostic callbacks (which supplement existing logging for monitoring purposes).
The SDK’s job is to surface failures through diagnostics (metrics, structured logs, or a configurable error handler), so applications can observe and respond when issues persist, without coupling their runtime logic to telemetry internals.
Would you agree that this part is missing from the SDK today? I don't think the SDK notifies the application when there are persistent issues, which could involve an async error handler, albeit different from what I've implemented in my original PR commit. With my latest push, that is taken care of.
I do agree that SDK initialization failures are different - returning errors at startup makes sense, since misconfiguration or bad credentials should fail fast before telemetry starts flowing. But once the application is running, background exporter failures shouldn’t interrupt or bubble up through user code.
To improve my understanding of your position, why not bubble up through user code? My take is that performing both logging and error propagation through the SDK offers the most flexibility, enabling applications to deliver on bespoke requirements. The error handler doesn't "bubble up" in the traditional sense - it's an opt-in callback for diagnostics, not a required error handling path. Background operations continue regardless of whether a handler is registered.
PS, I've updated my PR #3211 to align with this approach.
To improve my understanding of your position, why not bubble up through user code? My take is that performing both logging and error propagation through the SDK offers the most flexibility, enabling applications to deliver on bespoke requirements. The error handler doesn't "bubble up" in the traditional sense - it's an opt-in callback for diagnostics, not a required error handling path. Background operations continue regardless of whether a handler is registered.
When I said "don't bubble up," I meant don't make `on_end()` return `Result`, which would force synchronous error handling at every `span.end()` call. This isn't always possible when spans end via function scope (Drop). The callback approach does propagate errors to user code, but at the right granularity (export failures) without blocking the hot path. #3211 seems to be headed in the right direction once we agree on using the error handler for the simple processor too.
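To make the Drop point concrete, a small sketch using the standard `opentelemetry` API (without an installed provider the global tracer is a no-op, so this is illustrative only):

```rust
use opentelemetry::global;
use opentelemetry::trace::Tracer;

fn handle_request() {
    let tracer = global::tracer("example");
    let _span = tracer.start("handle_request");

    // ... application work ...

} // `_span` is dropped here: the SDK span ends inside `Drop`, and the
  // processor's `on_end()` runs at that point, so there is no call site that
  // could receive and handle a `Result` from it synchronously.

fn main() {
    handle_request();
}
```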
But looking for more feedback here, I believe @cijothomas would be interested here too.
Excellent discussion. Overall, the plan was always to continue to offer a callback to let app owners deal with failures, if they want to, and to provide a default callback that does the internal logging (for the less advanced use cases). Totally supportive of the idea of letting users provide a callback to deal with errors themselves.
The only way this would be possible is to set up a tracing subscriber parser for error logs, which then parses logs and decides what to do when errors occur. While doable, this process is very error-prone because it would rely on specific strings to not change in the SDK logs. Also, it's expensive computationally because now we have to perform string pattern matching.
If we follow proper structured logging, then there should be no need to parse strings - every internal log would have a well-defined "EventName", and for each event the schema (the keys used for the attributes) would be fixed. Making breaking changes here would be disallowed. We have been striving to get to this state, but we are not there yet.
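For example, once event names are stable, an application-side `tracing` layer could key off `event.metadata().name()` instead of parsing message strings (the event name below is illustrative; the stability of such names is exactly the part that is not yet guaranteed):

```rust
use tracing::{Event, Subscriber};
use tracing_subscriber::layer::{Context, Layer};
use tracing_subscriber::prelude::*;

struct OtelSdkErrorLayer;

impl<S: Subscriber> Layer<S> for OtelSdkErrorLayer {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        // Match on the structured event name, not on the message text.
        if event.metadata().name() == "BatchSpanProcessor.Export.Error" {
            // React here: bump a counter, trigger alerting, etc.
        }
    }
}

fn main() {
    tracing_subscriber::registry().with(OtelSdkErrorLayer).init();
}
```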
I want to give you a real-world example that caused me to open this issue. I'm currently developing an application that uses opentelemetry-sdk to export traces, meaning it's a client to an OTLP server. Let's say the client-server connection is secured with mTLS. The server rotates certs. Independently, the client re-establishes its connection to the server without the operator having a chance to update its mTLS settings. The result is data loss, unless the application itself is able to respond to the connection error and perform some type of retry buffering on the exports.
This should be solved regardless. We just added retry capabilities, with some minimal ability to configure them. I can imagine advanced users wanting to have more control over this. Is this the only scenario, or are there more? Could you share more, so we can discuss whether we should solve the individual problems instead of fully reworking the logging/callback approach.
(I'll review the opened PRs to share more specific feedbacks)