opentelemetry-rust icon indicating copy to clipboard operation
opentelemetry-rust copied to clipboard

Stabilization - OTLP

Open scottgerring opened this issue 5 months ago • 16 comments

This is a ticket to capture stabilization efforts around the OTLP crate. It will likely be composed of a review of the current state and some architectural documentation thereof, some proposed changes, and then the implementation of those changes.

We'll cover off everything in the compliance matrix for exporters here too.

scottgerring avatar Jul 11 '25 09:07 scottgerring

Open Issues & PRs related to OTLP as of 2025-07-10

Issues (latest updated first)

# Title (abridged) Updated Area
#3045 Simplify opentelemetry-proto (SDK decoupling, gRPC separation) 2025-06-30 Architecture
#2881 Allow async interceptors for OTLP gRPC exporter 2025-06-17 Extensibility
#2802 OTLP MetricExporter deadlock 2025-05-19 Metrics / runtime
#1987 Broken-pipe errors when exporting via OTLP 2025-03-31 Stability
#2877 Prevent telemetry-induced telemetry in OTLP exporter 2025-03-27 Performance
#2777 Handle shutdown in OTLP/gRPC 2025-03-11 Lifecycle
#2602 OTLP File Exporter 2025-02-27 Feature request
#1592 OTLP Exporter builder ergonomics 2024-12-14 API
#2364 2024-11-27 OTLP Exporter Retry & Persistence

PRs (latest updated first)

# Title Updated Purpose
PR #3046 SDK decoupling & gRPC separation 2025-07-06 Follow-up to issue #3045
PR #2812 Add shutdown for spans 2025-07-04 Lifecycle correctness
PR #2727 Add Retry to OTLP exporter 2025-06-25 Spec compliance (RetryInfo)
PR #3003 Guard against enabling multiple HTTP features 2025-05-27 Build-time safety
PR #2465 Configure TLS via env vars 2025-04-09 Ease of use
PR #2491 Fix (un)serialize edge-cases in metrics 2025-03-17 Data model
PR #2524 OTLP support matrix doc 2025-03-06 Documentation
PR #2758 Implement FromStr/Display for Protocol 2025-03-06 API polish

scottgerring avatar Jul 18 '25 12:07 scottgerring

I believe we're misaligned with these bits of the spec so far. Links to code where the thing would be happening and relevant bit of the spec.

Spec Compliance

Rank Spec Requirement (section) Current Status Action Importance Code Spec
1 MUST: Exporters respect RetryInfo / Retry-After (Retry) gRPC & HTTP ignore retry mechanisms. Implement exponential back-off based on metadata / header. High http/trace.rs#L49 / tonic/trace.rs#L87 https://opentelemetry.io/docs/specs/otel/protocol/exporter/#retry
2 MUST: Logs SeverityNumber 1-24 mapping Exporter doesn't validate; may emit invalid values (regular SDK mapping is correct, but direct OTLP construction could bypass validation). No action - existing indirection over serialization models sufficient High logs.rs#L148 https://opentelemetry.io/docs/specs/otel/logs/data-model/#field-severitynumber
3 SHOULD: Support gzip and zstd compression HTTP path lacks all compression support (both gzip and zstd). Implement compression for HTTP. Medium http/mod.rs#L47 https://opentelemetry.io/docs/specs/otel/protocol/exporter/#compression
4 SHOULD: Partial success body SHOULD be logged / propagated Both HTTP and gRPC discard partial success information. Bubble up info via ExporterStatus. Medium tonic/trace.rs#L90 https://opentelemetry.io/docs/specs/otlp/#partial-success

Additional Observations

Concurrent Export Limits

The OTLP specification recommends concurrent requests for high throughput:

implementations that need to achieve high throughput SHOULD support concurrent operations [...] The number of concurrent requests SHOULD be configurable.

As it stands we support concurrent requests, but do not provide a mechanism to limit them. In other languages this can be configured via the default HTTP client (e.g., java with OkHttp). Dotnet also considered what to do here some time ago.

Protocol Enum Ambiguity

OpenTelemetry Protocol Requirements: The OTLP specification supports multiple transport protocols: gRPC, HTTP/protobuf, and HTTP/JSON.

We model this using a fluent builder style API - first to select the transport protocol (note the feature gates):

https://github.com/open-telemetry/opentelemetry-rust/blob/c5fde17e7d17af9f2987b07f9d3f80c03291b7cc/opentelemetry-otlp/src/span.rs#L60-L61

https://github.com/open-telemetry/opentelemetry-rust/blob/c5fde17e7d17af9f2987b07f9d3f80c03291b7cc/opentelemetry-otlp/src/span.rs#L68-L70

Then, to select the encoding (rather confusingly called "protocol"):

https://github.com/open-telemetry/opentelemetry-rust/blob/c5fde17e7d17af9f2987b07f9d3f80c03291b7cc/opentelemetry-otlp/src/exporter/mod.rs#L258

https://github.com/open-telemetry/opentelemetry-rust/blob/c5fde17e7d17af9f2987b07f9d3f80c03291b7cc/opentelemetry-otlp/src/lib.rs#L440-L450

This means we can select impossible combinations which are silently remapped at runtime:

HTTP Transport + gRPC Protocol: with_http().with_protocol(Protocol::Grpc) uses HTTP transport with protobuf instead of actual gRPC wire format, silently accepting semantically incorrect configuration.

exporter/http/mod.rs:303 - Protocol::Grpc falls through to _ wildcard and gets HTTP+protobuf instead of failing

gRPC Transport + HTTP Protocol: with_tonic().with_protocol(Protocol::HttpJson) ignores protocol setting entirely, always uses gRPC regardless of specified protocol

exporter/tonic/mod.rs:140 - TonicExporterBuilder hard-codes protocol: Protocol::Grpc regardless of user configuration

Feature-Gated Runtime Behavior: Protocol::HttpJson compiles successfully but silently falls back to protobuf when http-json feature is disabled exporter/http/mod.rs:298-299 - #[cfg(feature = "http-json")] arm disappears at compile time without the feature

Graceful gRPC Shutdown (#2777)

The current shutdown implementation simply takes the ClientInner via self.inner.take() without explicit client connection cleanup, and Tonic lacks direct connection closing APIs for clients. This can leave gRPC client connections and resources improperly cleaned up in production systems where exporters are frequently created and destroyed.

scottgerring avatar Jul 18 '25 12:07 scottgerring

Exporter doesn't validate; Logs SeverityNumber 1-24 mapping

I am not sure if Exporter should do anything to validate..Any reason why you think exporter must do this? (In my view, Exporters job is to serialize and transport the data only, not ensuring the validity...)

cijothomas avatar Jul 18 '25 15:07 cijothomas

Summary is legacy and may not be required).

Summary is not required. There is no way to produce them in the SDK also.

cijothomas avatar Jul 18 '25 15:07 cijothomas

Summary is not required. There is no way to produce them in the SDK also.

Removed!

Exporter doesn't validate; Logs SeverityNumber 1-24 mapping

I am not sure if Exporter should do anything to validate..Any reason why you think exporter must do this? (In my view, Exporters job is to serialize and transport the data only, not ensuring the validity...)

I don't think it's strictly necessary, but the "robustness principle" would encourage us to. Having said that I can't see an ergonomic way of doing it - we want the types to be pub so that folks can use them directly, which we see in other requests to the project, and I don't think we want code in the exporters looping over the records and mutating values.

scottgerring avatar Jul 22 '25 08:07 scottgerring

I'm working through the points in the compliance matrix to reconfirm what else we need. There's a few things in here it seems like we now support. @cijothomas is it worth updating the matrix?

Feature Rust Complete? Notes
Exporter interface has ForceFlush - OTLP exporters lack ForceFlush implementation - see trace exporter
OTLP/HTTP JSON Protobuf Exporter Protocol::HttpJson handling with http-json feature
OTLP/HTTP gzip Content-Encoding support Only gRPC compression via gzip-tonic feature, HTTP lacks compression
Honors retryable responses with backoff No retry logic in HTTP trace exporter
Honors non-retryable responses All errors treated generically in HTTP exporters
Honors throttling response No 429 throttling handling in HTTP exporters
Multi-destination spec compliance Single endpoint per exporter in builder pattern
SchemaURL in ResourceSpans and ScopeSpans Resource and Scope schema_url mapping in trace transformations
SchemaURL in ResourceMetrics and ScopeMetrics Resource and Scope schema_url mapping in metrics transformations
SchemaURL in ResourceLogs and ScopeLogs Resource and Scope schema_url mapping in logs transformations
Honors the user agent spec Default User-Agent header construction with proper format
Partial Success messages are handled and logged for OTLP/gRPC Proto definitions exist but no handling in gRPC trace exporter
Partial Success messages are handled and logged for OTLP/HTTP Response parsing missing in HTTP trace exporter
Metric Exporter configurable temporality preference MetricExporterBuilder::with_temporality() implementation
Metric Exporter configurable default aggregation Missing AggregationSelector trait and builder configuration

scottgerring avatar Jul 22 '25 10:07 scottgerring

I'm working through the points in the compliance matrix to reconfirm what else we need. There's a few things in here it seems like we now support. @cijothomas is it worth updating the matrix?

Yes, let's update the matrix to keep it accurate. I also suggest to focus only on OTLP and ignore Zipkin and Prometheus now.

No in-memory test exporter provided in OTLP crate

Its not part of OTLP crate, and not required to be either.

cijothomas avatar Jul 22 '25 15:07 cijothomas

Concurrent Export Limits

lets not worry about it for now in the OTLP Exporters, as upstream sdk (BatchProcessor) only calls export sequentially. We need to first address that issue, and then come back to add concurrent export in OTLP. https://github.com/open-telemetry/opentelemetry-rust/pull/3028 adds concurrent_export for Spans, but only for the experimental feature, not the default Batchprocessor.

cijothomas avatar Jul 22 '25 19:07 cijothomas

Metric Exporter configurable default aggregation

I'd treat this as very low priority, and move it outside scope for stable release. It is not something we have seen asks for. (Also using Views + experimental feature, this can be achieved on a per metric basis)

cijothomas avatar Jul 22 '25 19:07 cijothomas

Blocking for stable

Dealing with responses and failures ...

Feature Brief description (spec vs. current gap) Issue / PR Notes
Retry & Throttling Exporter MUST honor RetryInfo / Retry‑After for both generic retryable errors and throttling signals (HTTP 429 / gRPC resource_exhausted), applying exponential back‑off. gRPC & HTTP paths ignore this today. #3126 Done
Partial success Exporter SHOULD surface and handle OTLP “partial success” codes & messages instead of silently discarding them. #3206 Done

API cleanliness:

Feature Brief description (spec vs. current gap) Issue / PR Notes
Protocol enum ambiguity Builder lets users pick impossible combos (e.g. HTTP + Protocol::Grpc) and silently remaps them. #2758 PR polishes enum conversions; still missing upfront validation/error.

Misc:

Feature Brief description (spec vs. current gap) Issue / PR Notes
gzip & zstd compression HTTP exporter should offer both gzip and zstd; only gRPC supports gzip today. #3083 Done
Telemetry‑induced telemetry guard Exporter should avoid emitting telemetry for its own network calls to prevent feedback loops. #2877 Issue tracks guard logic; cross‑thread libs (hyper, tonic) still leak context. @cijothomas looking at this currently.

Misc / non-blocking

Feature Brief description (spec vs. current gap) Issue / PR Notes
ForceFlush Exporters should implement force_flush() per SDK spec; currently missing for all signals. Needs unified async/timeout design.
Multi‑destination endpoints Spec allows exporters to target multiple OTLP endpoints; Rust builder supports only one. Requires vector of endpoint configs plus round‑robin / fail‑over handling.

scottgerring avatar Jul 24 '25 12:07 scottgerring

@cijothomas I think that list at the bottom captures everything we've discussed being blocking. LMK if i've missed anything, otherwise it's time to start working on it!

scottgerring avatar Jul 24 '25 12:07 scottgerring

#3110 also related

scottgerring avatar Aug 04 '25 10:08 scottgerring

Removed #2364 as it is not specced and not blocking for OTLP stabilization.

scottgerring avatar Nov 10 '25 08:11 scottgerring

We've done everything we identified as blocking based on the earlier walk through of the spec against the implementation. The big things from the spec gap analysis - retries, partial success, compression - are done. What are we missing?

Potentially related outstanding OTLP issues:

Maybe blocking

  • #3045
  • #2994 - should be a quick fix; i'll pick this up now
  • #2877 - not sure of the status of this - @cijothomas / @bantonsson I know some work has gone into it ?
  • #2802 - this is a bit nerve wracking
  • #2262 - this looks almost done. i'll pick it up and close it.

Not Blocking

  • #3123 - I don't think this is likely to force a public API change, so I don't think it should block
  • #2881 - this is an enhancement request and goes beyond the spec
  • #2576

@cijothomas @lalitb what of this is blocking to call OTLP stable ? What else are we missing?

scottgerring avatar Dec 05 '25 10:12 scottgerring

Hi @scottgerring. Is there anything I can help contribute in? I'm new to the OTLP scene, so might need some guidance. But happy to help put in my time if I can get some mentoring

rakshith-ravi avatar Dec 05 '25 10:12 rakshith-ravi

Hey @rakshith-ravi - i'm hoping everything blocking stability for OTLP is done but i'll ping you if it's not - thanks for asking! in the meantime you might wanna check out the good first issue issues :)

scottgerring avatar Dec 05 '25 12:12 scottgerring