
Processed/exported SDK metrics

Open carlosalberto opened this issue 2 years ago • 6 comments

Opening this issue mainly to get the ball rolling, as I have had users asking for metrics around processed/dropped/exported data (starting with traces, then following up with metrics/logs). I'd like to initially add the following metrics, with some inspiration taken from the current metrics in the Java SDK (a rough sketch of recording them follows below):

  • otel.exporter.exported, counter, with attributes:
    • success = true|false
    • type = span|metric|log
    • exporterType = <exporter type, e.g. GrpcSpanExporter>
  • otel.processor.processed, counter, with attributes:
    • dropped = true|false (buffer overflow)
    • type = span|metric|log
    • processorType = <processor type, e.g. BatchSpanProcessor>

Although this is mostly targeted at SDKs, the Collector could use this as well - in which case we may want to add a component or pipeline.component attribute (or similar), to signal whether this is an SDK or a Collector.
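For illustration only, here is a minimal sketch of what recording these proposed counters could look like through the OpenTelemetry Python metrics API. The metric and attribute names mirror the proposal above and are not an established convention; the helper functions and the meter name are hypothetical.

```python
from opentelemetry import metrics

# Names and attributes mirror the proposal above; they are not (yet) a semconv.
meter = metrics.get_meter("otel.sdk.self-monitoring")  # hypothetical meter name

exported_counter = meter.create_counter(
    "otel.exporter.exported",
    unit="1",
    description="Spans/metrics/logs the exporter attempted to export",
)
processed_counter = meter.create_counter(
    "otel.processor.processed",
    unit="1",
    description="Spans/metrics/logs handled by the processor",
)


def record_export_result(batch_size: int, success: bool) -> None:
    # Would be called by a (hypothetical) exporter after each export attempt.
    exported_counter.add(
        batch_size,
        attributes={
            "success": success,
            "type": "span",
            "exporterType": "GrpcSpanExporter",
        },
    )


def record_dropped(count: int) -> None:
    # Would be called by a (hypothetical) processor when its buffer overflows.
    processed_counter.add(
        count,
        attributes={
            "dropped": True,
            "type": "span",
            "processorType": "BatchSpanProcessor",
        },
    )
```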

carlosalberto avatar Jun 05 '23 15:06 carlosalberto

Do you intend to just introduce a semantic convention for this, or would this be added to the SDK specification (in https://github.com/open-telemetry/opentelemetry-specification) as well to ensure a consistent implementation? The relevant SDK spec parts are already stable, but this could be introduced as an optional feature.

arminru avatar Jun 06 '23 15:06 arminru

+1 on semconv, also this walks into the "namespaced attributes" debate.

jsuereth avatar Jun 06 '23 15:06 jsuereth

The relevant SDK spec parts are already stable, but this could be introduced as an optional feature.

I don't think that "stable" is that restrictive, but I think this would be best made optional anyway.

Oberon00 avatar Jun 06 '23 15:06 Oberon00

This is exceptionally useful. We added hooks to enable metrics capture in the Ruby SDK a couple of years ago: https://github.com/open-telemetry/opentelemetry-ruby/pull/510. The metrics we defined include:

  • otel.otlp_exporter.request_duration
  • otel.otlp_exporter.failure ("soft" failure - request will be retried)
  • otel.bsp.buffer_utilization (a snapshot of "fullness" of the BSP buffer)
  • otel.bsp.export.success
  • otel.bsp.export.failure (hard failure - request will not be retried)
  • otel.bsp.exported_spans
  • otel.bsp.dropped_spans

At Shopify, we find these metrics very useful for monitoring the health of our trace collection pipeline. We have added these metrics in various hacky ways to other language SDKs (e.g. Go). It would be great to standardize them across SDK implementations.
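For comparison, a hedged sketch of what equivalent instruments could look like through the OpenTelemetry Python metrics API. The instrument names are copied from the Ruby list above; nothing here is a standardized convention, and the recorded values are placeholders.

```python
from opentelemetry import metrics

# Instrument names taken from the Ruby SDK list above; not a semantic convention.
meter = metrics.get_meter("otel.bsp.self-monitoring")  # hypothetical meter name

request_duration = meter.create_histogram(
    "otel.otlp_exporter.request_duration", unit="ms"
)
export_success = meter.create_counter("otel.bsp.export.success", unit="1")
export_failure = meter.create_counter("otel.bsp.export.failure", unit="1")
exported_spans = meter.create_counter("otel.bsp.exported_spans", unit="1")
dropped_spans = meter.create_counter("otel.bsp.dropped_spans", unit="1")

# Example usage after a (hypothetical) batch export attempt:
request_duration.record(12.5)   # duration of the OTLP request in ms (placeholder)
export_success.add(1)           # one successful export call
exported_spans.add(512)         # spans contained in the exported batch (placeholder)
```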

fbogsany avatar Jun 06 '23 16:06 fbogsany

The Ruby SDK also reports the compressed and uncompressed sizes of each batch before export. We have found this to be a better indicator of load on our collection infrastructure than span volume alone, and we often feel its absence in other SDK implementations where we have not hacked it in.
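A minimal sketch of how such size metrics could be recorded, again via the Python metrics API. The metric names and the `record_batch_sizes` helper are made up for illustration and are not taken from any SDK.

```python
from opentelemetry import metrics

meter = metrics.get_meter("otel.exporter.self-monitoring")  # hypothetical meter name

# Hypothetical metric names, chosen for illustration only.
uncompressed_size = meter.create_histogram(
    "otel.otlp_exporter.batch_size.uncompressed", unit="By"
)
compressed_size = meter.create_histogram(
    "otel.otlp_exporter.batch_size.compressed", unit="By"
)


def record_batch_sizes(payload: bytes, compressed_payload: bytes) -> None:
    # Would be called just before the batch is sent over the wire.
    uncompressed_size.record(len(payload))
    compressed_size.record(len(compressed_payload))
```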

robertlaurin avatar Jun 06 '23 16:06 robertlaurin

It would be nice if the BSP exported the following metrics:

  • otel.bsp.queue.capacity - maximum size of the queue (Gauge)
  • otel.bsp.queue.size - number of items in the queue (Gauge)
  • otel.bsp.queue.max_batch_size - maximum size of a batch (Gauge)
  • otel.bsp.queue.timeout - timeout after which a batch is exported regardless of size (Gauge)
  • otel.bsp.queue.exports - with label reason=size|timeout (Counter)

That would make it easy to build dashboards and alerts that detect problematic applications, because the queue size can be compared against its capacity and it becomes visible what triggers exports most often: timeouts or size limits.
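A sketch of how a batch span processor could expose such queue metrics with observable gauges and a counter, following the names suggested above. The module-level stand-ins for the BSP internals (MAX_QUEUE_SIZE, span_queue, the on_export hook) are hypothetical.

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("otel.bsp.self-monitoring")  # hypothetical meter name

# Hypothetical stand-ins for the real BSP internals.
MAX_QUEUE_SIZE = 2048
span_queue: list = []  # the BSP's internal buffer


def _observe_queue_size(options: CallbackOptions):
    yield Observation(len(span_queue))


def _observe_queue_capacity(options: CallbackOptions):
    yield Observation(MAX_QUEUE_SIZE)


meter.create_observable_gauge(
    "otel.bsp.queue.size", callbacks=[_observe_queue_size]
)
meter.create_observable_gauge(
    "otel.bsp.queue.capacity", callbacks=[_observe_queue_capacity]
)

# Counter recording whether the batch-size limit or the timer triggered an export.
exports = meter.create_counter("otel.bsp.queue.exports", unit="1")


def on_export(triggered_by_timeout: bool) -> None:
    # Would be called by the (hypothetical) BSP each time it flushes a batch.
    exports.add(
        1, attributes={"reason": "timeout" if triggered_by_timeout else "size"}
    )
```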

tiithansen avatar Sep 05 '24 11:09 tiithansen

Hey!

We definitely need dropped spans as a metric. The Lightstep OpenTracing library supported this, and it is a critical missing feature for migrating our services from OpenTracing to OTel.

A bit of history: twice we had the issue that the tracing pipeline to the SaaS provider dropped all spans because we sent too many spans in too short a time frame and the system didn't recover. We only noticed 24h later, when the dashboard showed 0 rps.

Having no visibility into a critical system like the tracing infrastructure is not an option.

szuecs avatar Jun 27 '25 14:06 szuecs