Add 'reason' attribute to `otelcol_exporter_send_failed_*` metrics

Open • 0x006EA1E5 opened this issue 1 year ago • 1 comment

Is your feature request related to a problem? Please describe.

I am interested in monitoring data loss that occurs when exporting data from one instance of the Collector to another, specifically using the loadbalancingexporter.

At the moment I only see a coarse-grained metric that counts export failures but gives me no data on the cause. Was it a permanent or a retryable error? Was it a badly configured endpoint, or did the downstream receiver actively reject the data?

I can look into the logs to see information on specific failures, but this is tedious and harder to interpret than a metric.

Describe the solution you'd like

I propose that we add a `reason` dimension to the `otelcol_exporter_send_failed_*` metrics. This reason could be the gRPC status of the response (I understand that gRPC status is used as the internal representation of these kinds of problems).

Describe alternatives you've considered

It is possible to try to correlate export failure metrics with downstream receiver error metrics. We can also try to correlate with "known failure causes", such as memorylimiterprocessor errors, which could mean the upstream export failed.

We can also check the logs and even, depending on the system, extract metrics from these logs.

However, all of this is much more work.

Additional context

We could also consider adding a similar attribute to the `otelcol_receiver_refused_spans` metric.

I have had a look at the code, and it seems like a fairly small change in or around `exporter/exporterhelper/obsexporter.go`.

0x006EA1E5 • May 15 '24 13:05

/label area:exporter exporter/otlp exporter/otlphttp receiver/otlp area:receiver

0x006EA1E5 • May 15 '24 13:05