Add reason dimension to exporter and receiver failure metrics

Open 0x006EA1E5 opened this issue 1 year ago • 10 comments
Description

Adds a reason attribute to the otelcol_exporter_send_failed_* and otelcol_receiver_refused_* metrics.

Link to tracking issue

Fixes #10157

Testing

TODO

Documentation

TODO

0x006EA1E5 avatar May 15 '24 13:05 0x006EA1E5

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: 0x006EA1E5 / name: Greg Eales (59f17fd23f84fe367f6e04a98aaa01a728e0eaa7)

In general, high-cardinality values such as generic "errors" are not well suited to metrics; they should usually be recorded as logs or span attributes instead.

bogdandrutu avatar May 15 '24 16:05 bogdandrutu

In general, high-cardinality values such as generic "errors" are not well suited to metrics; they should usually be recorded as logs or span attributes instead.

The suggestion is to use the gRPC status code, not the actual error text; see https://github.com/grpc/grpc-go/blob/master/codes/codes.go#L37

So the cardinality is at most 17.

A status code is commonly used as a metric dimension, for example in HTTP metrics.

In my experience with the collector, the number of distinct statuses actually seen in responses is much lower, so no time series will be generated for most of the possible values. The error is normally UNAVAILABLE, with far fewer occurrences of UNKNOWN, DEADLINE_EXCEEDED, and RESOURCE_EXHAUSTED, so I wouldn't expect overall cardinality to increase by much.

I'm also suggesting that we only add this dimension when the telemetry is configured as LevelDetailed, so users worried about cardinality have some control here.
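
A minimal sketch of the idea described above, for illustration only and not the code in this PR: the reason value is drawn from the fixed set of 17 canonical gRPC status code names, and it is attached only at the detailed telemetry level. The `Level` type, `reasonFor`, `failureAttrs`, and the `exporter` attribute key are hypothetical stand-ins written for this example; they are not the collector's actual types or functions.

```go
package main

import "fmt"

// Level is a hypothetical stand-in for the collector's telemetry level setting.
type Level int

const (
	LevelNone Level = iota
	LevelBasic
	LevelNormal
	LevelDetailed
)

// grpcCodeNames holds the canonical gRPC status code names, indexed by their
// numeric value (0 = OK through 16 = UNAUTHENTICATED).
var grpcCodeNames = [...]string{
	"OK", "CANCELLED", "UNKNOWN", "INVALID_ARGUMENT", "DEADLINE_EXCEEDED",
	"NOT_FOUND", "ALREADY_EXISTS", "PERMISSION_DENIED", "RESOURCE_EXHAUSTED",
	"FAILED_PRECONDITION", "ABORTED", "OUT_OF_RANGE", "UNIMPLEMENTED",
	"INTERNAL", "UNAVAILABLE", "DATA_LOSS", "UNAUTHENTICATED",
}

// reasonFor returns the status name for a numeric code. Any value outside the
// known range collapses to "UNKNOWN", so the label set stays bounded.
func reasonFor(code int) string {
	if code < 0 || code >= len(grpcCodeNames) {
		return "UNKNOWN"
	}
	return grpcCodeNames[code]
}

// failureAttrs builds the attributes for a failure metric data point. The
// "reason" attribute is added only when the level is LevelDetailed or higher.
func failureAttrs(exporter string, code int, level Level) map[string]string {
	attrs := map[string]string{"exporter": exporter}
	if level >= LevelDetailed {
		attrs["reason"] = reasonFor(code)
	}
	return attrs
}

func main() {
	// With detailed telemetry the reason is present; otherwise it is omitted.
	fmt.Println(failureAttrs("otlp", 14, LevelDetailed))
	fmt.Println(failureAttrs("otlp", 14, LevelNormal))
}
```

Because the label values come from a fixed table rather than from error text, the worst-case number of extra time series per metric is bounded by the table size, and users who stay below the detailed level see no change.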

0x006EA1E5 avatar May 16 '24 08:05 0x006EA1E5

The suggestion is to use the gRPC status code, not the actual error text; see https://github.com/grpc/grpc-go/blob/master/codes/codes.go#L37

Why not accept the code then?

bogdandrutu avatar May 20 '24 09:05 bogdandrutu

Why not accept the code then?

I don't understand. Do you mean use the numeric status code instead of the status code text?

0x006EA1E5 avatar May 20 '24 14:05 0x006EA1E5

I don't mind using either the numeric code or the equivalent name, although I think the name would be more informative and easier to read.

Also, since the exporter could be using either gRPC or HTTP (or potentially another protocol), a numeric gRPC status code may be confusing.
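
One way to keep the label readable across protocols, sketched here purely as an assumption, is to translate HTTP responses into the same status names. The helper `reasonFromHTTP` is hypothetical, and the mapping it uses follows gRPC's published HTTP-to-gRPC status mapping; this thread does not specify which mapping, if any, the PR applies.

```go
package main

import "fmt"

// reasonFromHTTP maps an HTTP status code to a gRPC-style status name so that
// HTTP and gRPC exporters can share one bounded set of reason values.
// Hypothetical helper: the table follows gRPC's documented HTTP-to-gRPC
// status mapping, which is an assumption made for this example.
func reasonFromHTTP(status int) string {
	switch status {
	case 400:
		return "INTERNAL"
	case 401:
		return "UNAUTHENTICATED"
	case 403:
		return "PERMISSION_DENIED"
	case 404:
		return "UNIMPLEMENTED"
	case 429, 502, 503, 504:
		return "UNAVAILABLE"
	default:
		// Every other status collapses to a single value to bound cardinality.
		return "UNKNOWN"
	}
}

func main() {
	for _, s := range []int{401, 429, 503, 500} {
		fmt.Printf("%d -> %s\n", s, reasonFromHTTP(s))
	}
}
```

With a translation like this, a dashboard can group failures by name regardless of whether the exporter speaks gRPC or HTTP, at the cost of losing the exact HTTP status.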

0x006EA1E5 avatar May 23 '24 16:05 0x006EA1E5

Is there anything I can do to progress this?

0x006EA1E5 avatar Jun 05 '24 11:06 0x006EA1E5

You can join a SIG meeting, we have one for the collector in 10 minutes. It runs weekly.

atoulme avatar Jun 05 '24 15:06 atoulme

There's an OTEP that discusses additional details around monitoring a telemetry pipeline: https://github.com/open-telemetry/oteps/pull/259. It might be worth taking a look there as well.

codeboten avatar Jun 05 '24 16:06 codeboten

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jun 21 '24 03:06 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Jul 06 '24 03:07 github-actions[bot]