opentelemetry-collector
Add reason dimension to exporter and receiver failure metrics
Description
Adds a reason attribute to the otelcol_exporter_send_failed_* and otelcol_receiver_refused_* metrics.
Link to tracking issue
Fixes #10157
Testing
TODO
Documentation
TODO
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: 0x006EA1E5 / name: Greg Eales (59f17fd23f84fe367f6e04a98aaa01a728e0eaa7)
In general, high-cardinality values such as generic "errors" are not well suited to metrics; they should usually be recorded in logs or as span attributes.
The suggestion is to use the gRPC status code, not the actual error text, i.e., https://github.com/grpc/grpc-go/blob/master/codes/codes.go#L37
So the cardinality is at most 17.
A status code is commonly used as a metric dimension, for example in HTTP metrics.
In my experience with the collector, the number of distinct statuses actually seen in responses is typically much lower, so no time series are generated for most of the possible values. The error is normally UNAVAILABLE, with far fewer occurrences of UNKNOWN, DEADLINE_EXCEEDED, and RESOURCE_EXHAUSTED, so I wouldn't expect overall cardinality to increase much.
I'm also suggesting that we only add this dimension when the telemetry level is configured as LevelDetailed, so users worried about cardinality have some control here.
Why not accept the code then?
I don't understand. Do you mean using the numeric status code instead of the status code name?
I don't mind using either the numeric code or the equivalent name, although I think the name would be more informative and easier to read.
And since the exporter could be using gRPC, HTTP, or potentially another protocol, a gRPC status code number may be confusing.
Is there anything I can do to progress this?
You can join a SIG meeting, we have one for the collector in 10 minutes. It runs weekly.
There's an OTEP that discusses additional details around monitoring a telemetry pipeline: https://github.com/open-telemetry/oteps/pull/259. It might be worth taking a look there as well.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.