opentelemetry-collector
`obsreport`'s TracesDropped/MetricsDropped/LogsDropped are not actually used
Is your feature request related to a problem? Please describe.
I was asked in which circumstances the *_dropped_spans metric can be observed, and while searching for the answer I noticed there are no references in the code to obsreport/obsreport_processor.go's TracesDropped/MetricsDropped/LogsDropped functions (except in unit tests). They can be used to generate a processor's _dropped_spans/_dropped_metric_points/_dropped_log_records metrics respectively.
We use the "Refused" counterparts but not the "Dropped" ones. Perhaps we should remove those?
Also, TracesAccepted and TracesRefused refer to spans rather than traces and MetricsAccepted/MetricsRefused refer to metric points.
Describe the solution you'd like
- `SpansDropped`/`MetricsDropped`/`LogsDropped` removed
- `SpansAccepted`/`SpansRefused` in obsreport rather than `TracesAccepted`/`TracesRefused`
- (Perhaps) `MetricsDataPointsAccepted`/`MetricsDataPointsRefused` rather than `MetricsAccepted`/`MetricsRefused`
There used to be, we might have lost it. Seems related to #5038
As discussed on Mar 23 - we should keep them and fix inconsistencies in them. It would be good to understand where/when their usage changed.
Clarification on this would be excellent, as we do not observe these metrics ourselves and presume, per https://github.com/open-telemetry/opentelemetry-collector/issues/5038, that they will appear once such a situation happens.
I came to this issue after realizing a related problem exists. When the exporterhelper code has queuing enabled and encounters a full queue, it will drop the data and log a message about it. It returns nil to the obsreport logic that counts send failures, and consequently we are counting drops as successful sends.
Conceptually, however, it is an exporter, not a processor, and it crosses a logical barrier for exporterhelper to be counting drops. The only outcomes exporters can currently count are success and failure; dropping is done in processors.
I'm aware of ongoing work to migrate batchprocessor into exporterhelper, but note that batchprocessor never drops data.
Having recently deployed a trace sampler processor, this has become a serious matter. When a processor drops data, it can be bad or it can be good; it's not always a failure. When the sampler processor drops a span, it's a normal situation. When a memorylimiter processor drops a span, it's a failure. I am beginning to think that, given how non-functional the existing metrics are, we have a good opportunity to redesign them now.
Please turn your attention to https://github.com/open-telemetry/semantic-conventions/pull/184, which is a discussion of how OTel will count success/failure/drops consistently across the entire telemetry pipeline. One thing the collector brings to this (larger) discussion worth mentioning is the notion of metrics level. My current thinking is that we should view "drops because of queue overflow" as a detailed form of "send failed", and we should only be able to distinguish local-queue-failure from remote-timeout and other retryable errors via metrics when level=detailed is set.
This was the last change to reference the Traces* in a processor: https://github.com/open-telemetry/opentelemetry-collector/commit/ccee4eb1e9e234f39b2ff0fc33563bdc15e900d5#diff-e88817c5bceb5f48f1d38dfdc37163fd2a1745d496b1597bce17ef27e47fa183
I was confused, in the comment above, by the issue identified here: https://github.com/open-telemetry/opentelemetry-collector/pull/8674#pullrequestreview-1672721521. Exporters that use exporterhelper are counting enqueue_failed metrics.
Am I right in thinking that this is still relevant? Looking at https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/processorhelper/obsreport.go it seems *Accepted and *Refused methods are all used in the memory limiter (and only there) but I don't see usages of *Dropped. Although I can see places in contrib that refer to dropped telemetry in different ways (none using those functions).
Probably related to https://github.com/open-telemetry/opentelemetry-collector/issues/7460 as there may be places where telemetry is currently dropped (e.g. by the exporter after retries are exhausted) but it shouldn't.
Also related to the docs that mention these dropped metrics as a way to monitor data loss https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/monitoring.md#data-loss
If the collector aims to keep retrying (after https://github.com/open-telemetry/opentelemetry-collector/issues/7460), can we assume then that the metrics to monitor for data loss should be enqueue_failed (which is already recommended) and refused (although the client will retry, so it may not actually be data loss)? Should these methods still exist if telemetry is not meant to be dropped but rather "filtered" or unprocessable by certain processors?