Rejecting valid exemplars with err-mimir-exemplar-labels-missing

Open gouthamve opened this issue 1 year ago • 3 comments

Describe the bug

Seeing errors like:

ts=2024-04-14T14:27:50.976027765Z caller=push.go:171 level=error user=12690 msg="push error" err="received an exemplar with no valid labels, timestamp: 0 series: blackbox_module_unknown_total{job=\"unknown_service:blackbox_exporter\"} labels: {} (err-mimir-exemplar-labels-missing)"

From: https://github.com/open-telemetry/opentelemetry-go-contrib/issues/5383#issuecomment-2055549443

According to the OpenMetrics spec it's ok to have Exemplars without labels:

Exemplars without Labels MUST represent an empty LabelSet as {}.

There's even an example in the spec:

foo_bucket{le="0.1"} 8 # {} 0.054

Why does Mimir reject this?
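To make the spec example concrete, here is a minimal sketch (Python, for illustration only) that splits such a line into its sample and exemplar parts. The helper `parse_exemplar` and its regular expressions are hypothetical; they are not part of Mimir or of any OpenMetrics library, and they handle only the simple shape shown in the example (no escaped quotes or braces inside label values).

```python
import re

# Hypothetical helper for illustration; handles only the simple shape
#   <sample> # {<labels>} <value> [<timestamp>]
EXEMPLAR_RE = re.compile(
    r'^(?P<sample>.*?)\s+#\s+\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+(?P<ts>\S+))?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exemplar(line):
    """Return the exemplar attached to an exposition line, or None if absent."""
    m = EXEMPLAR_RE.match(line)
    if m is None:
        return None
    ts = m.group("ts")
    return {
        "sample": m.group("sample"),
        # An empty LabelSet "{}" yields an empty dict, which is still an exemplar.
        "labels": dict(LABEL_RE.findall(m.group("labels"))),
        "value": float(m.group("value")),
        "timestamp": float(ts) if ts is not None else None,
    }


exemplar = parse_exemplar('foo_bucket{le="0.1"} 8 # {} 0.054')
```

Under these assumptions the exemplar comes back with an empty `labels` dict and a value of 0.054: it carries an observed value even though it has no labels.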

To Reproduce

Send an exemplar with no labels to Grafana Cloud.

Expected behavior

Return a 200 OK, not a 400 error.

gouthamve avatar Apr 15 '24 08:04 gouthamve

The Prometheus TSDB code drops exemplars with no labels, which is why Mimir rejects them on ingest. It wasn't a problem in practice until now because Prometheus never sends exemplars with no labels.

56quarters avatar Apr 16 '24 13:04 56quarters

Discussion here when some of the logic to reject empty exemplars was added: https://github.com/grafana/mimir/pull/873

56quarters avatar Apr 16 '24 14:04 56quarters

@gouthamve since this behaviour is by design in Prometheus, can the issue be closed? Or do you think it should be reconsidered in Prom?

aknuds1 avatar Apr 18 '24 12:04 aknuds1

I think it needs to be reconsidered in Prometheus. We should at least not drop them with 400s, because OTel seems to produce a ton of these empty exemplars.

When that happens, the customers' alerts start firing because their writes are failing, and the logs are full of the err-mimir-exemplar-labels-missing error message. This is a sub-par experience for customers.

gouthamve avatar May 20 '24 14:05 gouthamve

I'm curious, what's the use case for an exemplar without labels?

colega avatar May 20 '24 14:05 colega

From an OTel and specification perspective, it's "example values". For this histogram, these are the example values we recorded. @RichiH / @fstab can you chime in further?

gouthamve avatar May 20 '24 14:05 gouthamve

:+1: yes, I think it's just example values.

As it's defined in the OpenMetrics spec, and in the OpenTelemetry spec, we should support this. It's maybe ok to ignore the Exemplar itself, but dropping the entire time series is certainly not spec compliant.
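The distinction between ignoring an exemplar and dropping the whole series can be sketched as follows. This is an illustrative Python sketch, not Mimir's actual ingest path (Mimir is written in Go); `ingest`, `is_acceptable`, and the dict layout are all hypothetical names chosen for the example.

```python
def is_acceptable(exemplar):
    # Assumption for illustration: an empty label set {} is fine, as the
    # OpenMetrics spec allows; only an exemplar without a value is unusable.
    return exemplar.get("value") is not None


def ingest(writes):
    """Return (stored_samples, stored_exemplars, dropped_exemplar_count)."""
    samples, exemplars, dropped = [], [], 0
    for w in writes:
        # The sample is always kept; a bad exemplar never fails the write.
        samples.append((w["series"], w["value"]))
        for ex in w.get("exemplars", []):
            if is_acceptable(ex):
                exemplars.append((w["series"], ex))
            else:
                dropped += 1
    return samples, exemplars, dropped


writes = [
    {"series": "foo_bucket", "value": 8, "exemplars": [{"labels": {}, "value": 0.054}]},
    {"series": "bar_total", "value": 3, "exemplars": [{"labels": {}, "value": None}]},
]
samples, exemplars, dropped = ingest(writes)
```

In this sketch both samples are stored, the label-less exemplar is kept, and only the exemplar with no value is counted as dropped.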

fstab avatar May 21 '24 07:05 fstab

I think that dropping the entire time series is a critical bug that deserves a separate issue.

I also think that we should not reject the empty exemplars just because Prometheus doesn't send those.

colega avatar May 21 '24 07:05 colega

I'm curious, what's the use case for an exemplar without labels?

When visualized, the values give a better idea of distribution. Especially when buckets are coarse, exemplars can suggest you have a particular modal value, or bimodal distribution, etc.

bboreham avatar Jun 04 '24 10:06 bboreham

Thank you, @bboreham, I figured that out myself :D

colega avatar Jun 04 '24 10:06 colega

I created a feature request upstream https://github.com/prometheus/prometheus/issues/14208

bboreham avatar Jun 04 '24 10:06 bboreham

The Prometheus TSDB code drops exemplars with no labels

I believe this was a misunderstanding. Nick's source is me at #873, where I said it would be confusing. I was originally proposing that Mimir should reject an exemplar with labels like {trace_id=""}, which will be stored as {}.
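The collapse described above follows from the Prometheus convention that a label with an empty value is equivalent to the label being absent. A minimal Python sketch of that normalization, where `normalize_labels` is a hypothetical name and not a function from Prometheus or Mimir:

```python
def normalize_labels(labels):
    # A label whose value is the empty string is treated as not present,
    # so it is removed before the label set is stored.
    return {name: value for name, value in labels.items() if value != ""}


stored = normalize_labels({"trace_id": ""})
```

Here `stored` is an empty dict, so an exemplar sent with `{trace_id=""}` ends up indistinguishable from one sent with `{}`.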

bboreham avatar Jun 05 '24 16:06 bboreham

Closed in #8224

gouthamve avatar Jun 26 '24 13:06 gouthamve