[prometheus] "Error on ingesting out-of-order exemplars" message in logs
Bug Report
The following error continues to show up in the Prometheus logs:
prometheus | ts=2024-11-26T04:44:06.414Z caller=write_handler.go:279 level=warn component=web msg="Error on ingesting out-of-order exemplars" num_dropped=44
This error started happening after upgrading flagd to version 0.11.4 in this PR
@beeme1mr This issue in particular did not exist with prior versions of flagd.
I made an incremental change back to 0.11.3 and saw this error stop happening, so I suspect this is something very specific to the 0.11.3 -> 0.11.4 upgrade. The flagd release notes are not very clear about why this error would occur, outside of a Prometheus client upgrade.
We didn't make any telemetry-related changes in the last release but I'll look into it.
~~Are you sure this issue is caused by flagd? It's not clear from the attached logs.~~
Sorry, reread your comment.
I can't get the demo app running locally due to an unrelated error.
⠋ Container otel-col Creating 0.1s
Error response from daemon: invalid mount config: must use either propagation mode "rslave" or "rshared" when mount source is within the daemon root, daemon root: "/var/lib/docker", bind mount source: "/", propagation: "rprivate"
make: *** [Makefile:138: start] Error 1
Another user reported this already, but the fix has been reverted.
Possibly related.
https://github.com/prometheus/prometheus/issues/13933
@beeme1mr the latest release doesn't have that anymore. Have you pulled the latest?
Yeah, I'm trying to run the latest version of the demo in Ubuntu on WSL.
Ah, I see. Re-reading your message, it actually makes sense. We removed the rslave param because it is no longer required in the latest Docker version; it seemed to be an issue that affected only a single Docker version.
Multiple users reported that they were facing issues running with the rslave param.
Could you check whether updating Docker solves it for you? If not, you could edit the docker-compose.yaml file locally.
But ideally the demo would run in all setups.
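For reference, a minimal sketch of that local edit, assuming the failing bind mount is the collector's host-filesystem mount. The service name and mount target below are illustrative rather than the demo's exact configuration; the mount source "/" comes from the error message above.

```yaml
services:
  otel-col:
    volumes:
      # Illustrative only: re-adding rslave propagation on the host bind
      # mount satisfies Docker's "must use rslave or rshared" requirement
      # when the mount source is under the daemon root.
      - /:/hostfs:ro,rslave
```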
I'm running the latest version available through apt-get. However, it's not the latest version according to the Docker release notes. I'll check again tomorrow.
Looks like flagd 0.11.4 contained an update of flagd/core to 0.10.3, which itself contained a change from 1.28.0 to 1.30.0 of the opentelemetry-go monorepo. This definitely seems like a possible cause of the issue. I'm not sure what all is in that change, but it doesn't look like it is flagd's fault, since they're just using basic APIs and not doing anything overly fancy. Indeed, they're not doing anything specific to exemplars at all.
@open-telemetry/go-maintainers is there any chance there is a known issue which may have caused this?
Exemplars were enabled by default in 1.31.0, so that wouldn't have changed between 1.28.0 and 1.30.0. But if it was 1.31, that would possibly explain it.
From https://github.com/prometheus/prometheus/blob/4a6f8704efcabfe9ee0f74eab58d4c11579547be/tsdb/exemplar.go#L257:
Since during the scrape the exemplars are sorted first by timestamp, then value, then labels, if any of these conditions are true, we know that the exemplar is either a duplicate of a previous one (but not the most recent one as that is checked above) or out of order.
So sounds like this could be out of order or a duplicate issue.
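To make that condition concrete, here is a minimal Go sketch of the comparison the quoted comment describes. It is a paraphrase for illustration, not the actual Prometheus implementation; the exemplarKey type and rejected function are made up for this sketch.

```go
package main

import "fmt"

// exemplarKey is a simplified stand-in for the fields the ordering check
// cares about: timestamp, value, and a canonical label string.
type exemplarKey struct {
	ts     int64  // timestamp in milliseconds
	value  float64
	labels string // e.g. `{trace_id="abc123"}`
}

// rejected paraphrases the condition described above: exemplars appended to a
// series must be strictly increasing by (timestamp, value, labels), so
// anything that compares less than or equal to the most recently appended
// exemplar is treated as out-of-order or a duplicate and dropped.
func rejected(prev, next exemplarKey) bool {
	if next.ts != prev.ts {
		return next.ts < prev.ts
	}
	if next.value != prev.value {
		return next.value < prev.value
	}
	return next.labels <= prev.labels
}

func main() {
	last := exemplarKey{ts: 1000, value: 2, labels: `{trace_id="a"}`}
	// An identical (timestamp, value, labels) triple counts as a duplicate,
	// not just an older timestamp.
	fmt.Println(rejected(last, exemplarKey{ts: 1000, value: 2, labels: `{trace_id="a"}`})) // true
	fmt.Println(rejected(last, exemplarKey{ts: 1001, value: 1, labels: `{trace_id="b"}`})) // false
}
```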
Are we sending OTLP to Prometheus? Or are we exporting prometheus or PRW from the collector?
we are sending OTLP to Prometheus
Got it. So it is probably an issue with the implementation of exemplar translation in the OTLP receiver of the prometheus server. The exemplar validation code probably assumes things about exemplars that aren't correct for OTel exemplars.
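For context, "sending OTLP to Prometheus" here means roughly the wiring below. This is a hedged sketch rather than the demo's exact configuration (service names and the endpoint are illustrative), and it assumes Prometheus has its native OTLP ingestion endpoint enabled (the --enable-feature=otlp-write-receiver flag on Prometheus 2.x).

```yaml
# Collector config fragment (illustrative): metrics arrive over OTLP and are
# forwarded over OTLP/HTTP straight to Prometheus' native OTLP endpoint,
# bypassing the prometheus and prometheusremotewrite exporters entirely.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp/prometheus:
    # The exporter appends /v1/metrics to this base path.
    endpoint: http://prometheus:9090/api/v1/otlp

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp/prometheus]
```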
@puckpuck if you can add details of our setup to https://github.com/prometheus/prometheus/issues/13933, that would be helpful. Some hypotheses to check:
- Is the issue triggered by counters with multiple exemplars (possibly with out-of-order timestamps)? A minimal repro sketch for this case follows the list.
- Is the issue triggered by histograms with exemplars that don't align with histogram bucket boundaries?
- Is the issue triggered by exponential histograms with exemplars that aren't in timestamped order?
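To help check the first hypothesis, here is a minimal, self-contained Go sketch that records a counter from many sampled spans so a single exported data point can carry multiple exemplars. The instrument names and timings are illustrative, and it assumes an OTLP endpoint reachable via the standard OTEL_EXPORTER_OTLP_* environment variables and an opentelemetry-go SDK version in which exemplars are enabled by default.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP/HTTP metrics exporter; the target comes from the usual
	// OTEL_EXPORTER_OTLP_* environment variables (default localhost:4318).
	exp, err := otlpmetrichttp.New(ctx)
	if err != nil {
		log.Fatal(err)
	}
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp,
			sdkmetric.WithInterval(5*time.Second))),
	)
	defer mp.Shutdown(ctx)

	// A tracer provider so measurements happen inside sampled spans; with the
	// default trace-based exemplar filter, only those measurements get
	// exemplars attached.
	tp := sdktrace.NewTracerProvider()
	defer tp.Shutdown(ctx)
	tracer := tp.Tracer("exemplar-repro")

	counter, err := mp.Meter("exemplar-repro").Int64Counter("repro.requests")
	if err != nil {
		log.Fatal(err)
	}

	// Increment the counter from many short-lived spans so a single exported
	// data point can carry several exemplars with closely spaced timestamps.
	for i := 0; i < 50; i++ {
		spanCtx, span := tracer.Start(ctx, "work")
		counter.Add(spanCtx, 1)
		span.End()
		time.Sleep(200 * time.Millisecond)
	}
}
```

If the "Error on ingesting out-of-order exemplars" warning appears in the Prometheus logs while this runs, that would point at the counter/multiple-exemplar case; swapping the counter for a histogram instrument could exercise the other two hypotheses.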