
[prometheus] "Error on ingesting out-of-order exemplars" message in logs

Open puckpuck opened this issue 1 year ago • 13 comments

Bug Report

The following error continues to show up in the Prometheus logs:

prometheus  | ts=2024-11-26T04:44:06.414Z caller=write_handler.go:279 level=warn component=web msg="Error on ingesting out-of-order exemplars" num_dropped=44

This error started happening after upgrading flagd to version 0.11.4 in this PR

puckpuck avatar Nov 26 '24 04:11 puckpuck

@beeme1mr This particular issue did not exist with prior versions of flagd.

I rolled back incrementally to 0.11.3 and the error stopped, so I suspect this is something very specific to the 0.11.3 -> 0.11.4 upgrade. The flagd release notes don't make it obvious why this error would occur, aside from perhaps a Prometheus client upgrade.

puckpuck avatar Nov 26 '24 04:11 puckpuck

We didn't make any telemetry-related changes in the last release but I'll look into it.

~~Are you sure this issue is caused by flagd? It's not clear from the attached logs.~~

Sorry, reread your comment.

beeme1mr avatar Nov 26 '24 14:11 beeme1mr

I can't get the demo app running locally due to an unrelated error.

 ⠋ Container otel-col          Creating                                                                                                                                                                      0.1s
Error response from daemon: invalid mount config: must use either propagation mode "rslave" or "rshared" when mount source is within the daemon root, daemon root: "/var/lib/docker", bind mount source: "/", propagation: "rprivate"
make: *** [Makefile:138: start] Error 1

Another user reported this already, but the fix has been reverted.

beeme1mr avatar Nov 26 '24 18:11 beeme1mr

Possibly related.

https://github.com/prometheus/prometheus/issues/13933

beeme1mr avatar Nov 26 '24 18:11 beeme1mr

> I can't get the demo app running locally due to an unrelated error.
>
>  ⠋ Container otel-col          Creating                                                                                                                                                                      0.1s
> Error response from daemon: invalid mount config: must use either propagation mode "rslave" or "rshared" when mount source is within the daemon root, daemon root: "/var/lib/docker", bind mount source: "/", propagation: "rprivate"
> make: *** [Makefile:138: start] Error 1
>
> Another user reported this already, but the fix has been reverted.

@beeme1mr the latest release doesn't have that anymore. Have you pulled the latest?

julianocosta89 avatar Nov 26 '24 20:11 julianocosta89

Yeah, I'm trying to run the latest version of the demo in Ubuntu on WSL.

beeme1mr avatar Nov 26 '24 21:11 beeme1mr

Ah, I see. Re-reading your message, it actually makes sense. We removed the rslave param because it is no longer required with the latest Docker version; it seemed to be an issue that affected only a single Docker version.

Multiple users reported that they couldn't run the demo with the rslave param.

Could you check whether updating Docker solves it for you? If not, you could edit the docker-compose.yaml file locally.
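
If it comes to editing the compose file, here is a minimal sketch of what re-adding bind propagation could look like. The service name, mount paths, and the rslave/rshared choice below are illustrative, not the demo's exact config; adjust them to match the mount that triggers the "invalid mount config" error in your checkout.

```yaml
services:
  otel-collector:
    volumes:
      # Hypothetical host-filesystem bind mount, shown in long syntax so the
      # propagation mode can be set explicitly.
      - type: bind
        source: /
        target: /hostfs
        read_only: true
        bind:
          propagation: rslave   # or rshared, per the daemon error message
```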

But ideally the demo would run in all setups.

julianocosta89 avatar Nov 26 '24 21:11 julianocosta89

I'm running the latest version available through apt-get. However, it's not the latest version according to the Docker release notes. I'll check again tomorrow.

beeme1mr avatar Nov 26 '24 21:11 beeme1mr

Looks like flagd 0.11.4 contained an update of flagd/core to 0.10.3, which itself contained a change from 1.28.0 to 1.30.0 of the opentelemetry-go monorepo. This definitely seems like a possible cause of the issue. I'm not sure everything that change includes, but it doesn't look like flagd's fault, since they're just using basic APIs and not doing anything overly fancy. Indeed, they're not doing anything specific to exemplars at all.

@open-telemetry/go-maintainers is there any chance there is a known issue which may have caused this?

dyladan avatar Dec 18 '24 13:12 dyladan

Exemplars were enabled by default in 1.31.0, so that wouldn't have changed between 1.28.0 and 1.30.0. But if the bump was actually to 1.31, that could explain it.
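
To illustrate why the SDK version matters more than anything flagd does: once exemplars are enabled (on by default from 1.31.0), any measurement recorded under a sampled span picks up an exemplar automatically, with no exemplar-specific code in the application. A rough sketch with illustrative names, assuming a recent opentelemetry-go SDK:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// A sampled span in the context is all the trace-based exemplar filter
	// (the default) needs to attach an exemplar to a measurement.
	tp := sdktrace.NewTracerProvider(sdktrace.WithSampler(sdktrace.AlwaysSample()))
	ctx, span := tp.Tracer("sketch").Start(ctx, "work")
	defer span.End()

	reader := metric.NewManualReader()
	mp := metric.NewMeterProvider(metric.WithReader(reader))
	counter, _ := mp.Meter("sketch").Int64Counter("requests")

	// No exemplar-specific API is used here; the SDK records the exemplar
	// (trace/span IDs plus the measurement timestamp) on its own.
	counter.Add(ctx, 1)

	var rm metricdata.ResourceMetrics
	_ = reader.Collect(ctx, &rm)
	sum := rm.ScopeMetrics[0].Metrics[0].Data.(metricdata.Sum[int64])
	for _, e := range sum.DataPoints[0].Exemplars {
		fmt.Println(e.Time, e.TraceID, e.Value)
	}
}
```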

From https://github.com/prometheus/prometheus/blob/4a6f8704efcabfe9ee0f74eab58d4c11579547be/tsdb/exemplar.go#L257:

> Since during the scrape the exemplars are sorted first by timestamp, then value, then labels, if any of these conditions are true, we know that the exemplar is either a duplicate of a previous one (but not the most recent one as that is checked above) or out of order.

So it sounds like this could be an out-of-order or duplicate exemplar issue.
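
A rough Go paraphrase of that rule (a simplified sketch, not the actual Prometheus code): within a series, each new exemplar has to compare strictly greater than the previous one by timestamp, then value, then labels; anything else is treated as a duplicate or out of order and dropped, which is what the num_dropped counter in the warning is reporting.

```go
package main

import "fmt"

// exemplar is a simplified stand-in for a Prometheus exemplar on one series.
type exemplar struct {
	labels string  // flattened label set, e.g. "trace_id=..."
	value  float64
	ts     int64 // timestamp in milliseconds
}

// acceptable reports whether next may be appended after prev under the
// (timestamp, value, labels) ordering described in the quoted comment.
func acceptable(prev, next exemplar) bool {
	if next.ts != prev.ts {
		return next.ts > prev.ts
	}
	if next.value != prev.value {
		return next.value > prev.value
	}
	// Same timestamp and value: labels must sort strictly later, otherwise
	// next is a duplicate of prev or out of order.
	return next.labels > prev.labels
}

func main() {
	prev := exemplar{labels: "trace_id=abc", value: 1, ts: 100}
	fmt.Println(acceptable(prev, exemplar{labels: "trace_id=def", value: 2, ts: 99}))  // false: older timestamp, dropped
	fmt.Println(acceptable(prev, exemplar{labels: "trace_id=abc", value: 1, ts: 100})) // false: duplicate, dropped
	fmt.Println(acceptable(prev, exemplar{labels: "trace_id=def", value: 2, ts: 101})) // true: in order
}
```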

Are we sending OTLP to Prometheus? Or are we exporting prometheus or PRW from the collector?

dashpole avatar Dec 18 '24 14:12 dashpole

> Are we sending OTLP to Prometheus? Or are we exporting prometheus or PRW from the collector?

We are sending OTLP to Prometheus.
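
For context, roughly how that is wired on the collector side (a sketch from memory; the exporter name, endpoint, and flags may differ from the demo's current config): the collector pushes metrics via the otlphttp exporter to Prometheus's native OTLP ingestion endpoint, which has to be enabled on the Prometheus side.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp/prometheus:
    # Prometheus's OTLP ingestion path; requires Prometheus to run with
    # --web.enable-otlp-receiver (or --enable-feature=otlp-write-receiver
    # on older 2.x releases).
    endpoint: http://prometheus:9090/api/v1/otlp

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp/prometheus]
```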

puckpuck avatar Dec 19 '24 03:12 puckpuck

Got it. So it is probably an issue with the implementation of exemplar translation in the OTLP receiver of the prometheus server. The exemplar validation code probably assumes things about exemplars that aren't correct for OTel exemplars.

dashpole avatar Dec 19 '24 19:12 dashpole

@puckpuck if you can add details of our setup to https://github.com/prometheus/prometheus/issues/13933, that would be helpful. Some hypotheses to check:

  • Is the issue triggered by counters with multiple exemplars (possibly with out-of-order timestamps?)
  • Is the issue triggered by histograms with exemplars that don't align with histogram bucket boundaries?
  • Is the issue triggered by exponential histograms with exemplars that aren't in timestamp order?

dashpole avatar Dec 19 '24 19:12 dashpole