Corrupted data written to remote storage when sample_age_limit is hit
What did you do?
We've started experiencing data corruption since we started using remote_write.queue_config.sample_age_limit. We want to drop old samples after our disaster-recovery tests so that fresh data becomes visible as soon as possible (we do not care much about samples scraped during the test itself). After these tests (which are essentially the only time sample_age_limit applies, as we make sure not to hit it during normal operation), we have seen unexpected counter resets as well as new, unexpected time series.
What did you expect to see?
Hitting sample_age_limit should only drop samples older than the limit; no other data should be affected.
What did you see instead? Under which circumstances?
Our production setup is Prometheus with remote write directed towards a Mimir cluster. When the remote write endpoint is not accessible for longer than sample_age_limit, we see corrupted data ingested into remote storage. In both cases it is just 2-3 samples, appearing right after the remote write endpoint resumes operation and Prometheus drops the old data and starts ingesting again.
So far we have noticed two cases:
- Unexpected samples get appended to an existing time series. Their timestamps do not align with our scrape_interval, and their values do not fit the series either. This probably happens to any metric type, but we have only noticed it with counters, because the "fake" values are interpreted as a counter reset, causing huge spikes in our graphs.
- A new time series is created with just a few samples, again right after remote_write resumes ingesting data into remote storage, just like in the previous case. The cases we have noticed are clearly nonsense; it seems that labels of existing series get mixed together, as in this example from a test where Prometheus scraped only its own exposed metrics and node_exporter:
go_gc_duration_seconds_count{cluster="local-test", instance="localhost:9100", job="node-exporter", name="systemd-networkd.service", state="activating", type="notify-reload"}
We have been able to reproduce the above with a locally running Prometheus binary configured with remote_write towards our staging Mimir cluster. The outage was simulated with iptables -I OUTPUT -d __IP__ -j DROP.
We have also tried to reproduce it in a setup with two Prometheus instances, one serving as a remote_write receiver (see the attached docker-compose). In this scenario we have not seen any corrupted data in the receiving Prometheus, but it logs errors complaining about corrupted data (see the Logs section below).
System information
Linux 6.5.0-27-generic x86_64
Prometheus version
prometheus, version 2.50.1 (branch: HEAD, revision: 8c9b0285360a0b6288d76214a75ce3025bce4050)
build user: root@6213bb3ee580
build date: 20240226-11:36:26
go version: go1.21.7
platform: linux/amd64
tags: netgo,builtinassets,stringlabels
Prometheus configuration file
global:
scrape_interval: 30s
scrape_timeout: 10s
scrape_protocols:
- OpenMetricsText1.0.0
- OpenMetricsText0.0.1
- PrometheusText0.0.4
evaluation_interval: 30s
external_labels:
__replica__: prometheus-local-test-0
cluster: local-test
scrape_configs:
- job_name: prometheus-scif-monitoring
honor_timestamps: false
track_timestamps_staleness: false
scrape_interval: 30s
scrape_timeout: 10s
scrape_protocols:
- OpenMetricsText1.0.0
- OpenMetricsText0.0.1
- PrometheusText0.0.4
metrics_path: /metrics
scheme: http
enable_compression: false
follow_redirects: false
enable_http2: false
static_configs:
- targets:
- localhost:9090
- job_name: node-exporter
honor_timestamps: false
track_timestamps_staleness: false
scrape_interval: 5s
scrape_timeout: 1s
scrape_protocols:
- OpenMetricsText1.0.0
- OpenMetricsText0.0.1
- PrometheusText0.0.4
scheme: http
enable_compression: false
follow_redirects: false
enable_http2: false
metrics_path: /metrics
static_configs:
- targets:
- node-exporter:9100
remote_write:
- url: http://receive-prometheus:9090/api/v1/write
remote_timeout: 12s
follow_redirects: true
enable_http2: true
queue_config:
capacity: 100
max_shards: 3
min_shards: 12
batch_send_deadline: 5s
min_backoff: 30ms
max_backoff: 5s
sample_age_limit: 1m
metadata_config:
send: true
send_interval: 1m
max_samples_per_send: 100
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
prometheus | ts=2024-04-23T10:06:18.038Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://receive-prometheus:9090/api/v1/write\": context deadline exceeded"
prometheus | ts=2024-04-23T10:07:26.043Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://receive-prometheus:9090/api/v1/write\": context deadline exceeded"
prometheus | ts=2024-04-23T10:08:34.050Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://receive-prometheus:9090/api/v1/write\": context deadline exceeded"
prometheus | ts=2024-04-23T10:09:11.336Z caller=dedupe.go:112 component=remote level=info remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Remote storage resharding" from=12 to=3
receive-prometheus | ts=2024-04-23T10:09:11.998Z caller=write_handler.go:77 level=error component=web msg="Error appending remote write" err="label name \"job\" is not unique: invalid sample"
receive-prometheus | ts=2024-04-23T10:09:11.998Z caller=write_handler.go:77 level=error component=web msg="Error appending remote write" err="label name \"job\" is not unique: invalid sample"
prometheus | ts=2024-04-23T10:09:12.029Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"job\" is not unique: invalid sample"
receive-prometheus | ts=2024-04-23T10:09:12.035Z caller=write_handler.go:134 level=error component=web msg="Out of order sample from remote write" err="duplicate sample for timestamp" series="{__name__=\"go_gc_duration_seconds\", __replica__=\"prometheus-local-test-0\", cluster=\"local-test\", instance=\"node-exporter:9100\", job=\"node-exporter\"}" timestamp=1713866946725
receive-prometheus | ts=2024-04-23T10:09:12.035Z caller=write_handler.go:77 level=error component=web msg="Error appending remote write" err="label name \"job\" is not unique: invalid sample"
prometheus | ts=2024-04-23T10:09:12.035Z caller=dedupe.go:112 component=remote level=error remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="non-recoverable error" count=2000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: duplicate sample for timestamp"
Attached is the docker-compose file we use to reproduce the issue: docker-compose.yml
EDIT: Steps to reproduce:
- run the attached compose
- simulate an outage of the remote-write receiving endpoints:
  date; docker pause mimir receive-prometheus
- wait for at least sample_age_limit
- resume remote write:
  date; docker unpause mimir receive-prometheus
@marctc have you experienced similar issues? As the author of this feature, you might have the context 🙏
Attempt to fix this: https://github.com/prometheus/prometheus/pull/14078
Here is a PR which reproduces the bug: https://github.com/prometheus/prometheus/pull/14157
@cstyan The last version was cut just a few days ago. Do you think we can expect a fix/patch release with the fix, or will it not be released until the next minor (2.54.0)?