Corrupted data written to remote storage when sample_age_limit is hit
What did you do?
We've started experiencing data corruption since we started using remote_write.queue_config.sample_age_limit. We want to drop old samples after our disaster-recovery tests so that fresh data becomes visible as soon as possible (we do not care much about samples scraped during the test itself). After these tests (which are essentially the only time sample_age_limit applies, as we make sure not to hit it during normal operation), we have seen unexpected counter resets as well as new, unexpected time series.
What did you expect to see?
Hitting sample_age_limit should only drop samples older than the limit; no other data should be affected.
What did you see instead? Under which circumstances?
Our production setup is Prometheus with remote write directed towards a Mimir cluster. When the remote write endpoint is not accessible for longer than sample_age_limit, we see corrupted data ingested into remote storage. In both cases it is just 2-3 samples, appearing right after the remote write endpoint resumes operation and Prometheus drops the old data and starts ingesting again.
So far we have noticed two cases:
- Unexpected samples get appended to an existing time series. Their timestamps do not align with our scrape_interval, and their values do not fit the series either. This probably happens to any metric type, but we have only noticed it with counters, because the "fake" values are interpreted as a counter reset, causing huge spikes in our graphs.
- A new time series is created with just a few samples, again right after remote_write resumes ingesting data into remote storage, just like in the previous case. The cases we have noticed are clearly nonsense; it seems that labels of existing series get mixed together, as in this example from a test where Prometheus scraped only its own exposed metrics and node_exporter:
go_gc_duration_seconds_count{cluster="local-test", instance="localhost:9100", job="node-exporter", name="systemd-networkd.service", state="activating", type="notify-reload"}
We have been able to reproduce the above with a locally running Prometheus binary configured with remote_write towards our staging Mimir cluster. The outage was simulated with iptables -I OUTPUT -d __IP__ -j DROP.
We have also tried to reproduce it in a setup with two Prometheus instances, one serving as a remote_write receiver (see the attached docker-compose). In this scenario we have not seen any corrupted data in the receiving Prometheus, but it logs errors complaining about corrupted data (see the Logs section below).
System information
Linux 6.5.0-27-generic x86_64
Prometheus version
prometheus, version 2.50.1 (branch: HEAD, revision: 8c9b0285360a0b6288d76214a75ce3025bce4050)
build user: root@6213bb3ee580
build date: 20240226-11:36:26
go version: go1.21.7
platform: linux/amd64
tags: netgo,builtinassets,stringlabels
Prometheus configuration file
global:
scrape_interval: 30s
scrape_timeout: 10s
scrape_protocols:
- OpenMetricsText1.0.0
- OpenMetricsText0.0.1
- PrometheusText0.0.4
evaluation_interval: 30s
external_labels:
__replica__: prometheus-local-test-0
cluster: local-test
scrape_configs:
- job_name: prometheus-scif-monitoring
honor_timestamps: false
track_timestamps_staleness: false
scrape_interval: 30s
scrape_timeout: 10s
scrape_protocols:
- OpenMetricsText1.0.0
- OpenMetricsText0.0.1
- PrometheusText0.0.4
metrics_path: /metrics
scheme: http
enable_compression: false
follow_redirects: false
enable_http2: false
static_configs:
- targets:
- localhost:9090
- job_name: node-exporter
honor_timestamps: false
track_timestamps_staleness: false
scrape_interval: 5s
scrape_timeout: 1s
scrape_protocols:
- OpenMetricsText1.0.0
- OpenMetricsText0.0.1
- PrometheusText0.0.4
scheme: http
enable_compression: false
follow_redirects: false
enable_http2: false
metrics_path: /metrics
static_configs:
- targets:
- node-exporter:9100
remote_write:
- url: http://receive-prometheus:9090/api/v1/write
remote_timeout: 12s
follow_redirects: true
enable_http2: true
queue_config:
capacity: 100
max_shards: 3
min_shards: 12
batch_send_deadline: 5s
min_backoff: 30ms
max_backoff: 5s
sample_age_limit: 1m
metadata_config:
send: true
send_interval: 1m
max_samples_per_send: 100
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
prometheus | ts=2024-04-23T10:06:18.038Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://receive-prometheus:9090/api/v1/write\": context deadline exceeded"
prometheus | ts=2024-04-23T10:07:26.043Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://receive-prometheus:9090/api/v1/write\": context deadline exceeded"
prometheus | ts=2024-04-23T10:08:34.050Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://receive-prometheus:9090/api/v1/write\": context deadline exceeded"
prometheus | ts=2024-04-23T10:09:11.336Z caller=dedupe.go:112 component=remote level=info remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Remote storage resharding" from=12 to=3
receive-prometheus | ts=2024-04-23T10:09:11.998Z caller=write_handler.go:77 level=error component=web msg="Error appending remote write" err="label name \"job\" is not unique: invalid sample"
receive-prometheus | ts=2024-04-23T10:09:11.998Z caller=write_handler.go:77 level=error component=web msg="Error appending remote write" err="label name \"job\" is not unique: invalid sample"
prometheus | ts=2024-04-23T10:09:12.029Z caller=dedupe.go:112 component=remote level=warn remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"job\" is not unique: invalid sample"
receive-prometheus | ts=2024-04-23T10:09:12.035Z caller=write_handler.go:134 level=error component=web msg="Out of order sample from remote write" err="duplicate sample for timestamp" series="{__name__=\"go_gc_duration_seconds\", __replica__=\"prometheus-local-test-0\", cluster=\"local-test\", instance=\"node-exporter:9100\", job=\"node-exporter\"}" timestamp=1713866946725
receive-prometheus | ts=2024-04-23T10:09:12.035Z caller=write_handler.go:77 level=error component=web msg="Error appending remote write" err="label name \"job\" is not unique: invalid sample"
prometheus | ts=2024-04-23T10:09:12.035Z caller=dedupe.go:112 component=remote level=error remote_name=dabaca url=http://receive-prometheus:9090/api/v1/write msg="non-recoverable error" count=2000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: duplicate sample for timestamp"
Attached is the docker-compose file we use to reproduce the issue: docker-compose.yml
EDIT: Steps to reproduce:
- run the attached compose
- simulate an outage of the remote-write receiving endpoints:
  date; docker pause mimir receive-prometheus
- wait for at least sample_age_limit
- resume remote write:
  date; docker unpause mimir receive-prometheus
@marctc have you experienced similar issues? As the author of this feature, you might have the context 🙏
Attempt to fix this: https://github.com/prometheus/prometheus/pull/14078
Here is a PR which reproduces the bug: https://github.com/prometheus/prometheus/pull/14157
@cstyan The last version was cut just a few days ago. Do you think we can expect a fix/patch release with the fix, or will it not be released until the next minor (2.54.0)?