Prometheus counter decreases by 1 for some time series data
What did you do?
I recently noticed a huge spike in one of our metrics.
If you look at the highlighted value 7756564 at epoch 1712719298.819, the next entry is 1 less than the previous one. This is the reason for the spike in the rate/increase functions.
There was no restart of Prometheus or the target in this case. What could contribute to this dip in value?
Below is a graph of the data for 2 weeks.
Here is a screenshot of the spike.
What did you expect to see?
I would not expect to see a decrease in the counter.
What did you see instead? Under which circumstances?
We are running an HA setup of Prometheus (2 StatefulSets) with Thanos.
System information
Linux 5.10.192-183.736.amzn2.x86_64 x86_64
Prometheus version
prometheus, version 2.45.0 (branch: HEAD, revision: 8ef767e396bf8445f009f945b0162fd71827f445)
build user: root@920118f645b7
build date: 20230623-15:09:49
go version: go1.20.5
platform: linux/amd64
tags: netgo,builtinassets,stringlabels
Prometheus configuration file
No response
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
I have enabled debug logs on Prometheus now; I will update the thread if I see something.
This is unlikely to be a bug in Prometheus; it is most likely a problem on your end. If you look at the timestamps you’ll notice that they are duplicated: there are always two samples ~20ms apart from each other. You might be scraping the same target twice, or two different targets end up with identical time series. When everything works smoothly you won’t notice any problems, but if there’s a delay with either of these scrapes it can result in data like you see above, mostly because the timestamp of each sample is the start time of the scrape request. If one scrape starts and gets delayed on DNS or the connect attempt, but the other one is fast, then the slow scrape might end up with a lower timestamp but a higher value.
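One way to confirm this is to evaluate a plain range selector as an instant query (Table view in the expression browser) on each Prometheus replica directly, not through Thanos; duplicate scrapes then show up as pairs of raw samples a few milliseconds apart. A rough sketch (the job and instance matchers below are placeholders; adjust the selector to the affected series):

    # Raw samples and their timestamps for the affected series over the last 5 minutes.
    # Run this on each Prometheus replica directly.
    request_count_total{job="kubernetes-services-pods", instance="172.26.19.57"}[5m]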
@prymitive Is there a way to handle this? I need 2 StatefulSets of Prometheus. Thanos does take care of deduplication, but this delay might be difficult to manage, right?
Handle what exactly? In Prometheus you’re supposed to have unique labels on all time series. Automatic injection of job and instance labels usually ensures this. So first you need to understand why you have two scrapes that result in the same time series.
@prymitive I have different labels for the metrics. Since we have two Prometheus instances, they both scrape data at a 60s interval based on when each StatefulSet starts. Thanos takes care of deduplication in these situations. Now, if I understood your point correctly, there are a "few" moments in time where the scrape times of the counter are slightly off, and when the data is aggregated and queried on Thanos I get the issue. Or do I have the wrong understanding?
Example: these are the labels for my metric
request_count_total{app="test", exported_id="test-594b9d94fc-kgdcg", exported_service="test", id="test-594b9d94fc-kgdcg", instance="172.26.19.57", job="kubernetes-services-pods", name="test", namespace="test", pod_template_hash="594b9d94fc", prometheus="monitoring/prometheus-stack-kube-prom-prometheus", service="test", system="INTERNET"}
On prom-0, I have this value:
On prom-1, I have this value:
On the Thanos Querier:
The scrape duration on these endpoints is less than 0.1s as well.
If you use Thanos and that’s where you see this problem, then maybe Thanos is merging two counters from two different Prometheus servers into a single time series? Try your query on both Prometheus servers directly; if that works, then you need to add some unique external labels on each Prometheus.
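For example, a rough check (reusing the metric name from this thread): resets() counts how often a counter decreased inside the window, so compare the result on prom-0 and prom-1 directly against the result from Thanos Query with deduplication enabled. A non-zero value only on the Thanos side would point at the two replicas being merged into one series.

    # Number of times the counter dropped within the last day; expected to be 0 for a healthy counter.
    resets(request_count_total{job="kubernetes-services-pods"}[1d])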
@prymitive I am already adding a global external label promethues_replica: $(POD_NAME) in the Prometheus config, which is then used in Thanos queries for de-duplication via --query.replica-label=prometheus_replica.
Indeed, the two samples ~20ms apart come from two different Prometheus servers. It looks like a configuration issue on the Thanos side.
In your last comment you have a typo: promethues_replica. Is it like that in your config too?
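If so, that mismatch alone would break deduplication, because the external label name on each replica and the value of --query.replica-label must match exactly. A minimal sketch of a consistent setup, assuming prometheus_replica is the intended spelling and that $(POD_NAME) is substituted per StatefulSet pod by your deployment tooling:

    # prometheus.yml on each replica
    global:
      external_labels:
        prometheus_replica: $(POD_NAME)   # unique per replica, e.g. prometheus-0 / prometheus-1

with Thanos Query started with --query.replica-label=prometheus_replica, so the two replicas' series are deduplicated rather than interleaved into one apparent series.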