
Prometheus counter decreases by 1 for some time series data

ashishvaishno opened this issue 10 months ago • 7 comments

What did you do?

Lately I have noticed a huge spike in one of our metrics.

If you look at the highlighted value 7756564 at epoch 1712719298.819, the new entry is 1 less than the previous one; this is the reason for the spike in the rate/increase functions. There was no restart of Prometheus or of the target in this case. What can contribute to this dip in value? [Screenshot 2024-04-17 at 10 24 46]

Below is a graph of the data for 2 weeks. [Screenshot 2024-04-17 at 10 28 45]

Here is the screenshot of the spike: [Screenshot 2024-04-17 at 10 32 10]
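
For context on why such a small dip shows up so dramatically: rate() and increase() treat any decrease in a counter as a counter reset, assume the counter restarted from zero, and therefore add the full post-decrease value (here roughly 7.7 million) to the computed increase. The exact query isn't given in this report; a typical rate query over this counter, using the metric and label names that appear later in the thread, might look like:

  rate(request_count_total{job="kubernetes-services-pods"}[5m])

A single sample that is 1 lower than its predecessor anywhere inside the selected window is enough to produce a spike like the one above.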

What did you expect to see?

I would not expect to see a decrease in a counter.

What did you see instead? Under which circumstances?

We are running an HA setup of Prometheus (2 StatefulSets) with Thanos.

System information

Linux 5.10.192-183.736.amzn2.x86_64 x86_64

Prometheus version

prometheus, version 2.45.0 (branch: HEAD, revision: 8ef767e396bf8445f009f945b0162fd71827f445)
  build user:       root@920118f645b7
  build date:       20230623-15:09:49
  go version:       go1.20.5
  platform:         linux/amd64
  tags:             netgo,builtinassets,stringlabels

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

I have enabled debug logs on Prometheus now; I will update the thread if I see something.

ashishvaishno • Apr 18 '24 12:04

This is unlikely to be a bug in Prometheus, but most likely a problem on your end. If you look at the timestamps you'll notice that they are duplicated: there are always two samples ~20 ms apart from each other. You might be scraping the same target twice, or two different targets end up with identical time series. When everything works smoothly you won't notice any problems, but if there's a delay with either of these scrapes then it might result in data like you see above, mostly because the timestamp of each sample is the beginning of the scrape request. If one scrape starts and gets delayed on a DNS lookup or connect attempt, but the other one is fast, then the slow scrape might end up with a lower timestamp but a higher value.
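
One way to see the duplicated timestamps is to query the raw samples on each Prometheus server directly (not through Thanos) with a range selector; the metric and label names below are taken from the later comments and the window is arbitrary:

  request_count_total{job="kubernetes-services-pods", instance="172.26.19.57"}[5m]

In the table view of the expression browser this returns every stored sample together with its timestamp, so two samples roughly 20 ms apart, or an out-of-order value, are easy to spot.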

prymitive • Apr 18 '24 12:04

@prymitive Is there a way to handle this, since I need to run 2 StatefulSets of Prometheus? Thanos does take care of deduplication, but this delay might be difficult to manage, right?

ashishvaishno • Apr 19 '24 07:04

Handle what exactly? In Prometheus you’re supposed to have unique labels on all time series. Automatic injection of job and instance labels usually ensures this. So first you need to understand why you have two scrapes that result in the same time series.
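
One rough way to check whether a single Prometheus server is scraping the same endpoint under more than one discovered target is to count the up series per instance; the job name here is only an example based on this thread:

  count by (instance) (up{job="kubernetes-services-pods"}) > 1

Any result from this query means two targets share an instance address, and their series can collide once the remaining labels match. The /targets page of each server shows the same information.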

prymitive • Apr 19 '24 07:04

@prymitive I have different labels for the metrics. Since we have two Prometheus instances set up, they both scrape data at a 60s interval, offset by when each StatefulSet started. Thanos takes care of deduplication in these situations. Now, if I understood your point correctly, there are a "few" moments in time when the scrape times of the counter are slightly off, and when the data is aggregated and queried in Thanos I get the issue. Or do I have a wrong understanding?

Example: these are the labels for my metric:

request_count_total{app="test", exported_id="test-594b9d94fc-kgdcg", exported_service="test", id="test-594b9d94fc-kgdcg", instance="172.26.19.57", job="kubernetes-services-pods", name="test", namespace="test", pod_template_hash="594b9d94fc", prometheus="monitoring/prometheus-stack-kube-prom-prometheus", service="test", system="INTERNET"}

On prom-0, I have this value: [Screenshot 2024-04-19 at 10 46 56]
On prom-1, I have this value: [Screenshot 2024-04-19 at 10 47 20]

On the Thanos Querier:

[Screenshot 2024-04-19 at 10 46 34]

The scrape duration on these endpoints is less than 0.1 s as well.

ashishvaishno • Apr 19 '24 08:04

If you use Thanos and that's where you see this problem, then maybe Thanos is merging two counters from two different Prometheus servers into a single time series? Try your query on both Prometheus servers directly; if that works, then you need to add some unique external labels to each Prometheus.
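
A minimal sketch of such a unique external label in each replica's Prometheus configuration; the label name and value are illustrative, and the value must differ per replica:

  global:
    external_labels:
      prometheus_replica: prometheus-0   # e.g. the pod name of this replica

Thanos Query can then be told to treat that label as a replica label so it deduplicates the two replicas instead of merging their counters into one series.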

prymitive • Apr 19 '24 09:04

@prymitive I am already adding global external labels as promethues_replica: $(POD_NAME) in prometheus config, which is then used in thanos queries for de-duplication as --query.replica-label=prometheus_replica.

ashishvaishno • Apr 19 '24 10:04

Indeed, the two samples ~20 ms apart come from two different Prometheus servers. It looks like a configuration issue on the Thanos side.

In your last comment you have a typo: promethues_replica. Is it like that in your config too?
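
For the deduplication to work, the label name has to be spelled identically in the Prometheus external_labels and in the Thanos Query flag. A minimal sketch, reusing the names from the earlier comment and assuming something in the deployment (for example the Prometheus Operator) expands $(POD_NAME):

  # prometheus.yml on each replica, label name spelled consistently
  global:
    external_labels:
      prometheus_replica: $(POD_NAME)

  # Thanos Query flag, same label name
  --query.replica-label=prometheus_replica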

roidelapluie • Apr 23 '24 08:04