query: invalid rate/irate with deduplication for most recent values
Thanos, Prometheus and Golang version used: bitnami/thanos:0.13.0, prom/prometheus:v2.19.0
Prometheus in HA configuration, 2 instances. Single Thanos querier instance.
Object Storage Provider: On-premise MinIO deployment
What happened: Executing rate or irate with deduplication sometimes results in the most recent value being invalid: it either shoots up to a very high value or drops to a very low one. Turning off deduplication produces correct results from both Prometheus instances.
With deduplication, good result (screenshot).

With deduplication, bad result: executing the same query eventually gives an incorrect result like this (screenshot).

The same query without deduplication always produces a correct result (screenshot).
What you expected to happen:
The query with deduplication should always give a correct result like this (screenshot).
How to reproduce it (as minimally and precisely as possible): Prometheus in HA, any rate or irate query.
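For example, any counter that both replicas scrape will do; a query like the following (the metric name is just an illustration) intermittently shows the bad last point when deduplication is on:

```promql
# Most recent point intermittently spikes with dedup enabled; with dedup
# disabled the result is correct on both replicas.
rate(prometheus_http_requests_total[5m])
```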
Full logs to relevant components: Nothing that would correlate with queries.
Thanks for reporting. We thought we had found all those kinds of issues, but there might be more. The last one was supposed to be fixed in 0.13.0. Can you double-check that you run Thanos with the fix from https://github.com/thanos-io/thanos/issues/2401 included? (:
Note that it's the Querier version that matters for deduplication.
Here's the Querier version report:
thanos, version 0.13.0 (branch: HEAD, revision: adf6facb8d6bf44097aae084ec091ac3febd9eb8)
build user: root@ee9c796b3048
build date: 20200622-09:49:32
go version: go1.14.2
I can try 0.14.0 if it contains any relevant fixes
Please update, but I think it should have the fix. If 0.14 doesn't work, then what would be awesome is to have the exact chunks for that problematic period. You can obtain those by running the following script against the Querier gRPC API directly: https://github.com/thanos-io/thanos/blob/40526f52f54d4501737e5246c0e71e56dd7e0b2d/scripts/insecure_grpcurl_series.sh (: This will give us the exact input that the deduplication logic is using.
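A sketch of how that script is typically invoked, going by its usage header (double-check the header in your version; the address, matcher, and millisecond timestamps below are placeholders):

```sh
# Dump the raw, pre-deduplication series/chunks for the problematic metric
# straight from the Querier's gRPC StoreAPI.
# Args: <host:port> <matchers JSON> <minTime ms> <maxTime ms>
bash scripts/insecure_grpcurl_series.sh \
  localhost:10901 \
  '[{"type": 0, "name": "__name__", "value": "my_app_counter"}]' \
  1594800000000 1594803600000
```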
I think it has to do with some edge values… :thinking:
0.14 has the same problem. I had to use a different metric to catch it. The problem persists for a few seconds, so I tried to capture both correct and incorrect results: bad.txt, good.txt
I was about to report this same issue!
What is happening, in my system at least, is that initialPenalty defaults to 5000ms, but I have scrape intervals in the 15s-30s range. The two HA Prometheus instances will commonly scrape the same target ~10s apart, which means the deduplication logic will always switch series after the first sample, causing extra data (see the sketch below).
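To make that concrete, here is a minimal, self-contained Go sketch of a penalty-style merger. It is a deliberate simplification for illustration, not the actual Thanos dedup iterator, and the sample data is made up:

```go
package main

import "fmt"

// sample is one scraped (timestamp, value) pair; timestamps are Unix ms.
type sample struct {
	t int64
	v float64
}

// dedup is a simplified sketch of penalty-based deduplication: stay on the
// current replica while its next sample is within `penalty` ms of the last
// emitted timestamp; otherwise switch replicas and raise the penalty.
func dedup(replicas [2][]sample, initialPenalty int64) []sample {
	var out []sample
	cur, penalty, lastT := 0, initialPenalty, int64(-1)
	var idx [2]int
	for idx[0] < len(replicas[0]) || idx[1] < len(replicas[1]) {
		// Drop samples at or before the last emitted timestamp: those are
		// the duplicates we are merging away.
		for i := 0; i < 2; i++ {
			for idx[i] < len(replicas[i]) && replicas[i][idx[i]].t <= lastT {
				idx[i]++
			}
		}
		other := 1 - cur
		if idx[cur] >= len(replicas[cur]) ||
			(lastT >= 0 && replicas[cur][idx[cur]].t > lastT+penalty &&
				idx[other] < len(replicas[other])) {
			// Current replica is exhausted or has a gap bigger than the
			// penalty: switch to the other replica and penalize switching.
			cur = other
			penalty *= 2
		}
		if idx[cur] >= len(replicas[cur]) {
			break
		}
		s := replicas[cur][idx[cur]]
		out = append(out, s)
		lastT = s.t
		idx[cur]++
	}
	return out
}

func main() {
	// Two HA replicas scraping every 15s, offset by ~10s, with the 5s
	// initial penalty described above.
	a := []sample{{0, 1}, {15000, 2}, {30000, 3}, {45000, 4}}
	b := []sample{{10000, 1}, {25000, 2}, {40000, 3}, {55000, 4}}
	fmt.Println(dedup([2][]sample{a, b}, 5000))
}
```

With those inputs it emits points at 0s, 10s, 15s, 30s, 45s, 55s: the switch right after the first sample injects extra, irregularly spaced early points, exactly the kind of spacing that makes rate/irate misbehave.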
Here is a failing test that reproduces the behavior I see: https://github.com/thanos-io/thanos/compare/master...csmarchbanks:unstable-queries
I pushed out a custom version of Thanos that uses a 30s initial penalty, and the problem has gone away for me. However, if someone had a 1m scrape interval, a 30s initial penalty still would not be enough, and it would be way too big for someone with a 5s scrape interval.
We have similar setup. Two Prometheus instances scraping the same targets with 10s interval.
Awesome. Looks like we might want to adjust the penalty based on the scrape interval dynamically. The edge case is when the interval is reconfigured to something different (:
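One possible shape of that idea, as a hedged sketch (an assumption, not Thanos code; it reuses the sample type from the sketch above):

```go
// dynamicInitialPenalty derives the initial penalty from the gap actually
// observed in one replica's stream, so a 15s scraper gets a larger penalty
// than a 5s one without any configuration.
func dynamicInitialPenalty(s []sample, fallback int64) int64 {
	if len(s) < 2 {
		return fallback // too little data to estimate the scrape interval
	}
	interval := s[1].t - s[0].t
	// Tolerate one missed scrape before switching replicas.
	return 2 * interval
}
```

As noted, a reconfigured interval breaks a one-shot estimate like this; a rolling estimate over recent gaps would be needed.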
I opened a PR that uses the request resolution to handle cases like this one. I am still working on the tests, but so far it is looking good: https://github.com/thanos-io/thanos/pull/3010
Update: resolution-based dedup doesn't work with PromQL functions, so I reverted the PR.
In our case, adjusting the default look-back delta solved the problem, so it seems that the problem here is different and is caused by scrape-time shifting.
@csmarchbanks in your case I think the main problem is that scraping between the different replicas is shifted by 30s. That seems like a lot; if you manage to align it better, the problem should be solved.
I think I'm running into a similar issue on Thanos 0.14.0
The PromQL is sum(irate(my_app_counter[5m])), where my_app_counter is a counter.
Thanos with deduplication unchecked (screenshot).
Thanos with deduplication checked shows a huge spike (screenshot).
Prometheus replica 1 (screenshot).
Prometheus replica 2 (screenshot).
I have 2 replica Prometheus pollers scraping with a 60s interval. I tried the suggestion here to increase initialPenalty for a few tests, but even setting initialPenalty = 60000 didn't help: https://github.com/thanos-io/thanos/issues/2890#issuecomment-658810446
It's possible they are out of sync - what's the best way to get them back in sync? Restart both pollers simultaneously and pray?
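One low-effort way to measure the offset before restarting anything (the job label below is a placeholder; timestamp() is standard PromQL):

```promql
# Each sample's own timestamp is its scrape time, in seconds. Query each
# replica (or Thanos with dedup off) and compare: a steady difference is
# the scrape offset between the two pollers.
timestamp(up{job="my_app"})
```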
Try adjusting the look-back delta; this should solve the problem.
Thanks - looks like I need to upgrade to 0.15.0+ to be able to use this new flag? What's a good value to pick with a scrape interval of 60s? Is the 5-minute default not big enough?
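For reference, the flag is set on the query component like this (illustrative only; the store addresses are placeholders, and 10m is just an example value above the 5m default):

```sh
thanos query \
  --query.lookback-delta=10m \
  --store=prometheus-sidecar-0:10901 \
  --store=prometheus-sidecar-1:10901
```

With a 60s scrape interval the 5m default already spans several scrapes, which may be why raising it further doesn't help below.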
Looks like upgrading to 0.16.0-rc0 and modifying --query.lookback-delta isn't fixing the above situation.
Hm, strange. Then I am out of ideas, sorry.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.
Still to investigate, to ensure a solid deduplication algorithm (:
(The stale bot posted the same inactivity notice again.)
Still to investigate / try to repro
Still valid and needs investigation.
(The stale bot posted the same inactivity notice again.)
Still valid.
(The stale bot posted the same inactivity notice again.)
Closing for now as promised, let us know if you need this to be reopened! 🤗
I think this is still valid, not stale.
Just hit and discovered this issue with Thanos 0.32.4.
@bwplotka this issue has been closed by the bot while the bug is still valid. Can we get feedback and reopen the issue?
Could someone help by uploading two blocks and then sharing what query to execute, and at which timestamp, to reproduce this?
I am facing the same problem with version 0.32.5. Is there any solution?