query: invalid rate/irate with deduplication for most recent values
Thanos, Prometheus and Golang version used: bitnami/thanos:0.13.0, prom/prometheus:v2.19.0
Prometheus in HA configuration, 2 instances. Single Thanos querier instance.
Object Storage Provider: On-premise MinIO deployment
What happened: Executing rate or irate with deduplication sometimes results in the most recent value being invalid: it either shoots up to a very high value or drops to a very low one. Turning off deduplication produces correct results from both Prometheus instances.
With deduplication, good result (screenshot).

With deduplication, bad result: executing the same query eventually gives an incorrect result like this (screenshot).

The same query without deduplication always produces a correct result (screenshot).
What you expected to happen:
The query with deduplication should always give a correct result like this (screenshot).
How to reproduce it (as minimally and precisely as possible): Prometheus in HA, any rate or irate query.
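For example, any counter that both replicas scrape will do; a query like the following (the metric name is just an illustration) intermittently shows the bad last point when deduplication is on:

```promql
# Most recent point intermittently spikes with dedup enabled; with dedup
# disabled the result is correct on both replicas.
rate(prometheus_http_requests_total[5m])
```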
Full logs to relevant components: Nothing that would correlate with queries.
Thanks for reporting. We thought we had found all those kinds of issues, but there might be more. The last one was supposed to be fixed in 0.13.0. Can you double-check that you run Thanos with the fix from https://github.com/thanos-io/thanos/issues/2401 included? (:
Note that it's the Querier version that matters for deduplication.
Here's the Querier version report:
thanos, version 0.13.0 (branch: HEAD, revision: adf6facb8d6bf44097aae084ec091ac3febd9eb8)
build user: root@ee9c796b3048
build date: 20200622-09:49:32
go version: go1.14.2
I can try 0.14.0 if it contains any relevant fixes
Please update, but I think it should have the fix. If 0.14 doesn't work, then what would be awesome is to have the exact chunks for that problematic period. You can obtain those by running the following script against the Querier gRPC API directly: https://github.com/thanos-io/thanos/blob/40526f52f54d4501737e5246c0e71e56dd7e0b2d/scripts/insecure_grpcurl_series.sh (: This will give us the exact input that the deduplication logic is using.
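A sketch of how that script is typically invoked, going by its usage header (double-check the header in your version; the address, matcher, and millisecond timestamps below are placeholders):

```sh
# Dump the raw, pre-deduplication series/chunks for the problematic metric
# straight from the Querier's gRPC StoreAPI.
# Args: <host:port> <matchers JSON> <minTime ms> <maxTime ms>
bash scripts/insecure_grpcurl_series.sh \
  localhost:10901 \
  '[{"type": 0, "name": "__name__", "value": "my_app_counter"}]' \
  1594800000000 1594803600000
```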
I think it has to do with some edge values… :thinking:
0.14 has the same problem. I had to use a different metric to catch it. The problem persists for a few seconds, so I tried to capture both correct and incorrect results: bad.txt, good.txt
I was about to report this same issue!
What is happening, in my system at least, is that initialPenalty defaults to 5000ms, but I have scrape intervals in the 15s-30s range. The two HA Prometheus instances will commonly scrape the same target ~10s apart, which means the deduplication logic will always switch series after the first sample, causing extra data (see the sketch below).
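To make that concrete, here is a minimal, self-contained Go sketch of a penalty-style merger. It is a deliberate simplification for illustration, not the actual Thanos dedup iterator, and the sample data is made up:

```go
package main

import "fmt"

// sample is one scraped (timestamp, value) pair; timestamps are Unix ms.
type sample struct {
	t int64
	v float64
}

// dedup is a simplified sketch of penalty-based deduplication: stay on the
// current replica while its next sample is within `penalty` ms of the last
// emitted timestamp; otherwise switch replicas and raise the penalty.
func dedup(replicas [2][]sample, initialPenalty int64) []sample {
	var out []sample
	cur, penalty, lastT := 0, initialPenalty, int64(-1)
	var idx [2]int
	for idx[0] < len(replicas[0]) || idx[1] < len(replicas[1]) {
		// Drop samples at or before the last emitted timestamp: those are
		// the duplicates we are merging away.
		for i := 0; i < 2; i++ {
			for idx[i] < len(replicas[i]) && replicas[i][idx[i]].t <= lastT {
				idx[i]++
			}
		}
		other := 1 - cur
		if idx[cur] >= len(replicas[cur]) ||
			(lastT >= 0 && replicas[cur][idx[cur]].t > lastT+penalty &&
				idx[other] < len(replicas[other])) {
			// Current replica is exhausted or has a gap bigger than the
			// penalty: switch to the other replica and penalize switching.
			cur = other
			penalty *= 2
		}
		if idx[cur] >= len(replicas[cur]) {
			break
		}
		s := replicas[cur][idx[cur]]
		out = append(out, s)
		lastT = s.t
		idx[cur]++
	}
	return out
}

func main() {
	// Two HA replicas scraping every 15s, offset by ~10s, with the 5s
	// initial penalty described above.
	a := []sample{{0, 1}, {15000, 2}, {30000, 3}, {45000, 4}}
	b := []sample{{10000, 1}, {25000, 2}, {40000, 3}, {55000, 4}}
	fmt.Println(dedup([2][]sample{a, b}, 5000))
}
```

With those inputs it emits points at 0s, 10s, 15s, 30s, 45s, 55s: the switch right after the first sample injects extra, irregularly spaced early points, exactly the kind of spacing that makes rate/irate misbehave.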
Here is a failing test that reproduces the behavior I see: https://github.com/thanos-io/thanos/compare/master...csmarchbanks:unstable-queries
I pushed out a custom version of Thanos that uses a 30s initial penalty, and the problem has gone away for me. However, if someone had a 1m scrape interval, a 30s initial penalty still would not be enough, and it would be way too big for someone with a 5s scrape interval.
We have similar setup. Two Prometheus instances scraping the same targets with 10s interval.
Awesome. Looks like we might want to adjust the penalty based on the scrape interval dynamically. The edge case is when the interval is reconfigured to something different (:
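One possible shape of that idea, as a hedged sketch (an assumption, not Thanos code; it reuses the sample type from the sketch above):

```go
// dynamicInitialPenalty derives the initial penalty from the gap actually
// observed in one replica's stream, so a 15s scraper gets a larger penalty
// than a 5s one without any configuration.
func dynamicInitialPenalty(s []sample, fallback int64) int64 {
	if len(s) < 2 {
		return fallback // too little data to estimate the scrape interval
	}
	interval := s[1].t - s[0].t
	// Tolerate one missed scrape before switching replicas.
	return 2 * interval
}
```

As noted, a reconfigured interval breaks a one-shot estimate like this; a rolling estimate over recent gaps would be needed.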
I opened a PR that uses the request resolution to handle cases like this one. I am still working on the tests, but so far it is looking good: https://github.com/thanos-io/thanos/pull/3010
Update: resolution-based dedup doesn't work with PromQL functions, so I reverted the PR.
In our case, adjusting the default look-back delta solved the problem, so it seems that the problem here is different and is caused by scrape-time shifting.
@csmarchbanks in your case I think the main problem is that scraping between the different replicas is shifted by 30s. That seems like a lot; if you manage to align it better, the problem should be solved.
I think I'm running into a similar issue on Thanos 0.14.0
The PromQL is sum(irate(my_app_counter[5m])), where my_app_counter is a counter.
Thanos with deduplication unchecked (screenshot).
Thanos with deduplication checked shows a huge spike (screenshot).
Prometheus replica 1 (screenshot).
Prometheus replica 2 (screenshot).
I have 2 replica Prometheus pollers scraping with a 60s interval. I tried the suggestion here to increase initialPenalty for a few tests, but even setting initialPenalty = 60000 didn't help: https://github.com/thanos-io/thanos/issues/2890#issuecomment-658810446
It's possible they are out of sync - what's the best way to get them back in sync? Restart both pollers simultaneously and pray?
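One low-effort way to measure the offset before restarting anything (the job label below is a placeholder; timestamp() is standard PromQL):

```promql
# Each sample's own timestamp is its scrape time, in seconds. Query each
# replica (or Thanos with dedup off) and compare: a steady difference is
# the scrape offset between the two pollers.
timestamp(up{job="my_app"})
```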
Try adjusting the look-back delta; this should solve the problem.
Thanks - looks like I need to upgrade to 0.15.0+ to be able to use this new flag? What's a good value to pick with a scrape interval of 60s? Is the 5-minute default not big enough?
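For reference, the flag is set on the query component like this (illustrative only; the store addresses are placeholders, and 10m is just an example value above the 5m default):

```sh
thanos query \
  --query.lookback-delta=10m \
  --store=prometheus-sidecar-0:10901 \
  --store=prometheus-sidecar-1:10901
```

With a 60s scrape interval the 5m default already spans several scrapes, which may be why raising it further doesn't help below.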
Looks like upgrading to 0.16.0-rc0 and modifying --query.lookback-delta isn't fixing the above situation.
Hm, strange. Then I am out of ideas, sorry.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.
Still to investigate, to ensure a solid deduplication algorithm (:
(The stale bot posted the same inactivity notice again.)
Still to investigate / try to repro
Still valid and needs investigation.
(The stale bot posted the same inactivity notice again.)
Still valid.
(The stale bot posted the same inactivity notice again.)
Closing for now as promised, let us know if you need this to be reopened! 🤗
I think this is still valid, not stale.
Just hit and discovered this issue with Thanos 0.32.4.
@bwplotka this issue has been closed by the bot while the bug is still valid. Can we get feedback and reopen the issue?
Could someone help by uploading two blocks and then sharing what query to execute, and at which timestamp, to reproduce this?
I am facing the same problem with version 0.32.5. Is there any solution?