
Flaky compact penalty deduplication E2E test

Open yeya24 opened this issue 4 years ago • 10 comments

Link: https://github.com/thanos-io/thanos/runs/4199928541?check_suite_focus=true

=== CONT  TestCompactWithStoreGatewayWithPenaltyDedup/dedup_enabled;_no_delete_delay;_compactor_should_work_and_remove_things_as_expected
Error:     compact_test.go:759: compact_test.go:759:
        
         unexpected error: unable to find metrics [thanos_compact_blocks_marked_total] with expected values after 50 retries. Last error: <nil>. Last values: [2]

This flaky error seems to happen a lot. At https://github.com/thanos-io/thanos/blob/main/test/e2e/compact_test.go#L759, we should ideally get 0 for this metric, because all the compactions are done in the previous step, so all source blocks are already marked for deletion. This error needs investigation.

yeya24 avatar Nov 13 '21 19:11 yeya24

We have started hitting a lot of flakes on this test case recently, but it looks like the issue is with thanos_blocks_meta_synced as well, on line 848:

unexpected error: unable to find metrics [thanos_blocks_meta_synced] with expected values after 50 retries. Last error: <nil>. Last values: [43]

matej-g avatar Dec 16 '21 16:12 matej-g

Ran this test quite a few times locally and cannot reproduce :/

GiedriusS avatar Dec 17 '21 09:12 GiedriusS

Not 100% sure it's related, but seems likely: I keep seeing

unexpected error: unable to find metrics [thanos_compact_iterations_total] with expected values after 50 retries. Last error: <nil>. Last values: [0]

From both the TestCompactWithStoreGatewayWithPenaltyDedup and TestCompactWithStoreGateway tests in CI. Local runs of those tests pass just fine.

EDIT: This is after pulling in the latest main branch with the parallel test changes for CI.

nberkley avatar Dec 22 '21 16:12 nberkley

Ok, nope, pretty sure my thing is unrelated. It complains about overlapping blocks and halts in a test specifically setting up overlapping blocks. Not sure yet how this differs between CI and local, but it's definitely not the same as those timeout issues.

nberkley avatar Dec 22 '21 17:12 nberkley

I want to try to improve this by making the backoff in the method that waits on the metrics configurable upstream, and then increasing the number of retries and/or the backoff interval.

matej-g avatar Feb 08 '22 17:02 matej-g

Fixed by https://github.com/thanos-io/thanos/pull/5246, let's finally close this :closed_book:

matej-g avatar Mar 21 '22 09:03 matej-g

It's still haunting us :cry: See e.g. https://github.com/thanos-io/thanos/runs/6641229048?check_suite_focus=true but I've seen it multiple times again.

matej-g avatar May 30 '22 15:05 matej-g

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Jul 31 '22 04:07 stale[bot]

Still valid

yeya24 avatar Jul 31 '22 04:07 yeya24

Are we seeing this flake after https://github.com/thanos-io/thanos/pull/5563?

matej-g avatar Aug 04 '22 08:08 matej-g

As this is now popping up more often, I have suggested disabling the test again in https://github.com/thanos-io/thanos/pull/5731.

matej-g avatar Sep 27 '22 08:09 matej-g

I think this one has been fixed. At least I haven't seen it fail for a while, since https://github.com/thanos-io/thanos/pull/6064 was merged.

douglascamata avatar May 04 '23 13:05 douglascamata

I was wondering whether this was resolved, but then noticed this run today: https://github.com/thanos-io/thanos/actions/runs/4882598103/jobs/8712867711?pr=6336#step:5:2173 😞

matej-g avatar May 04 '23 13:05 matej-g

@matej-g what a mouth I have. At least that's a different error from the one I identified and fixed back then with the PR I mentioned. :/

douglascamata avatar May 04 '23 14:05 douglascamata