Flaky compact penalty deduplication E2E test
Link: https://github.com/thanos-io/thanos/runs/4199928541?check_suite_focus=true
=== CONT TestCompactWithStoreGatewayWithPenaltyDedup/dedup_enabled;_no_delete_delay;_compactor_should_work_and_remove_things_as_expected
Error: compact_test.go:759: compact_test.go:759:
unexpected error: unable to find metrics [thanos_compact_blocks_marked_total] with expected values after 50 retries. Last error: <nil>. Last values: [2]
This flaky error seems to happen a lot. At https://github.com/thanos-io/thanos/blob/main/test/e2e/compact_test.go#L759 we should ideally get 0 for this metric, because all compactions are done in the previous step and so all source blocks are already marked for deletion. This error needs investigation.
We have started hitting a lot of flakes on this test case recently, but it looks like the issue also affects thanos_blocks_meta_synced, on line 848:
unexpected error: unable to find metrics [thanos_blocks_meta_synced] with expected values after 50 retries. Last error: <nil>. Last values: [43]
Ran this test quite a few times locally and cannot reproduce :/
Not 100% sure it's related, but it seems likely: I keep seeing
unexpected error: unable to find metrics [thanos_compact_iterations_total] with expected values after 50 retries. Last error: <nil>. Last values: [0]
From both the TestCompactWithStoreGatewayWithPenaltyDedup and TestCompactWithStoreGateway tests in CI. Local runs of those tests pass just fine.
EDIT: This is after pulling in the latest main branch with the parallel test changes for CI.
Ok, nope, pretty sure my thing is unrelated. It complains about overlapping blocks and halts in a test specifically setting up overlapping blocks. Not sure yet how this differs between CI and local, but it's definitely not the same as those timeout issues.
I want to try to improve this by making the backoff in the method that waits on the metrics configurable upstream, and by increasing the number of retries and/or the backoff interval.
Fixed by https://github.com/thanos-io/thanos/pull/5246, let's finally close this :closed_book:
It's still haunting us :cry: See e.g. https://github.com/thanos-io/thanos/runs/6641229048?check_suite_focus=true but I've seen it multiple times again.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
Still valid
Are we seeing this flake after https://github.com/thanos-io/thanos/pull/5563?
As this is now popping up more often, I have suggested to disable the test again https://github.com/thanos-io/thanos/pull/5731.
I think this one has been fixed. At least I haven't seen it fail for a while since https://github.com/thanos-io/thanos/pull/6064 was merged.
I was wondering whether this is resolved, but then noticed this run today: https://github.com/thanos-io/thanos/actions/runs/4882598103/jobs/8712867711?pr=6336#step:5:2173 😞
@matej-g what a mouth I have. At least that's a different error than the one I identified and fixed back then with the PR I mentioned. :/