Compactor blocks cleaner: retry operations that could interfere with rewriting bucket index
What this PR does
This PR adds retries to operations in `BlocksCleaner.cleanUser` whose failures could lead to the bucket index failing to be rewritten (`ReadIndex` and `WriteIndex`).
And why:
When the blocks cleaner runs for a tenant, it carries out a series of steps to perform one `cleanUser` pass. Most of these steps involve an objstore invocation. (Fetching a block index, iterating the paths under a block folder, deleting a marker...)
In this series of steps, there are currently two avenues for "retries":
- Retries that the objstore provider SDKs (GCS, Minio, and so on) perform. For example, the GCS SDK automatically retries operations that it deems idempotent, and it has a suite of rules to determine which errors it retries. Minio has similar (but different) policies for automatic retries.
- Every 15 minutes (by default), the tenant's blocks cleaner job runs again.
We are currently relying on the second avenue (the periodic re-run) to eventually recover from past blocks cleaner failures. But the crux of a recent incident was that every step in `cleanUser` must complete successfully for the updated bucket index to be written. If `cleanUser` fails enough consecutive times, store-gateways will refuse to load the "stale" bucket index, and some queries will begin to fail. In that incident, a larger-than-usual percentage of object store calls were exceeding their context deadline (which looks like network flakiness), hence the >=4 consecutive `cleanUser` failures leading to a >=1 hour stale bucket index.
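The arithmetic behind the ">=4 failures, >=1 hour" figure can be sketched as follows. This is an illustration only: `failuresUntilStale` is a hypothetical helper, and `assumedMaxStale` is an assumption taken from the incident description above rather than a value read from any configuration. It also ignores how long each cleaner pass takes.

```go
package main

import "time"

const (
	// cleanupInterval is the default interval between cleaner runs noted above.
	cleanupInterval = 15 * time.Minute
	// assumedMaxStale is an ASSUMPTION based on the incident description
	// (an index that was >=1 hour old was rejected). Check the store-gateway
	// configuration for the value in effect.
	assumedMaxStale = time.Hour
)

// failuresUntilStale (hypothetical) returns the number of consecutive failed
// cleaner passes after which the last successfully written bucket index is
// at least maxStale old, assuming one pass per interval.
func failuresUntilStale(interval, maxStale time.Duration) int {
	if interval <= 0 {
		return 0
	}
	n := int(maxStale / interval)
	if maxStale%interval != 0 {
		n++ // round up: a partial interval still needs one more failed pass
	}
	return n
}
```

With the two constants above, `failuresUntilStale(cleanupInterval, assumedMaxStale)` evaluates to 4, which matches the four consecutive `cleanUser` failures seen in the incident.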
Notes:
- `ReadIndex` and `WriteIndex` already have 1 minute hardcoded deadlines, so the new outer request deadlines I've chosen for those are safe.
- `UpdateIndex` could use a retry, too, because if that method returns an error, the bucket index won't be rewritten. However, I've done some analysis in our logs, and `UpdateIndex` for some tenants can take 5+ minutes while it updates scads of deletion markers. I'm not going to add time-based retries on that method so as not to accidentally break any legitimate work being done. There's room to come back and add finer-grained retries on the operations inside of `UpdateIndex`.
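For reviewers who want a mental model of the pattern, here is a minimal sketch of a retry loop that gives every attempt its own deadline. It is not the code in this PR: `retryWithDeadline` and `runDemo` are hypothetical names, the 90 second per-attempt deadline and 1 millisecond backoff are made-up values, and the operation is a stand-in for a call such as `ReadIndex`. The only point it tries to show is that the outer per-attempt deadline is longer than the 1 minute inner deadline, so it never cuts a legitimate attempt short.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// retryWithDeadline (hypothetical) calls op up to maxAttempts times. Each
// attempt gets its own deadline derived from ctx, and the loop sleeps for
// backoff between failed attempts.
func retryWithDeadline(ctx context.Context, maxAttempts int, perAttempt, backoff time.Duration, op func(context.Context) error) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		err := op(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		lastErr = err

		// If the parent context is done, the whole cleaner pass was
		// cancelled, so further attempts cannot succeed.
		if ctx.Err() != nil {
			break
		}
		if attempt < maxAttempts {
			select {
			case <-time.After(backoff):
			case <-ctx.Done():
				return fmt.Errorf("retry aborted: %w", lastErr)
			}
		}
	}
	return fmt.Errorf("all attempts failed: %w", lastErr)
}

// runDemo (hypothetical) simulates an object-store call that exceeds its
// deadline for the first `failures` calls and then succeeds. It returns how
// many times the operation was called and the final error.
func runDemo(failures, maxAttempts int) (int, error) {
	calls := 0
	op := func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return context.DeadlineExceeded
		}
		return nil
	}
	// ASSUMPTION: 90s per attempt, chosen only to be longer than the
	// 1 minute inner deadline; 1ms backoff keeps the demo fast.
	err := retryWithDeadline(context.Background(), maxAttempts, 90*time.Second, time.Millisecond, op)
	return calls, err
}
```

In this sketch, `runDemo(2, 3)` succeeds on the third call, while `runDemo(5, 3)` gives up after three calls and returns the last error. A method like `UpdateIndex`, which can legitimately run for minutes, would not be a good fit for this kind of time-bounded wrapper, which is why it is left out here as well.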
Which issue(s) this PR fixes or relates to
- Relates to #7980
Checklist
- [x] Tests updated.
- [ ] Documentation added.
- [x] `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- [ ] `about-versioning.md` updated with experimental features.