mimir Compactor blocks cleaner: retry operations that could interfere with rewriting bucket index

Open seizethedave opened this issue 9 months ago • 0 comments

What this PR does

This PR adds retries to the operations in BlocksCleaner.cleanUser whose failure could prevent the bucket index from being rewritten: ReadIndex and WriteIndex.

And why:

When the blocks cleaner runs for a tenant, it carries out a series of steps to perform one cleanUser pass. Most of these steps involve an objstore invocation. (Fetching a block index, iterating the paths under a block folder, deleting a marker...) In this series of steps, there are currently two avenues for "retries":

1. Retries that the objstore provider SDKs (GCS, Minio, and so on) perform. For example, the GCS SDK will automatically retry operations that it deems idempotent, and it has a suite of rules to determine which errors it will retry. Minio has similar (but different) policies around automatically retrying things.
2. Every 15 minutes (by default), the tenant's blocks cleaner job will be run again.

We currently rely on Avenue 2 to eventually recover from past blocks cleaner failures. But the crux of a recent incident was that every step in cleanUser must complete successfully for the updated bucket index to be written. If cleanUser fails enough consecutive times, store-gateways will refuse to load the "stale" bucket index, and some queries will begin to fail. In that incident, a larger-than-usual percentage of object store calls were exceeding their context deadline (which looks like network flakiness), hence the >=4 consecutive cleanUser failures (at the default 15-minute interval) leading to a >=1 hour stale bucket index.

Notes:

ReadIndex and WriteIndex already have 1 minute hardcoded deadlines, so the new outer request deadlines I've chosen for those are safe.
UpdateIndex could use a retry, too, because if that method returns an error, the bucket index won't be rewritten. However, I've done some analysis of our logs, and UpdateIndex can take 5+ minutes for some tenants while it updates large numbers of deletion markers. I'm not going to add time-based retries on that method, so as not to accidentally interrupt legitimate work in progress. A follow-up could add finer-grained retries on the individual operations inside UpdateIndex.

Which issue(s) this PR fixes or relates to

Relates to #7980

Checklist

[x] Tests updated.
[ ] Documentation added.
[x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
[ ] about-versioning.md updated with experimental features.

May 06 '24 22:05 seizethedave
