Auto-forget unhealthy compactors
Is your feature request related to a problem? Please describe.
When a compactor is not cleanly shut down, it is not unregistered from the ring. This means that if a compactor is ever shut down uncleanly when it does not plan on coming back, a subset of tenants (those assigned to the missing instance) will never be compacted. Some example scenarios:
- Compactors are scaled down, but for some reason the compactor is shut down uncleanly, leaving e.g. compactor-N in the ring.
- The compactor is renamed, but was shut down uncleanly before the rename, leaving the old compactor name in the ring.
Auto-forgetting these stale entries would avoid the need for manual intervention to resolve the issue.
Describe the solution you'd like
We should use ring.NewAutoForgetDelegate on the compactor ring. Compactors are not stateful in any way, so there is no harm in removing them automatically.
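For illustration, here is a minimal sketch of how the auto-forget delegate could be chained into the compactor's ring lifecycler, assuming the delegate-wiring pattern used by other ring-based components; the function name, the token count, the `ringAutoForgetUnhealthyPeriods` multiplier, and the config fields shown are illustrative rather than the actual Mimir compactor code:

```go
package compactor

import (
	"time"

	"github.com/go-kit/log"
	"github.com/grafana/dskit/kv"
	"github.com/grafana/dskit/ring"
	"github.com/prometheus/client_golang/prometheus"
)

// After how many heartbeat timeouts an unhealthy instance is forgotten.
// The multiplier value is illustrative.
const ringAutoForgetUnhealthyPeriods = 10

// newRingLifecycler sketches how the compactor could chain an auto-forget
// delegate in front of its existing lifecycler delegates.
func newRingLifecycler(instanceID, instanceAddr string, heartbeatPeriod, heartbeatTimeout time.Duration,
	store kv.Client, logger log.Logger, reg prometheus.Registerer) (*ring.BasicLifecycler, error) {

	var delegate ring.BasicLifecyclerDelegate
	delegate = ring.NewInstanceRegisterDelegate(ring.ACTIVE, 512) // token count is illustrative
	delegate = ring.NewLeaveOnStoppingDelegate(delegate, logger)
	// Automatically remove ring entries whose last heartbeat is older than
	// ringAutoForgetUnhealthyPeriods * heartbeatTimeout.
	delegate = ring.NewAutoForgetDelegate(ringAutoForgetUnhealthyPeriods*heartbeatTimeout, delegate, logger)

	cfg := ring.BasicLifecyclerConfig{
		ID:              instanceID,
		Addr:            instanceAddr,
		HeartbeatPeriod: heartbeatPeriod,
		NumTokens:       512,
	}
	return ring.NewBasicLifecycler(cfg, "compactor", "compactor", store, delegate, logger, reg)
}
```

With this wiring, an entry whose heartbeat is older than the forget period is removed from the ring automatically, so its tenants get resharded to the remaining healthy compactors.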
Describe alternatives you've considered
Do nothing - compactors can continue to be manually forgotten.
Additional context
I'm definitely up for adding the auto-forget to the compactor.
> This means that if a compactor is ever shut down uncleanly when it does not plan on coming back, a subset of tenants (those assigned to the missing instance) will never be compacted.
I got surprised by this. If a compactor is unhealthy in the ring, I was expecting its jobs to be resharded to other compactor replicas. Could you double check this behaviour? There may be a bug somewhere.
Alternative: have the forgetting done by an operator outside of the compactor ring. This approach is easier to coordinate (the operator can run a leader election to ensure there is only one instance running), and it can take other information into account, e.g. how many compactors are configured in the StatefulSet.
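For the sake of discussion, here is a hedged sketch of the reconciliation loop such an operator could run. The `RingAdmin` interface, the pod-name parsing, and the forget policy are hypothetical placeholders (not an existing Mimir or dskit API), and a real operator would additionally need leader election and a concrete way to read and mutate the ring:

```go
package forgetter

import (
	"context"
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// RingAdmin is a hypothetical abstraction over whatever mechanism the operator
// uses to inspect and mutate the compactor ring (e.g. the ring KV store or an
// admin endpoint). It is not an existing Mimir or dskit API.
type RingAdmin interface {
	// ListInstances returns instance ID -> last heartbeat time.
	ListInstances(ctx context.Context) (map[string]time.Time, error)
	// Forget removes the given instance from the ring.
	Forget(ctx context.Context, instanceID string) error
}

// Matches a trailing StatefulSet ordinal such as "-3" in "compactor-3".
var ordinalRe = regexp.MustCompile(`-(\d+)$`)

// reconcile forgets ring entries that have stopped heartbeating and are not
// expected to come back, using the desired StatefulSet replica count as extra
// context. It assumes it runs on a single elected leader.
func reconcile(ctx context.Context, admin RingAdmin, desiredReplicas int, staleAfter time.Duration) error {
	instances, err := admin.ListInstances(ctx)
	if err != nil {
		return err
	}
	for id, lastHeartbeat := range instances {
		if time.Since(lastHeartbeat) <= staleAfter {
			continue // still heartbeating (or only briefly unhealthy)
		}
		// A stale entry is expected back only if its ordinal is within the
		// desired replica count; scaled-down ordinals or names that no longer
		// match the pod naming scheme (e.g. after a rename) are not.
		expectedBack := false
		if m := ordinalRe.FindStringSubmatch(id); m != nil {
			if ordinal, convErr := strconv.Atoi(m[1]); convErr == nil && ordinal < desiredReplicas {
				expectedBack = true
			}
		}
		if expectedBack {
			continue
		}
		if err := admin.Forget(ctx, id); err != nil {
			return fmt.Errorf("forgetting %s: %w", id, err)
		}
	}
	return nil
}
```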
Please open a related doc issue so that this can be documented for the operator in the core docs.
Sorry, I'm using "operator" as a term of art meaning a (Kubernetes) program that automates some functions that historically a person would have done. https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
> I got surprised by this. If a compactor is unhealthy in the ring, I was expecting its jobs to be resharded to other compactor replicas. Could you double check this behaviour? There may be a bug somewhere.
I think it's working as you'd expect - the compactor ring uses a replication factor of 1, because all it's doing is sharding jobs over compactors (it's basically using the ring for election). When an entry goes unhealthy, its tenants do not move. If we add this change, it would behave as you describe: the unhealthy compactors are forgotten, and their tenants are moved to the remaining ones.
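To make the RF=1 behaviour concrete, here is a simplified sketch of the per-tenant ownership check: each tenant hashes onto exactly one ring entry, so an unhealthy-but-registered compactor keeps "owning" its tenants until the entry is forgotten. The `ownsTenant` helper, the FNV hash, and the use of `ring.Write` are illustrative, and the dskit `ReadRing.Get` signature is assumed here rather than confirmed against the current version:

```go
package compactor

import (
	"hash/fnv"

	"github.com/grafana/dskit/ring"
)

// ownsTenant sketches the replication-factor-1 ownership check: each tenant
// hashes onto exactly one ring entry, so ownership never spreads to other
// replicas. The hash function and ring operation are illustrative.
func ownsTenant(r ring.ReadRing, instanceAddr, userID string) (bool, error) {
	hasher := fnv.New32a()
	_, _ = hasher.Write([]byte(userID))
	tenantHash := hasher.Sum32()

	// With RF=1 the replication set holds a single instance. If that instance
	// is an unhealthy-but-registered compactor, the tenant never moves to a
	// healthy replica until the stale entry is forgotten.
	rs, err := r.Get(tenantHash, ring.Write, nil, nil, nil)
	if err != nil {
		return false, err
	}
	for _, inst := range rs.Instances {
		if inst.Addr == instanceAddr {
			return true, nil
		}
	}
	return false, nil
}
```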