
Auto-forget unhealthy compactors

Open stevesg opened this issue 2 years ago • 5 comments

Is your feature request related to a problem? Please describe.

When a compactor is not shut down cleanly, it is not unregistered from the ring. This means that if a compactor is ever shut down uncleanly and will not come back, a subset of tenants (those assigned to the missing instance) will never be compacted. Some example scenarios:

  • Compactors are scaled down, but for some reason a compactor is shut down uncleanly, leaving e.g. compactor-N in the ring.
  • A compactor is renamed, but is shut down uncleanly before the rename, leaving the old compactor name in the ring.

Auto-forgetting such instances would avoid the need for manual intervention to resolve the issue.

Describe the solution you'd like

We should use ring.NewAutoForgetDelegate on the compactor ring. Compactors are not stateful in any way, so there is no harm in removing them automatically.
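
A minimal sketch of what this could look like, assuming dskit's ring package and its BasicLifecycler delegate chaining; the surrounding function, the heartbeat-period multiplier, and the ring name/key are illustrative, not Mimir's actual code:

```go
// Package compactor: illustrative wiring of an auto-forget delegate into a
// compactor lifecycler. Assumes github.com/grafana/dskit.
package compactor

import (
	"time"

	"github.com/go-kit/log"
	"github.com/grafana/dskit/kv"
	"github.com/grafana/dskit/ring"
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical: forget an instance after it has missed this many heartbeat periods.
const autoForgetUnhealthyPeriods = 10

func newCompactorLifecycler(
	lifecyclerCfg ring.BasicLifecyclerConfig,
	heartbeatTimeout time.Duration,
	baseDelegate ring.BasicLifecyclerDelegate,
	store kv.Client,
	logger log.Logger,
	reg prometheus.Registerer,
) (*ring.BasicLifecycler, error) {
	// Chain the delegates: leave the ring on clean shutdown, and automatically
	// forget instances that stay unhealthy for too long.
	var delegate ring.BasicLifecyclerDelegate = baseDelegate
	delegate = ring.NewLeaveOnStoppingDelegate(delegate, logger)
	delegate = ring.NewAutoForgetDelegate(autoForgetUnhealthyPeriods*heartbeatTimeout, delegate, logger)

	return ring.NewBasicLifecycler(lifecyclerCfg, "compactor", "compactor", store, delegate, logger, reg)
}
```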

Describe alternatives you've considered

Do nothing - compactors can continue to be manually forgotten

Additional context

stevesg avatar Mar 31 '22 11:03 stevesg

I'm definitely up for adding auto-forget to the compactor.

This means that if a compactor is ever uncleanly shutdown when it does not plan on coming back, a subset of tenants (those assigned to the missing instance) will never be compacted

I got surprised by this. If a compactor is unhealthy in the ring, I was expecting its jobs to be resharded to other compactor replicas. Could you double check this behaviour? There may be a bug somewhere.

pracucci avatar Mar 31 '22 12:03 pracucci

Alternative: have the forgetting done by an operator outside of the compactor ring. This way is easier to coordinate (the operator can run a leader election to ensure there is only one running), and it can take other information into account, e.g. how many compactors are configured in the StatefulSet.

bboreham avatar Mar 31 '22 12:03 bboreham

Please open a related doc issue so that this can be documented for the operator in the core docs.

osg-grafana avatar Apr 01 '22 07:04 osg-grafana

Sorry, I'm using "operator" as a term of art meaning a (Kubernetes) program that automates some functions that historically a person would have done. https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

bboreham avatar Apr 01 '22 07:04 bboreham

I got surprised by this. If a compactor is unhealthy in the ring, I was expecting its jobs to be resharded to other compactor replicas. Could you double check this behaviour? There may be a bug somewhere.

I think it's working as you'd expect - the compactor ring uses a replication factor of 1, because all it's doing is sharding jobs over compactors (it's basically using the ring for election). When an entry goes unhealthy, the tenants do not move. If we add this change, then it would function as you describe - the unhealthy compactors are forgotten, and the tenants will be moved.
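
To illustrate the behaviour described above, here is a minimal, self-contained sketch of the concept (not Mimir's actual ring code): with replication factor 1, each tenant hashes to exactly one ring entry, and that assignment does not change while the entry is unhealthy, only when the entry is removed from the ring.

```go
// Illustration only: tenant ownership via a token ring with replication factor 1.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ringEntry struct {
	instance string
	token    uint32
}

// ownerOf returns the single instance that owns the tenant: the first entry
// whose token is >= the tenant's hash, wrapping around to the first entry.
func ownerOf(tenant string, entries []ringEntry) string {
	h := fnv.New32a()
	h.Write([]byte(tenant))
	key := h.Sum32()

	sort.Slice(entries, func(i, j int) bool { return entries[i].token < entries[j].token })
	i := sort.Search(len(entries), func(i int) bool { return entries[i].token >= key })
	if i == len(entries) {
		i = 0
	}
	return entries[i].instance
}

func main() {
	entries := []ringEntry{
		{"compactor-0", 1 << 30},
		{"compactor-1", 2 << 30},
		{"compactor-2", 3 << 30}, // uncleanly shut down, but still registered
	}

	// The owning instance stays the same whether or not it is healthy;
	// only removing (forgetting) its entry reassigns the tenant.
	fmt.Println(ownerOf("tenant-123", entries))
}
```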

stevesg avatar Apr 01 '22 10:04 stevesg