cruise-control icon indicating copy to clipboard operation
cruise-control copied to clipboard

Self-healing can forever "miss" broker failures that occur while it is disabled

Open mgrubent opened this issue 4 years ago • 2 comments

Background

cruise-control provides self-healing for a variety of anomalies.

When an anomaly is detected, and the corresponding self-healing boolean is enabled, cruise-control can generate a proposal on its own to correct the anomaly, and then execute that proposal.

When these anomalies happen while the corresponding self-healing boolean is disabled, cruise-control (as-expected) does not generate proposals to correct the anomaly.

However, when an anomaly (e.g. a broker failure) occurs while self-healing is disabled, that same anomaly event will not cause cruise-control to generate a proposal when self-healing is later enabled.

Problem

Anomalies that occur while self-healing are disabled can essentially be "lost" to cruise-control, and some other actor has to tell cruise-control affirmatively to act. Re-enabling self-healing is insufficient to cause cruise-control to heal the anomaly.

Proposal

When cruise-control's anomaly-detection is re-enabled, any outstanding (i.e. not stale) anomalies should be acted on.

mgrubent avatar Oct 29 '20 21:10 mgrubent

The issue description is a little off. This is only relevant for a broker_failure anomaly, under the following conditions:

  • A broker fails at time t,
  • Anomaly detector starts its first grace period before sending a notification at time t' > t,
  • At time t', anomaly detector notifier sends a notification regarding broker failure (this is independent of whether self-healing is enabled or not),
  • Anomaly detector starts its second grace period before attempting to start self-healing at t" > t',
  • At time t", if self-healing is enabled, anomaly detector starts self-healing, otherwise ignores the broker failure,
  • If self-healing was disabled at time t", but later enabled at time t"', this will not make CC start fixing the broker failure that was ignored at time t".

Ideally, CC broker failure self-healing should act on a broker failure that happened before self-healing was enabled.

efeg avatar Oct 29 '20 21:10 efeg

Ah; will update the title

mgrubent avatar Oct 29 '20 21:10 mgrubent