
Equinix Shepherd: remove Failed machines

Open • q3k opened this issue 2 years ago • 1 comment

Some Equinix machines we provision (currently around 1-10%) get stuck in Failed per the Equinix API after we attempt to provision them.

The shepherd should probably notice this and nuke them from both Equinix and the BMDB.

But this ties into a bigger problem: we need a very safe deletion mechanism. Maybe something like:

To mark a Provided machine for deletion (e.g. when this issue happens):

  1. Check the machine is not user visible per the BMDB.
  2. Mark the Equinix machine with a 'tombstone' tag containing the current timestamp (the Equinix Shepherd should ignore any machines with this tag during reconciliation).
  3. Add a Tombstone tag to BMDB (the Shepherd should also ignore these when sourcing the BMDB).
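
The three marking steps above can be sketched roughly as follows. This is only an illustration over in-memory stand-ins: `BMDBMachine`, `EquinixDevice`, `mark_for_deletion` and `reconcilable` are hypothetical names, and neither the real BMDB schema nor the Equinix API client is modelled here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class BMDBMachine:
    # Hypothetical stand-in for a BMDB record.
    machine_id: str
    user_visible: bool
    tombstoned_at: Optional[datetime] = None


@dataclass
class EquinixDevice:
    # Hypothetical stand-in for an Equinix device and its tags.
    device_id: str
    tags: dict = field(default_factory=dict)


def mark_for_deletion(machine, device, now=None):
    """Tombstone a machine in both systems; refuse if users can see it."""
    now = now or datetime.now(timezone.utc)
    # Step 1: never tombstone a machine that is user visible.
    if machine.user_visible:
        return False
    # Step 2: tag the Equinix device with the current timestamp.
    device.tags["tombstone"] = now.isoformat()
    # Step 3: record the tombstone on the BMDB side as well.
    machine.tombstoned_at = now
    return True


def reconcilable(devices):
    """Devices the Shepherd should still consider during reconciliation."""
    return [d for d in devices if "tombstone" not in d.tags]
```

Storing the timestamp in the tag itself means the Reaper can later decide whether the tombstone is old enough without consulting any extra state.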

Then, a separate Reaper process should:

Loop A (“and stay dead!”):

  1. Go through all BMDB Tombstone machines;
  2. Make sure each BMDB Tombstone machine has a Tombstone tag in Equinix; and
  3. Make sure each BMDB Tombstone machine has its corresponding Equinix device powered off.
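
One pass of Loop A might look something like the sketch below. It assumes both sides can be keyed by a shared machine ID and uses a hypothetical in-memory `Device` class; `enforce_tombstones` is an illustrative name, not an existing function.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    # Hypothetical in-memory model of an Equinix device.
    tags: dict = field(default_factory=dict)
    powered_on: bool = True


def enforce_tombstones(bmdb_tombstones, equinix):
    """One pass of Loop A.

    bmdb_tombstones maps machine ID -> tombstone timestamp string (BMDB side).
    equinix maps machine ID -> Device (provider side).
    Returns the IDs whose Equinix state had to be corrected.
    """
    corrected = []
    for machine_id, stamp in bmdb_tombstones.items():
        device = equinix.get(machine_id)
        if device is None:
            continue  # no device at the provider, nothing to enforce
        changed = False
        # Re-apply the tag if something removed it.
        if "tombstone" not in device.tags:
            device.tags["tombstone"] = stamp
            changed = True
        # Power the device off if something turned it back on.
        if device.powered_on:
            device.powered_on = False
            changed = True
        if changed:
            corrected.append(machine_id)
    return corrected
```

The pass is idempotent: running it again right away corrects nothing, which is what makes it safe to run in a loop.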

Loop B (“time to go”):

  1. Go through all BMDB Tombstone machines;
  2. Remove the machine from BMDB if the Tombstone tag has been set for long enough; then
  3. Remove the corresponding Equinix device.
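
A minimal sketch of Loop B, again over in-memory stand-ins. The seven-day `GRACE` value is purely an assumption (the proposal only says "long enough"), and `reap` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

# Assumed grace period; the issue only says "long enough".
GRACE = timedelta(days=7)


def reap(bmdb_tombstones, equinix_devices, now, grace=GRACE):
    """One pass of Loop B.

    bmdb_tombstones: dict of machine ID -> datetime the tombstone was set.
    equinix_devices: set of device IDs that still exist at the provider.
    Returns the IDs removed in this pass.
    """
    removed = []
    for machine_id, set_at in list(bmdb_tombstones.items()):
        if now - set_at < grace:
            continue  # tombstone too fresh, leave it alone
        # BMDB first, Equinix second, matching the order in the proposal.
        # A crash between the two leaves a tombstoned device without a
        # BMDB record, which is the leftover that Loop C alerts on.
        del bmdb_tombstones[machine_id]
        equinix_devices.discard(machine_id)
        removed.append(machine_id)
    return removed
```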

Loop C (“did we forget something”):

  1. Find any Equinix devices with Tombstone tags that do not have corresponding BMDB devices and alert on them (as it means we dropped something during loop B, and that should be manually investigated, at least for now).
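
The Loop C check amounts to a set difference. A rough sketch, assuming device and machine IDs are directly comparable; `find_orphans` is a hypothetical helper and the alerting itself is left out:

```python
def find_orphans(equinix_tags, bmdb_ids):
    """Return IDs of tombstoned Equinix devices unknown to the BMDB.

    equinix_tags: dict of device ID -> dict of tags at the provider.
    bmdb_ids: set of machine IDs currently present in the BMDB.
    """
    return sorted(
        device_id
        for device_id, tags in equinix_tags.items()
        if "tombstone" in tags and device_id not in bmdb_ids
    )
```

Anything this returns would be handed to a human rather than deleted automatically, as the proposal suggests.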

q3k, Feb 14 '23 20:02

We should skip the badbadnotgood state when the only machines inside the provider are ones in the Failed state, even when their BMDB entry is missing.

fionera, Jun 15 '23 13:06