Equinix Shepherd: remove Failed machines
Some Equinix machines we provision (currently around 1-10%) get stuck in the Failed state, as reported by the Equinix API, after we attempt to provision them.
The shepherd should probably notice this and nuke them from both Equinix and the BMDB.
But this ties into a bigger problem: we need a very safe deletion mechanism. Maybe something like:
To mark a Provided machine for deletion (e.g. when this issue happens):
- Check the machine is not user visible per the BMDB.
- Mark the Equinix machine with a 'tombstone' tag containing the current timestamp (the Equinix Shepherd should ignore any machines with this tag during reconciliation).
- Add a Tombstone tag to the machine in the BMDB (the Shepherd should also ignore these when sourcing the BMDB).
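
The marking steps above could be sketched roughly as follows. This is only an illustration over in-memory stand-ins: `BMDBMachine`, `EquinixDevice`, `mark_for_deletion` and the `tombstone:<timestamp>` tag format are all hypothetical names, not the real BMDB schema or Equinix API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical tag format: "tombstone:<ISO 8601 timestamp>".
TOMBSTONE_PREFIX = "tombstone:"


@dataclass
class BMDBMachine:
    """Stand-in for a BMDB record (not the real schema)."""
    machine_id: str
    user_visible: bool
    tombstoned_at: Optional[datetime] = None


@dataclass
class EquinixDevice:
    """Stand-in for an Equinix device (not the real API object)."""
    device_id: str
    tags: List[str] = field(default_factory=list)


def mark_for_deletion(machine: BMDBMachine, device: EquinixDevice,
                      now: Optional[datetime] = None) -> bool:
    """Tombstone a machine in both systems. Refuses user-visible machines."""
    if machine.user_visible:
        return False
    now = now or datetime.now(timezone.utc)
    # Tag the Equinix side first, then the BMDB, as in the list above.
    if not any(t.startswith(TOMBSTONE_PREFIX) for t in device.tags):
        device.tags.append(TOMBSTONE_PREFIX + now.isoformat())
    if machine.tombstoned_at is None:
        machine.tombstoned_at = now
    return True
```

Both writes are guarded so that calling the function twice does not reset the timestamp, which matters if the time-based removal in Loop B relies on it.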
Then, a separate Reaper process should:
Loop A (“and stay dead!”):
- Go through all BMDB Tombstone machines;
- Make sure each BMDB Tombstone machine has a Tombstone tag in Equinix; and
- Make sure each BMDB Tombstone machine has its corresponding Equinix device powered off
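
A possible shape for Loop A, again over in-memory stand-ins. `enforce_tombstones`, the `EquinixDevice` class and the tag format are assumptions made for illustration, and the sketch assumes BMDB machines and Equinix devices share an ID, which may not hold in practice.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List

TOMBSTONE_PREFIX = "tombstone:"  # hypothetical tag format


@dataclass
class EquinixDevice:
    """Stand-in for an Equinix device (not the real API object)."""
    device_id: str
    tags: List[str] = field(default_factory=list)
    powered_on: bool = True


def enforce_tombstones(bmdb_tombstones: Dict[str, datetime],
                       devices: Dict[str, EquinixDevice]) -> List[str]:
    """Re-assert the tag and power-off state; return IDs that needed fixing."""
    corrected = []
    for device_id, tombstoned_at in bmdb_tombstones.items():
        device = devices.get(device_id)
        if device is None:
            continue  # Device already gone from Equinix; nothing to enforce.
        changed = False
        if not any(t.startswith(TOMBSTONE_PREFIX) for t in device.tags):
            # Restore the tag using the timestamp recorded in the BMDB.
            device.tags.append(TOMBSTONE_PREFIX + tombstoned_at.isoformat())
            changed = True
        if device.powered_on:
            device.powered_on = False
            changed = True
        if changed:
            corrected.append(device_id)
    return corrected
```

Returning the corrected IDs gives the Reaper something to log or export as a metric, since a device that keeps coming back to life is probably worth knowing about.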
Loop B (“time to go”):
- Go through all BMDB Tombstone machines;
- Remove the machine from BMDB if the Tombstone tag has been set for long enough; then
- Remove the corresponding Equinix device
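
Loop B might look something like the sketch below. The 24-hour `GRACE_PERIOD` is an arbitrary placeholder for "long enough", `reap_expired` is a hypothetical name, and both systems are modelled as plain collections keyed by a shared ID.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set

# Placeholder value; the issue only says "long enough".
GRACE_PERIOD = timedelta(hours=24)


def reap_expired(bmdb_tombstones: Dict[str, datetime],
                 equinix_devices: Set[str],
                 now: datetime,
                 grace: timedelta = GRACE_PERIOD) -> List[str]:
    """Delete machines whose tombstone is older than the grace period."""
    reaped = []
    # Iterate over a copy because entries are deleted during the loop.
    for machine_id, tombstoned_at in list(bmdb_tombstones.items()):
        if now - tombstoned_at < grace:
            continue  # Not dead for long enough yet.
        # BMDB first, then Equinix, matching the order in the list above.
        del bmdb_tombstones[machine_id]
        equinix_devices.discard(machine_id)
        reaped.append(machine_id)
    return reaped
```

With this ordering, a crash between the two deletions leaves a tombstoned Equinix device with no BMDB record, which is exactly the case Loop C is meant to catch.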
Loop C (“did we forget something”):
- Find any Equinix devices with Tombstone tags that do not have corresponding BMDB devices and alert on them (as it means we dropped something during loop B, and that should be manually investigated, at least for now).
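
The check in Loop C amounts to a set difference. A minimal sketch, assuming devices are given as a mapping from device ID to tag list and that the tombstone tag uses a hypothetical `tombstone:` prefix:

```python
from typing import Dict, List, Set

TOMBSTONE_PREFIX = "tombstone:"  # hypothetical tag format


def find_orphaned_tombstones(equinix_tags: Dict[str, List[str]],
                             bmdb_ids: Set[str]) -> List[str]:
    """Return tombstoned Equinix device IDs that have no BMDB record."""
    return sorted(
        device_id
        for device_id, tags in equinix_tags.items()
        if any(t.startswith(TOMBSTONE_PREFIX) for t in tags)
        and device_id not in bmdb_ids
    )
```

Anything this returns would be alerted on rather than deleted automatically, in line with the "manually investigated, at least for now" note above.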
We should skip the badbadnotgood state when the only machines inside the provider are Failed ones, even if their BMDB entry is missing.