
Mitigate "machine was allocated without proper switch connections"

Open majst01 opened this issue 5 years ago • 5 comments

There are two possibilities to get into a state where you have a machine that you cannot reach over the network after the allocation:

  1. We register a new switch at the metal-api (metal-core starts for the first time and registers), but machines are already in waiting state (which can happen after a wrong update sequence or broken switch)
  2. You start a machine which has a blade switch in between (like t1-small) where LLDP cannot discover the connections to the leaf switches

In both cases, we cannot find out which switch a machine is connected to.

This can lead to the following failure state:

  • You allocate a machine that is not in the switches' machine connections
  • The machine starts to boot
  • The machine will not be enslaved into a VRF
  • The machine will not be reachable from external networks

Can we prevent this state? It is confusing, because the resulting machines are unusable for the user.

For scenario (1), rebooting the machine restores the switch connections and everything is fine again.


Both problems can be mitigated with an assertion like this: the machine report should fail if two switches are not visible from the machine. This will cause the report to fail more often, and the t1-small servers won't reach the waiting state any more.
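The assertion could look roughly like the sketch below. It is only an illustration: `Neighbor`, `checkNeighbors`, and `requiredSwitches` are hypothetical names, not the metal-hammer's actual types, and the sketch assumes the LLDP neighbors have already been collected into a slice.

```go
package main

import "fmt"

// Neighbor is a hypothetical, simplified view of one LLDP neighbor seen on
// a machine NIC. It is not the metal-hammer's real data model.
type Neighbor struct {
	SwitchID string // identifier of the neighboring switch
	Port     string // port on that switch
}

// requiredSwitches is the assumed redundancy: two leaf switches per machine.
const requiredSwitches = 2

// checkNeighbors returns an error unless the given neighbors belong to at
// least requiredSwitches distinct switches. Two ports on the same switch
// count only once, and neighbors without a switch ID are ignored.
func checkNeighbors(neighbors []Neighbor) error {
	seen := make(map[string]struct{})
	for _, n := range neighbors {
		if n.SwitchID == "" {
			continue
		}
		seen[n.SwitchID] = struct{}{}
	}
	if len(seen) < requiredSwitches {
		return fmt.Errorf("found %d distinct switch(es), need at least %d",
			len(seen), requiredSwitches)
	}
	return nil
}
```

With such a check, a machine behind a blade switch (no LLDP neighbors on the leaf switches) would fail its report instead of entering the waiting state.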


To be honest, it is not so likely to get into this state. The last time this happened was because we updated the metal-core, the metal-api and wiped the rethinkdb. However, it's better for the robustness if we prevent these states anyway as they are possibly easy to prevent.

The problem is: The metal-api does not care if there are two switch connections to the machine or not. It will allow machine allocation without this condition fulfilled. The metal-hammer could actually report some wild stuff about switch neighbors to the metal-api, the api would say "fine" and when you allocate it, you would end up with an unusable machine. And this is what happened: The "machine connections" got lost because we had new switches registered at the api, but the machines behind the switches were already in the waiting state. The metal-api should at least validate if it is actually able to construct a proper switch configuration before allowing machine allocation.
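A minimal sketch of such a validation on the API side follows. The `Switch` type and the functions `switchesConnectedTo` and `validateAllocatable` are hypothetical and much simpler than the real metal-api model; the sketch assumes each switch record stores its machine connections keyed by machine ID.

```go
package main

import "fmt"

// Switch is a hypothetical, reduced switch record: for every machine ID it
// lists the switch ports that machine is attached to.
type Switch struct {
	ID                 string
	MachineConnections map[string][]string
}

// switchesConnectedTo returns the IDs of all switches that have at least
// one recorded connection to the given machine.
func switchesConnectedTo(machineID string, switches []Switch) []string {
	var ids []string
	for _, s := range switches {
		// Reading from a nil map is safe in Go and yields an empty slice.
		if len(s.MachineConnections[machineID]) > 0 {
			ids = append(ids, s.ID)
		}
	}
	return ids
}

// validateAllocatable rejects an allocation unless the machine is known to
// two switches, so that a switch configuration can actually be built.
func validateAllocatable(machineID string, switches []Switch) error {
	ids := switchesConnectedTo(machineID, switches)
	if len(ids) < 2 {
		return fmt.Errorf("machine %s is connected to %d switch(es) %v, need 2",
			machineID, len(ids), ids)
	}
	return nil
}
```

In scenario (1), the newly registered switches have empty machine connections, so `validateAllocatable` would return an error for every machine behind them until the machines re-report their connections.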

--

Ideally, such a machine should not even be able to enter the wait table. Rejecting it would cause the machine to reboot and re-report its connections, and it would keep a user from allocating such a machine.

majst01 avatar Mar 13 '20 08:03 majst01

@Gerrit91 was #31 related to this? I can't remember why. Maybe @mwindower has some helpful input as well.

majst01 avatar Jul 16 '20 13:07 majst01

IMHO we should add a validation of the reported registration data and prevent the metal-hammer from entering the wait phase when, for example, the neighbor condition cannot be verified from the metal-api perspective.

Gerrit91 avatar Jul 16 '20 13:07 Gerrit91

It was not related to #31.

Gerrit91 avatar Jul 16 '20 13:07 Gerrit91

It is related to #31 because connectMachineWithSwitches of the switch service is called during machine registration. With #31 the machine registration with < 2 connections fails.

mwindower avatar Jul 16 '20 16:07 mwindower

Also partly covered by #256.

majst01 avatar Mar 17 '22 13:03 majst01