
Machine controller manager does not freeze completely when the API server is down

plkokanov opened this issue 3 years ago • 11 comments

How to categorize this issue?

/area control-plane-migration /kind bug /priority 3

What happened: After performing a "bad case" control plane migration, the MCM that was still running in the shoot's control plane on the source seed was acting upon the shoot's nodes and removing them, even though the shoot's API server in the source seed was down. Previously, when the kube-apiserver was down, MCM would enter a frozen state and do nothing to the nodes of the shoot cluster, but it seems this did not happen here.
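For context: the frozen state comes from MCM's safety controller periodically probing the shoot API server (the GET on a dummy node visible in the logs below) and freezing machine operations when the probe keeps failing. Below is a minimal sketch of that probe-and-freeze idea only; the package, function names and the plain boolean flag are illustrative assumptions, not the actual MCM implementation.

```go
package safety

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// probeAPIServer GETs a node that is not expected to exist ("dummy_name", as in
// the logs below). A NotFound response still proves the API server is reachable;
// a transport error means it is not, and machine operations should freeze.
func probeAPIServer(ctx context.Context, client kubernetes.Interface, frozen *bool) {
	_, err := client.CoreV1().Nodes().Get(ctx, "dummy_name", metav1.GetOptions{})
	if err == nil || apierrors.IsNotFound(err) {
		*frozen = false // API server reachable: stay unfrozen / unfreeze
		return
	}
	*frozen = true // API server unreachable: freeze machine operations
	log.Printf("SafetyController: unable to GET node objects: %v", err)
}
```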

The bad case scenario was as follows:

  1. Create a shoot on an aws seed cluster.
  2. Deploy a network ACL which blocks all access to and from the seed cluster. This essentially simulates an AZ failure and renders the control planes of the shoot clusters deployed on that seed useless.
  3. Trigger control plane migration by changing the shoot.spec.seedName to a new working seed. Initially this will do nothing.
  4. Annotate the shoot with shoot.gardener.cloud/force-restore=true which immediately starts reconciling the shoot onto the destination seed, but leaves the control plane resources in the source seed untouched.
  5. Migration finishes successfully and there are no problems with the control plane on the destination seed. There are also no problems with the nodes: the shoot's workload is not restarted and the nodes are not recreated.
  6. Inside the source seed, the shoot's etcd-backup-restore detects that it is no longer running in the seed that should host the shoot's control plane and restarts itself, then becomes NotReady, as expected.
  7. This brings down the API server, and KCM, CA and CCM stop working.
  8. Remove the ACL rule so that the source seed recovers.
  9. The MCM in the source seed cluster continues to work and brings down the shoot's nodes, even though they should now be managed only by the MCM in the destination seed. The nodes seem to be deleted every ~20 minutes. I could see the following logs from aws-machine-controller-manager after backup-restore is restarted and the API server becomes stuck in a CrashLoopBackOff:
2022-05-10 07:13:21 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:14:51 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:17:51 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:19:21 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:20:51 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:22:10 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:22:11 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:22:12 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:22:14 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:22:21 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:23:51 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:31:22 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:32:10 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:32:11 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:32:12 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:32:52 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:35:52 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:37:22 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:40:22 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:41:52 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:168"}
2022-05-10 07:42:10 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:42:11 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:42:12 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:42:14 | {"log":"Machine controller has frozen. Retrying reconcile after resync period","pid":"1","severity":"ERR","source":"machine.go:123"}
2022-05-10 07:43:00 | {"log":"reconcileClusterMachineSafetyOrphanVMs: Start","pid":"1","severity":"INFO","source":"machine_safety.go:55"}
2022-05-10 07:43:00 | {"log":"List machines request has been recieved for \"<shoot-name>-connectivity-z3-101a1\"","pid":"1","severity":"INFO","source":"core.go:399"}
2022-05-10 07:43:00 | {"log":"List machines request has been processed successfully","pid":"1","severity":"INFO","source":"core.go:469"}
2022-05-10 07:43:00 | {"log":"Machine deletion request has been recieved for \"<shoot-name>-default-z2-6bbdd-gh26r\"","pid":"1","severity":"INFO","source":"core.go:270"}
2022-05-10 07:43:00 | {"log":"VM \"aws:///eu-central-1/i-0df9f34482c6867f5\" for Machine \"<shoot-name>-default-z2-6bbdd-gh26r\" was terminated succesfully","pid":"1","severity":"INFO","source":"core.go:296"}
2022-05-10 07:43:00 | {"log":"Machine deletion request has been processed for \"<shoot-name>-default-z2-6bbdd-gh26r\"","pid":"1","severity":"INFO","source":"core.go:322"}
2022-05-10 07:43:00 | {"log":"SafetyController: Orphan VM found and terminated VM: <shoot-name>-default-z2-6bbdd-gh26r, aws:///eu-central-1/i-0df9f34482c6867f5","pid":"1","severity":"INFO","source":"machine_safety.go:300"}
2022-05-10 07:43:00 | {"log":"Machine deletion request has been recieved for \"<shoot-name>-default-z3-548dd-c9k46\"","pid":"1","severity":"INFO","source":"core.go:270"}
2022-05-10 07:43:00 | {"log":"VM \"aws:///eu-central-1/i-01ffee83f1a6ec9f6\" for Machine \"<shoot-name>-default-z3-548dd-c9k46\" was terminated succesfully","pid":"1","severity":"INFO","source":"core.go:296"}
2022-05-10 07:43:00 | {"log":"Machine deletion request has been processed for \"<shoot-name>-default-z3-548dd-c9k46\"","pid":"1","severity":"INFO","source":"core.go:322"}
2022-05-10 07:43:00 | {"log":"SafetyController: Orphan VM found and terminated VM: <shoot-name>-default-z3-548dd-c9k46, aws:///eu-central-1/i-01ffee83f1a6ec9f6","pid":"1","severity":"INFO","source":"machine_safety.go:300"}
2022-05-10 07:43:00 | {"log":"Machine deletion request has been recieved for \"<shoot-name>-edge-z3-7d574-tlmvq\"","pid":"1","severity":"INFO","source":"core.go:270"}
2022-05-10 07:43:00 | {"log":"VM \"aws:///eu-central-1/i-0a8a383a9b8f173aa\" for Machine \"<shoot-name>-edge-z3-7d574-tlmvq\" was terminated succesfully","pid":"1","severity":"INFO","source":"core.go:296"}

and from aws-machine-controller-manager

2022-05-10 07:13:21 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-connectivity-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-l-vsmp-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-default-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-connectivity-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-edge-z3\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-edge-z1\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-default-z2\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-free-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-free-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-default-z1\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-connectivity-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-edge-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-s-vsmp-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-l-vsmp-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-s-vsmp-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-s-vsmp-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-free-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-l-vsmp-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-vsmp-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-vsmp-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:33 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-vsmp-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:33 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:21:33 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:25:21 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-connectivity-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-default-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-connectivity-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-l-vsmp-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-edge-z3\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-edge-z1\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-default-z2\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-free-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-free-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-default-z1\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-connectivity-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-edge-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-l-vsmp-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-s-vsmp-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-s-vsmp-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-s-vsmp-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-free-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-l-vsmp-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-vsmp-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-vsmp-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:33 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-vsmp-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:33 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:31:33 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:34:22 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}
2022-05-10 07:35:52 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}
2022-05-10 07:37:22 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}
2022-05-10 07:38:52 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}
2022-05-10 07:40:22 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-connectivity-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-connectivity-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-default-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-l-vsmp-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-edge-z3\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-default-z2\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-edge-z1\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-free-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-free-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-default-z1\" (with replicas 1)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-connectivity-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-edge-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-s-vsmp-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-l-vsmp-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-s-vsmp-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-free-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-s-vsmp-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-l-vsmp-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-vsmp-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:32 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-vsmp-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:33 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-vsmp-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:33 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-z1\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:33 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-z2\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:41:33 | {"log":"Processing the machinedeployment \"<shoot-name>-hana-z3\" (with replicas 0)","pid":"1","severity":"INFO","source":"deployment.go:450"}
2022-05-10 07:43:22 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}
2022-05-10 07:46:22 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}
2022-05-10 07:47:52 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}
2022-05-10 07:49:22 | {"log":"SafetyController: Unable to GET on node objects Get \"https://kube-apiserver/api/v1/nodes/dummy_name?timeout=1m0s\": dial tcp 10.243.84.247:443: i/o timeout","pid":"1","severity":"ERR","source":"machine_safety.go:197"}

The API server was still not up at around 11:00. After I restarted the MCM in the source seed, it stopped trying to remove the nodes, but only because it could not start its watches (as the shoot's API server was down).

What you expected to happen: The MCM located in the source seed should not do anything.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration: aws (but probably not limited only to that)
  • Others:

plkokanov, May 10 '22 11:05

This is happening because we don't stop orphan machine deletion when the machine controller is frozen. I think we should stop it in that case as well. Will raise a PR for this.
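A minimal sketch of what such a change could look like; the safetyController type, its function fields, and the freeze check are illustrative assumptions, not the actual code from the PR.

```go
package safety

import "log"

type safetyController struct {
	machineControllerFrozen func() bool              // hypothetical: reports the freeze state (API server unreachable)
	listOrphanVMs           func() ([]string, error) // hypothetical provider call: VMs without a Machine object
	deleteVM                func(id string) error    // hypothetical provider call: terminate a VM
}

func (c *safetyController) reconcileOrphanVMs() error {
	// Proposed behaviour: while the machine controller is frozen because the
	// shoot API server is unreachable, skip orphan collection entirely, since
	// the VMs may legitimately belong to another (migrated) control plane.
	if c.machineControllerFrozen() {
		log.Println("SafetyController: machine controller is frozen, skipping orphan VM collection")
		return nil
	}
	vms, err := c.listOrphanVMs()
	if err != nil {
		return err
	}
	for _, id := range vms {
		if err := c.deleteVM(id); err != nil {
			return err
		}
		log.Printf("SafetyController: orphan VM %s terminated", id)
	}
	return nil
}
```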

himanshu-kun, May 13 '22 06:05

cc @vlerenc @unmarshall

himanshu-kun, May 18 '22 09:05

@plkokanov as per https://github.com/gardener/machine-controller-manager/pull/722#discussion_r877749122 we could still have a situation where orphan collection happens even after the fix done in PR #722. So I would like to know if that risk is acceptable. If not, one other way would be to move the machine objects only after shutting down MCM on the old seed, so this kind of situation doesn't arise.

himanshu-kun, May 20 '22 06:05

If not , one other way could be to move the machine objects after shutting down MCM on old seed, so this kind of situation doesn't arise.

We already do that in the case where the control plane of the source seed is reachable and the gardenlet there is still functioning properly. This is the so-called "good case" scenario of control plane migration.

However, for the case where the gardenlet in the source seed has stopped working, or there is no way to talk to the control plane components of the shoot cluster even though they might still be operational (the "bad case" scenario), there is no way to move the machines out or to shut down MCM. In that case we rely on special logic in etcd to shut itself down, which also brings the kube-apiserver down.

So I would like to know if that risk is acceptable.

I will have to think about it a little bit. Generally, "bad case" control plane migration will be used only in very rare cases, so right now we want to avoid introducing too much additional complexity around it.

plkokanov, May 20 '22 07:05

Following your explanation, in a bad case where the control plane is not reachable, couldn't we leave the machine objects in the old seed intact and just shut down etcd (so the old MCM wouldn't be able to work due to the freeze), then start with the new control plane and, whenever the seed comes online, delete the remaining resources? Is there a mechanism like this where one could tell if a particular control plane is under migration?

himanshu-kun, May 23 '22 06:05

Following your explanation, in a bad case where the control plane is not reachable, couldn't we leave the machine objects in the old seed intact and just shut down etcd (so the old MCM wouldn't be able to work due to the freeze)

That is exactly what we do, but as I mentioned it doesn't freeze completely (without the PR that you have opened), so if the machines in the destination seed are rolled out for whatever reason, or new ones are created, the MCM in the source seed will detect the new machines as orphaned and delete them.

and start with the new control plane and then whenever the seed comes online we could delete the remaining resources. Is there a mechanism like this where one could tell if a particular control plane is under migration?

Currently working on it :). But even after the source seed comes back up, the shoot's etcd in the control plane will still be down, and we expect the control plane to be completely deactivated. We did not want to rely on the mechanism that cleans up the control plane in the source seed to prevent split-brain scenarios.

plkokanov, May 23 '22 06:05

Can we please have a meeting? I thought we discussed good/bad case scenarios and made a very clear decision:

  • The gardenlet talks to the garden cluster. That's where we take the information on whether a shoot shall be hosted on that seed or not (primary data source). In addition, or if the connection is broken, the gardenlet shall run the DNS check instead as a fall-back (secondary data source), because DNS is more stable than almost anything else in networking.
  • We support only the good case, with the small tweak that the DNS record is also checked as a fall-back.
  • Only when it is 100% clear that a seed is no longer responsible (via the primary or secondary data source/probe) do we shut down the control plane.
  • The bad case is gone. We decided it's not worth (a) the complexity and (b) the risk of prematurely shutting down a healthy cluster. That was the main motivation behind that decision some months back, after the melt-down question was brought up. We argued (this list may not be complete, but I'll try to recall the arguments):
    • Too complex
    • Too risky (melt-down)
    • With HA, DR will be much less likely to ever be needed in an emergency
    • With HA, HA will already save us
    • If a seed is really unavailable, then the likelihood is high that it's also cut off from networking, if it wasn't destroyed by the operators (an operational problem), in which case there is nothing able to interfere or, rather, nothing running anymore to interfere

The rest is "risk" that can be mitigated by actively rotating/revoking credentials manually should the operator ever be in that unfortunate situation.

Therefore, how come we still discuss/test the bad case? Can we please have a meeting?

vlerenc, May 23 '22 07:05

It all started with an original question:

The simple fix proposed to tackle the issue of orphan-collecting machines during control plane migration still has a gap that allows orphan collection, as mentioned in #722 (comment).

If that risk is not acceptable, then we just change the solution by making it a bit more involved. I see that @plkokanov has already answered that the risk is acceptable. @himanshu-kun, do we need any further clarifications before we merge the PR?

unmarshall, May 23 '22 08:05

so if the machines in the destination seed are rolled out for whatever reason, or new ones are created, the MCM in the source seed will detect the new machines as orphaned and delete them.

Oh yeah, forgot that case. Then it all boils down to whether the risk is acceptable or not, as mentioned in @unmarshall's comment here. No further clarifications needed from my side.

@vlerenc I hope nothing else is needed from the MCM side here.

himanshu-kun, May 23 '22 08:05

@himanshu-kun Except that I don't know whether disabling orphan machine deletion is a good idea for frozen machine sets. The reason we originally left it running is to clean up piling-up resources in case something in the infrastructure goes wrong, so as not to hit a quota issue or incur a huge bill. Have we now disabled that safeguard, or are we doing nothing / leaving it running?

The way to stop MCM in the good case is to shut down the control plane. Either directly or indirectly (via DWD), MCM will stop working (including orphan machine deletion). Is that wrong?

vlerenc, May 23 '22 09:05

@vlerenc I initially opened this issue because I thought that disabling orphaned machine deletion was the expected behaviour when MCM cannot talk to the API server (at least when we discussed it back in the day, the idea was that MCM would not do anything in that case), but maybe something has changed in the meantime. What you outlined in https://github.com/gardener/machine-controller-manager/issues/718#issuecomment-1134310725 still holds, and as I mentioned above, I do not want to add additional complexity or introduce new behaviour that might hurt us in the end.

If this change could introduce undesired behaviour, then we can close it.

As for the risk that @unmarshall noticed, I mentioned in the PR: https://github.com/gardener/machine-controller-manager/pull/722#discussion_r879076290 that it is acceptable imo.

plkokanov, May 23 '22 09:05

We discussed a solution internally as part of a grooming exercise:

  • Each MCM will add a managed-by: <unique name of that MCM> tag to the VM when it creates it.
  • At the same time, it will also add this tag to every other VM for which it has a machine object present in its control cluster's etcd.
  • The orphan collection logic would then consider this tag, and each MCM could only orphan-collect VMs carrying the above tag with its (the MCM's) unique name.

In the context of the issue above, the source seed's MCM would then not add any tag to the new VMs created by the target seed's MCM, and would never orphan-collect them.
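A minimal sketch of how such tag-based filtering could look during orphan collection; the tag key, the vm type, and the hasMachineObject helper are hypothetical illustrations of the proposal, not an implemented MCM API.

```go
package safety

const managedByTag = "managed-by" // hypothetical tag key claimed by each MCM instance

type vm struct {
	ID   string
	Tags map[string]string
}

// orphanVMsFor returns only those VMs that this MCM instance previously claimed
// via the managed-by tag and that no longer have a Machine object, so an MCM
// from another seed never collects VMs it does not own.
func orphanVMsFor(mcmName string, vms []vm, hasMachineObject func(vmID string) bool) []vm {
	var orphans []vm
	for _, v := range vms {
		if v.Tags[managedByTag] != mcmName {
			continue // owned by a different MCM (e.g. the target seed) or untagged: leave it alone
		}
		if hasMachineObject(v.ID) {
			continue // still backed by a Machine object: not an orphan
		}
		orphans = append(orphans, v)
	}
	return orphans
}
```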

himanshu-kun, Feb 16 '23 11:02

Sorry for the late update, but this issue triggered some internal discussions on the future of the "bad case" control plane migration in light of the HA topic.

In the end, we decided that the "bad case" control plane migration is most likely never going to be used because once the option is available to switch to HA control planes, users should be instructed to use that if they require the highest possible degree of availability.

As part of the removal, we have already started removing the "owner-checks" from gardener, extensions and the etcd backup-restore: https://github.com/gardener/gardener/issues/6302. This means that there is also no need to introduce additional complexity to MCM if it would only be required for the "bad case" migration.

plkokanov, Feb 16 '23 11:02

/close For now we'll keep running the orphan collection even when the machine controller is frozen.

himanshu-kun, Feb 22 '23 07:02