
When shutting down a node and bringing it back up, KubeCF will not restart apps when using Eirini.

gaktive opened this issue on Nov 04 '20 · 10 comments

Describe the bug
Placeholder until we get @satadruroy to confirm what @troytop spotted: when a Kubernetes node running KubeCF with Eirini goes down and comes back up, all apps associated with that node have to be started manually. This should happen automatically, as it does with Diego.

To Reproduce

  • Deploy KubeCF with Eirini in an HA Kubernetes environment
  • Take down one node and bring it back up
  • Observe the app state.

Expected behavior
When a node goes down and comes back up, apps come back automatically.

gaktive · Nov 04 '20

I was able to repro this, sort of. My setup: AKS with 3 nodes, Eirini HA deployment.

  1. Deploy app with 3 instances.
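For reference, a minimal sketch of this step, with test-app as a hypothetical app name:

# Push the app with three instances (one per worker node, if spreading works out)
cf push test-app -i 3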

cf apps

     state     since                  cpu    memory         disk         details
#0   running   2020-11-17T00:23:11Z   0.3%   264.6M of 1G   160K of 1G
#1   running   2020-11-17T00:23:11Z   0.2%   274.9M of 1G   160K of 1G
#2   running   2020-11-17T00:23:12Z   0.2%   267.2M of 1G   160K of 1G

Check to make sure the pods with the apps are scheduled on 3 different nodes.
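One way to verify the spread, assuming KubeCF's Eirini app pods land in the eirini namespace (adjust to your install):

# The NODE column shows where each app instance pod was scheduled
kubectl get pods -n eirini -o wide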

  2. Stop one of the nodes. (AKS does not detect the node shutdown, so it does not automatically bring up a replacement.)
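For availability-set-based AKS nodes this can be done with az vm (the resource group and VM names below are hypothetical; VMSS-based node pools use az vmss deallocate instead):

# Deallocate one worker VM; AKS will not replace it automatically
az vm deallocate --resource-group MC_myrg_mycluster_eastus --name aks-nodepool1-12345678-0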

cf apps output snippet:

     state     since                  cpu    memory         disk         details
#0   running   2020-11-17T00:23:11Z   0.0%   0 of 1G        0 of 1G
#1   running   2020-11-17T00:23:11Z   0.2%   275M of 1G     160K of 1G
#2   running   2020-11-17T00:23:12Z   0.2%   267.4M of 1G   160K of 1G

The output for #0 shows running whereas in reality the instance is down, because its node is stopped.

  3. Restart the node and wait for it to complete...

cf apps output snippet:
     state     since                  cpu    memory         disk         details
#0   crashed   2020-11-17T00:23:12Z   0.0%   0 of 1G        0 of 1G
#1   running   2020-11-17T00:23:12Z   0.2%   275.2M of 1G   160K of 1G
#2   running   2020-11-17T00:23:13Z   0.2%   267.6M of 1G   160K of 1G

... even though the pods are back up and running.

[screenshot: pods back up and running]

The events-reporter in the eirini-events namespace shows the following error:

[screenshot: events-reporter error]
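To read the error directly instead of from a dashboard, something like this should work (the pod name is a placeholder; exact resource names vary by release):

kubectl get pods -n eirini-events
kubectl logs -n eirini-events <event-reporter-pod>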

But eventually it recovers on its own.

     state     since                  cpu    memory         disk         details
#0   running   2020-11-17T01:36:41Z   0.3%   319.9M of 1G   164K of 1G
#1   running   2020-11-17T00:23:12Z   0.2%   275.4M of 1G   160K of 1G
#2   running   2020-11-17T00:23:13Z   0.2%   267.6M of 1G   160K of 1G

So this could just be Eirini being slow to catch up with the node status.

However, @troytop mentioned this doesn't recover on CaaSP without a manual app restart. @viccuad or @svollath do you mind trying a repro of this on CaaSP?

satadruroy · Nov 17 '20

ping @jimmykarily

viovanov · Nov 19 '20

Other than the event reporter error, everything else seems to be fine? I mean, the app instance appears as crashed at some point and eventually recovers. Regarding the event reporter error, I remember seeing that before even when nothing else seemed wrong. I think the first instance of the app is missing the app index at the end of the pod name (-0 etc). Should it always be there, @cloudfoundry-incubator/eirini?

jimmykarily · Nov 20 '20

Created a story to investigate the event reporter errors here: https://www.pivotaltracker.com/story/show/175814747

The rest of this issue may be irrelevant though.

jimmykarily · Nov 20 '20

Hi. Here are the Eirini team's findings from the above story:

The event-reporter was changed in eirini-1.9 and it now only listens to updates on pods labelled with cloudfoundry.org/source_type: APP. The error you saw was probably to do with a staging pod. The event reporter will ignore these now.

We've experimented with deleting a k8s node. Eirini actually isn't involved in recreating any apps; that is purely a k8s concern. We see k8s successfully rescheduling lost pods on the remaining nodes as soon as it becomes aware that the deleted node is gone.
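For context on the timing: after a node stops responding, Kubernetes taints it as not-ready/unreachable, and pods are evicted only once the default toleration window (300s, added by the DefaultTolerationSeconds admission plugin) expires. A sketch for observing this, with node-1 as a hypothetical node name:

# Inspect the taints Kubernetes put on the lost node
kubectl get node node-1 -o jsonpath='{.spec.taints}'
# Watch pods get evicted and rescheduled once the toleration window expires
kubectl get pods -n eirini -o wide --watch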

So it looks like everything is behaving correctly, and there's nothing for us to do in eirini for this.

kieron-dev · Nov 26 '20

The event-reporter was changed in eirini-1.9 and it now only listens to updates on [...]

Just pointing out that kubecf is still using Eirini-1.8, in case that is relevant.

jandubois · Nov 26 '20

Reproduced on CaaSP 4.5.1 (tf4 machine, 3 worker nodes). Everything seems fine from the CAP and Eirini side.

Hard-powering a node off means that CaaSP marks the node as NotReady and marks its pods as Terminating, yet it never finishes terminating them, so they don't get moved. Depending on your luck, a CAP HA deployment may survive. Cordoning the node after losing it makes no difference. Cordoning and draining the node instead of hard-powering it off moves the pods to other nodes correctly.
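When pods are wedged in Terminating like that, a common recovery sketch (pod, node, and namespace names are placeholders):

# Drain evicts pods cleanly, but only works while the node still responds
kubectl drain node-1 --ignore-daemonsets --delete-local-data
# If the node is already gone, force-delete the stuck pods so they reschedule
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force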

viccuad · Dec 04 '20

Looks like a Kubernetes config/distro issue rather than a KubeCF one. We can provide docs advice on how to recover from a hard shutdown (i.e. disaster recovery). We need to tell people in the CAP docs (probably beyond the scope of the KubeCF docs) how to replace or remove missing Kubernetes nodes so that KubeCF recovers properly. If this can be automated (e.g. with the CCM or external Kubernetes monitoring), that should be mentioned with a reference or link to supporting information.

troytop · Dec 04 '20

If we kubectl delete the node, the workload should move over to a healthy node.
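For example (node name hypothetical):

# Removing the Node object lets its pods be rescheduled immediately
kubectl delete node node-1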

@viccuad, once you brought the node back up, did the apps recover?

satadruroy · Dec 04 '20

Yes, I replicated it and they do.

viccuad · Dec 07 '20