Daemonset plugin launched during cluster scale down

Open johnSchnake opened this issue 2 years ago • 1 comments

See comment: Originally posted by @jcstanaway in https://github.com/vmware-tanzu/sonobuoy/issues/1682#issuecomment-1120055430

TL;DR: if a cluster has N nodes when Sonobuoy is launched but is in the process of scaling down, Sonobuoy can get into a state where it expects a DaemonSet (DS) plugin to return N results, but only N-1 are ever reported.

The question is whether we can identify this sort of situation and, if so, how to rectify it.

johnSchnake avatar May 09 '22 18:05 johnSchnake

One idea that may mitigate this issue: when we query the API for the list of nodes, we may be able to tell that a node is in the process of being terminated. That would avoid the problem in the case where the scale down had begun before Sonobuoy was started.

If a scale down is initiated between the launch and the reporting of results, though, we will expect N results but later learn that only N-1 nodes remain. A question I'd have is whether we can ever KNOW that we shouldn't still wait on that other node to come up.

However, I think we can mitigate this issue in the DS monitoring logic. We could probably add a step that re-checks the number of nodes; if there is a mismatch, we can raise warnings/errors and, after some delay, report the missing node as errored so we don't have to wait for the whole timeout logic to be triggered.

johnSchnake avatar May 09 '22 18:05 johnSchnake

There has not been much activity here. We'll be closing this issue if there are no follow-ups within 15 days.

stale[bot] avatar Nov 07 '22 08:11 stale[bot]