Description

This commit fixes a serious bug found during testing.

When the user triggers a reboot of a device from the controller, domainmgr tries DomainShutdown (app deactivate), then waits for the domain to be gone, and in the worst case deletes the domain.

That is the expected procedure on a single-device configuration. In cluster mode, however, the apps will have failed over to another node, so we must not delete the app. This commit handles that case.

At the same time, we need to honor the other workflow that deactivates an app. That is why we check the reboot-in-progress flag: if the flag is not set, we go ahead and deactivate the app.
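
The decision described above can be sketched roughly as follows. This is only an illustration of the rule, not code from the commit: `teardownAction`, `chooseTeardown`, and both boolean parameters are hypothetical names, and the real check lives inside domainmgr's existing handlers.

```go
package main

import "fmt"

// teardownAction is a hypothetical name for the decision taken when a
// domain is asked to stop. It is not a type from the EVE code base.
type teardownAction int

const (
	// deactivateApp: shut down, wait for the domain to go away, and
	// delete it as a last resort (the existing single-device behaviour).
	deactivateApp teardownAction = iota
	// leaveForFailover: skip the delete because the app has failed
	// over to another cluster node.
	leaveForFailover
)

func (a teardownAction) String() string {
	if a == leaveForFailover {
		return "leave for failover"
	}
	return "deactivate"
}

// chooseTeardown is an illustrative helper (an assumption, not a function
// from this PR). Only the combination "cluster mode" plus "reboot in
// progress" skips the teardown; every other case deactivates the app.
func chooseTeardown(clusterMode, rebootInProgress bool) teardownAction {
	if clusterMode && rebootInProgress {
		return leaveForFailover
	}
	return deactivateApp
}

func main() {
	cases := []struct{ cluster, reboot bool }{
		{false, true}, // single device, reboot requested
		{true, true},  // cluster node, reboot requested
		{true, false}, // cluster node, explicit app deactivate
	}
	for _, c := range cases {
		fmt.Printf("cluster=%v reboot=%v -> %v\n",
			c.cluster, c.reboot, chooseTeardown(c.cluster, c.reboot))
	}
}
```

Keeping the explicit deactivate path unchanged is the point of checking the flag: without it, a cluster node could no longer deactivate an app on request.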

How to test and validate this PR

Create a 3-node cluster.
Create a VM and make sure it's running fine.
Reboot the node running the VM from the controller.
Without this fix, verify that the app sometimes gets deleted even though it has failed over to another node.

With this fix, we can reboot nodes and observe that the apps are still running. Explicit app deactivate was also tested, and it works.

Changelog notes

End users will see that apps are still running on another node after a node reboot.

PR Backports

14.5-stable

Checklist

[x] I've provided a proper description
[x] I've tested my PR on amd64 device
[x] I've written the test verification instructions
[x] I've set the proper labels to this PR
[x] I've checked the boxes above, or I've provided a good reason why I didn't check them.

Please, check the boxes above after submitting the PR in interactive mode.

Jul 11 '25 21:07 zedi-pramodh

Ok, so if I understand correctly, you are suggesting to change zedmanager and domainmgr to handle these cases in a generic fashion. That will probably look completely different from this commit, though I agree it's the right thing to do.

I have to rework this PR; I will probably close it and submit a new one.

Jul 19 '25 17:07 zedi-pramodh

I'll set it to draft at least.

Jul 19 '25 17:07 OhmSpectator

The mismatch is that DomainConfig add/delete is not a declarative statement but an imperative statement to start/stop, and the hypervisor/kubevirt assumes it is a declarative statement about the intended existence of the workload/task.

This needs to be fixed by 1) not having zedmanager add/delete DomainConfig to halt for device reboot (or for app instance restart with or without purge) but only doing this when the AppInstanceConfig is added/deleted and 2) introducing a separate temporary stop/run boolean (or counter(s)) for these cases. So the problem is very real, but the fix needs to be completely different.

Is it safe in terms of data consistency if we do not halt the app for device reboot on KVM EVE? I thought halting the app would make sure data is flushed.
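
To make the proposed split concrete, here is a minimal sketch of keeping "the workload should exist" separate from "the workload is temporarily halted". All names (`workloadIntent`, `exists`, `halted`, and the methods) are hypothetical and do not correspond to existing zedmanager or domainmgr types; the counter variant mentioned above is not shown.

```go
package main

import "fmt"

// workloadIntent is a hypothetical record, used only for illustration.
type workloadIntent struct {
	// exists is the declarative part: it changes only when the
	// AppInstanceConfig is added or deleted.
	exists bool
	// halted is the temporary stop/run boolean: it is set for a device
	// reboot or an app instance restart and cleared afterwards.
	halted bool
}

func (w *workloadIntent) addAppInstance()    { w.exists = true }
func (w *workloadIntent) deleteAppInstance() { w.exists = false }
func (w *workloadIntent) haltForReboot()     { w.halted = true }
func (w *workloadIntent) resume()            { w.halted = false }

// shouldRun reports whether the workload should be running on this node
// right now. A halted workload still exists, so nothing reading only
// `exists` would conclude that it should be deleted.
func (w *workloadIntent) shouldRun() bool { return w.exists && !w.halted }

func main() {
	var w workloadIntent

	w.addAppInstance()
	w.haltForReboot()
	fmt.Println("after halt:   exists =", w.exists, "run =", w.shouldRun())

	w.resume()
	fmt.Println("after resume: exists =", w.exists, "run =", w.shouldRun())

	w.deleteAppInstance()
	fmt.Println("after delete: exists =", w.exists, "run =", w.shouldRun())
}
```

In this sketch a reboot never touches `exists`, which is the property the comment asks for: the hypervisor or kubevirt layer can keep treating existence as declarative.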

Aug 01 '25 23:08 zedi-pramodh

Handle apps during explicit node reboot
