Stopping cluster member for longer than `cluster.healing_threshold` causes it to go into `EVACUATED` state
When `cluster.healing_threshold` is set, a cluster member that is cleanly stopped and not started again within that threshold is placed into the `EVACUATED` state. The member then has to be restored using `lxc cluster restore <member>` before instance services can be resumed.
```
lxc exec micro01 -- lxc config set cluster.healing_threshold=60
lxc exec micro01 -- poweroff
sleep 60
lxc exec micro02 -- lxc ls
+---------+---------+------------------------+-------------------------------------------------+-----------------+-----------+
|  NAME   |  STATE  |          IPV4          |                      IPV6                       |      TYPE       | SNAPSHOTS |
+---------+---------+------------------------+-------------------------------------------------+-----------------+-----------+
| micro01 | STOPPED |                        |                                                 | VIRTUAL-MACHINE | 0         |
+---------+---------+------------------------+-------------------------------------------------+-----------------+-----------+
| micro02 | RUNNING | 10.106.14.139 (enp5s0) | fd42:69f5:15cb:1db5:216:3eff:fe55:bc88 (enp5s0) | VIRTUAL-MACHINE | 0         |
+---------+---------+------------------------+-------------------------------------------------+-----------------+-----------+
| micro03 | RUNNING | 10.106.14.120 (enp5s0) | fd42:69f5:15cb:1db5:216:3eff:fe17:a265 (enp5s0) | VIRTUAL-MACHINE | 0         |
+---------+---------+------------------------+-------------------------------------------------+-----------------+-----------+
```
I'm not sure what the originally intended behaviour for this was, but it means that if the entire cluster is restarted, it's likely that all of the cluster members will be placed into the `EVACUATED` state, and they will all need to be recovered before instances can be started again.
Additionally, running instances don't appear to be evacuated when LXD is cleanly shut down; instead they appear to be stopped and then later recovered with a fresh boot after `cluster.healing_threshold` has elapsed.
In general I feel more thought needs to be given to the behaviour of auto-healing when cluster members are cleanly stopped or restarted.
I'm looking into that
@tomponline here is a thought on how we could handle this:
Let's say a user initiates a graceful shutdown from within the VM or physical machine that is the cluster member (`poweroff` / `shutdown`).
We could implement a listener as part of the LXD daemon that reacts to the following D-Bus events:
```go
import (
	"github.com/godbus/dbus/v5"
)

func someLXDDBusHook() {
	conn, err := dbus.SystemBus()
	if err != nil {
		logger.Errorf("Failed to connect to system bus: %v", err)
		return
	}

	// Subscribe to logind signals for shutdown and sleep.
	matchIface := dbus.WithMatchInterface("org.freedesktop.login1.Manager")
	if err := conn.AddMatchSignal(matchIface, dbus.WithMatchMember("PrepareForShutdown")); err != nil {
		logger.Errorf("Failed to subscribe to PrepareForShutdown: %v", err)
		return
	}

	if err := conn.AddMatchSignal(matchIface, dbus.WithMatchMember("PrepareForSleep")); err != nil {
		logger.Errorf("Failed to subscribe to PrepareForSleep: %v", err)
		return
	}

	logger.Info("Listening for shutdown and sleep signals...")

	c := make(chan *dbus.Signal, 10)
	conn.Signal(c)

	for signal := range c {
		switch signal.Name {
		case "org.freedesktop.login1.Manager.PrepareForShutdown":
			if started, ok := signal.Body[0].(bool); ok && started {
				logger.Info("System is preparing for shutdown...")
				// Send the information to another cluster member.
			}
		case "org.freedesktop.login1.Manager.PrepareForSleep":
			if started, ok := signal.Body[0].(bool); ok && started {
				logger.Info("System is preparing for sleep...")
				// Send the information to another cluster member.
			}
		}
	}
}
```
Now, when we detect such an event, and if `cluster.healing_threshold` has been set, then once that delay has elapsed we could mark the cluster member as `OFFLINE` (not `EVACUATED`) while still triggering the migration of the instances that were living on it. We would just need to retain the information that, after the VM / machine is restarted (not restored), the migrated instances need to come back to their original cluster member.
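To make the "come back to the original member" part concrete, here is a minimal sketch of the bookkeeping involved. All names here (`pendingReturns`, `recordMigration`, `onMemberRejoin`) are hypothetical and not part of LXD; in practice this state would live in the cluster database rather than in memory:

```go
package main

import "fmt"

// pendingReturns tracks, per cluster member, the instances that were
// auto-migrated away while that member was offline and that should be
// moved back once it rejoins. (Hypothetical bookkeeping, not LXD code.)
var pendingReturns = map[string][]string{}

// recordMigration notes that inst was migrated off member so it can be
// returned later.
func recordMigration(member, inst string) {
	pendingReturns[member] = append(pendingReturns[member], inst)
}

// onMemberRejoin returns (and clears) the list of instances that should
// be migrated back to member after it restarts.
func onMemberRejoin(member string) []string {
	insts := pendingReturns[member]
	delete(pendingReturns, member)
	return insts
}

func main() {
	recordMigration("micro01", "c1")
	recordMigration("micro01", "c2")
	fmt.Println(onMemberRejoin("micro01")) // [c1 c2]
	fmt.Println(onMemberRejoin("mic01"))   // [] (nothing pending)
}
```

The important design point is that the record survives the member's restart, so the scheduler can distinguish "restarted after graceful shutdown, return its instances" from "restored after evacuation".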
If the cluster member is a VM managed by a host LXD server, I'm wondering if it'd be a better idea to implement this detection logic in the VM agent. But regarding VMs, I don't know how D-Bus works here: is the shutdown signal relative to the VM or to the physical host?
There is another issue I can think of: for instances that are on the local storage of this node, how can we migrate them before the shutdown happens? If we wanted to migrate these instances before the graceful shutdown, we might have to define a new systemd unit (let's call it `snap.lxd.pre-shutdown-migration.service`) that intercepts the shutdown, finishes the migration of the instances on this cluster member's local storage, and then lets the cluster member shut down.
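A rough sketch of what such a unit could look like, using the standard systemd "run at shutdown" pattern (`RemainAfterExit=yes` plus `ExecStop=`, which runs when the unit is stopped on the way down). The unit name comes from the suggestion above; the migration helper path is hypothetical:

```ini
# snap.lxd.pre-shutdown-migration.service (sketch, paths hypothetical)
[Unit]
Description=Migrate local-storage instances before shutdown
DefaultDependencies=no
Before=shutdown.target
# Ensure we are stopped while the network is still up.
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
# ExecStop= runs during shutdown, before shutdown.target is reached.
ExecStop=/snap/bin/lxd-pre-shutdown-migrate

[Install]
WantedBy=multi-user.target
```

Stop jobs run in reverse dependency order, so this unit's `ExecStop=` would get a window to finish migrations before the network and LXD itself go down.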
Did you have a specific idea in mind to handle this scenario?