
Race condition in drivers operations caused by monitoring state changes

vholer opened this issue 3 years ago • 5 comments

Description

It looks like driver operations that are expected to be sequential (e.g., run VM actions, then storage actions) are not gated solely by the mandatory successful completion of the previous driver operation, but are also affected by state changes propagated from monitoring. Such a monitoring-driven change can cause the subsequent action to start before the previous action has completely finished.

This is clearly visible in container (LXD) environments, where the VM driver's shutdown action not only destroys the container but also unmaps its storage. If the unmapping takes longer for some reason, VM monitoring can report back to OpenNebula that the container is powered off, and OpenNebula continues with the storage actions to clean up the VM datastore directory. This either deletes data still in use by the kernel, or fails the action and switches the VM into a failed state.
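The race can be sketched abstractly (a minimal model with hypothetical names, not OpenNebula code): one thread plays the driver, which stops the container and then unmaps storage; another plays the monitor, which only sees the container state and therefore reports poweroff between those two steps, after which the core starts EPILOG while storage is still mapped.

```python
import threading
import time

# Minimal sketch of the race (hypothetical names, not OpenNebula code).
# The driver shutdown has two steps: stop the container, then unmap storage.
# The monitor only observes the container state, so it reports POWEROFF after
# step 1, and the core reacts to that event instead of waiting for the driver.

events = []            # ordered record of what happened
storage_mapped = True

def driver_shutdown():
    global storage_mapped
    events.append("container stopped")
    time.sleep(0.2)    # stands in for the 'sleep 120' in the reproducer below
    storage_mapped = False
    events.append("storage unmapped")

def monitor():
    time.sleep(0.1)    # monitoring probe fires between the two driver steps
    events.append("monitor reports POWEROFF")
    # The core reacts to the monitor event, not to driver completion:
    events.append(f"EPILOG starts (storage still mapped: {storage_mapped})")

d = threading.Thread(target=driver_shutdown)
m = threading.Thread(target=monitor)
d.start(); m.start()
d.join(); m.join()

print(events)
# EPILOG starts before "storage unmapped", so the delete fails as in the log
```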

To Reproduce

  1. Get an LXD environment, add a long enough sleep to the shutdown action before the storage unmapping, and synchronize the drivers on the hosts:
--- lxd/shutdown-	2021-10-19 10:21:02.575889912 +0000
+++ lxd/shutdown	2021-10-19 10:21:24.144244273 +0000
@@ -47,6 +47,8 @@
 
 container.save_idmap
 
+sleep 120
+
 raise 'Failed to dismantle container storage' unless \
 container.setup_storage('unmap')
  2. Run the VM.
  3. Terminate the VM; the VM ends in a failed state.

See the VM log, where it is clearly visible that EPILOG is initiated by monitoring, not by a successful driver action:

Tue Oct 19 10:21:40 2021 [Z0][VMM][I]: Successfully execute network driver operation: post.
Tue Oct 19 10:21:40 2021 [Z0][VM][I]: New LCM state is RUNNING
Tue Oct 19 10:22:20 2021 [Z0][VM][I]: New LCM state is SHUTDOWN
Tue Oct 19 10:22:48 2021 [Z0][LCM][I]: VM reported SHUTDOWN by the drivers
Tue Oct 19 10:22:48 2021 [Z0][VM][I]: New LCM state is EPILOG
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: Command execution failed (exit code: 1): /var/lib/one/remotes/tm/ssh/delete ubuntu2004-lxd-ssh-6-1-90-cdede-63512-2.test:/var/lib/one//datastores/0/142 142 0
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: delete: Deleting /var/lib/one//datastores/0/142
Tue Oct 19 10:22:58 2021 [Z0][TrM][E]: delete: Command "[ -e "/var/lib/one//datastores/0/142" ] || exit 0
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: 
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: times=10
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: function="rm -rf /var/lib/one//datastores/0/142"
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: 
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: count=1
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: 
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: ret=$($function)
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: error=$?
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: 
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: while [ $count -lt $times -a "$error" != "0" ]; do
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: sleep 1
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: count=$(( $count + 1 ))
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: ret=$($function)
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: error=$?
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: done
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: 
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: [ "x$error" = "x0" ]" failed: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system
Tue Oct 19 10:22:58 2021 [Z0][TrM][I]: Error deleting /var/lib/one//datastores/0/142
Tue Oct 19 10:22:58 2021 [Z0][TrM][E]: Error executing image transfer script: INFO: delete: Deleting /var/lib/one//datastores/0/142 ERROR: delete: Command "[ -e "/var/lib/one//datastores/0/142" ] || exit 0  times=10 function="rm -rf /var/lib/one//datastores/0/142"  count=1  ret=$($function) error=$?  while [ $count -lt $times -a "$error" != "0" ]; do     sleep 1     count=$(( $count + 1 ))     ret=$($function)     error=$? done  [ "x$error" = "x0" ]" failed: rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system rm: cannot remove '/var/lib/one//datastores/0/142/mapper/disk.1/context.sh': Read-only file system Error deleting /var/lib/one//datastores/0/142
Tue Oct 19 10:22:58 2021 [Z0][VM][I]: New LCM state is EPILOG_FAILURE
Tue Oct 19 10:24:21 2021 [Z0][VMM][I]: Command execution fail (exit code: 1): cat << EOT | /var/tmp/one/vmm/lxd/cancel 'one-142' 'ubuntu2004-lxd-ssh-6-1-90-cdede-63512-2.test' 142 ubuntu2004-lxd-ssh-6-1-90-cdede-63512-2.test
Tue Oct 19 10:24:21 2021 [Z0][VMM][I]: Running LXD Version 3.0.4
Tue Oct 19 10:24:21 2021 [Z0][VMM][I]: WARNING: The LXD Driver has been deprecated in favor of the LXC Driver https://docs.opennebula.io/6.0/open_cluster_deployment/lxc_node/lxd_to_lxc.html
Tue Oct 19 10:24:21 2021 [Z0][VMM][I]: shutdown: --- Stopping container ---
Tue Oct 19 10:24:21 2021 [Z0][VMM][I]: shutdown: Processing disk 0
Tue Oct 19 10:24:21 2021 [Z0][VMM][I]: shutdown: Unknown /var/lib/one/datastores/0/142/disk.0: cannot open `/var/lib/one/datastores/0/142/disk.0' (No such file or directory)
Tue Oct 19 10:24:21 2021 [Z0][VMM][I]: image format,                     trying raw filesystem mapper

Details

  • Affected Component: Core
  • Version: 6.1.90 (but probably all versions are affected)

Progress Status

  • [ ] Branch created
  • [ ] Code committed to development branch
  • [ ] Testing - QA
  • [ ] Documentation
  • [ ] Release notes - resolved issues, compatibility, known issues
  • [ ] Code committed to upstream release/hotfix branches
  • [ ] Documentation committed to upstream release/hotfix branches

vholer avatar Oct 19 '21 10:10 vholer

Continuing with the shutdown process on receiving the poweroff state from monitoring is intended; it can resolve freezes of the shutdown action. If it's an issue for the lxd/lxc drivers, there are two options to delay the poweroff state from monitoring:

  • Increase times_missing in /var/lib/one/remotes/etc/im/lxd-probes.d/probe_db.conf
  • Increase PROBES_PERIOD/STATE_VM in /etc/one/monitord.conf
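As an illustration of the first option, the setting lives in the probe configuration file on the frontend (the key name and default below match the later comments in this thread; 10 is the value reported as tested, not an official recommendation), and it has to be pushed to the nodes with `onehost sync --force`:

```
# /var/lib/one/remotes/etc/im/lxd-probes.d/probe_db.conf (sketch)
# Number of consecutive probe cycles a VM may be missing before its state
# change is reported; raising it delays the poweroff report from monitoring.
:times_missing: 10
```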

The probe_db.conf file is not documented; PR with a brief description. PR to use a bigger times_missing for lxc/lxd.

paczerny avatar Nov 03 '21 15:11 paczerny

  • needed in 5.12

tinova avatar Nov 15 '21 16:11 tinova

PR to use bigger times_missing for lxc/lxd

This is probably not needed. There is an issue with the sync state, which circumvents the missing_state counter.

Src PR fixing the sync state. Docs PR to update resolved issues.
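The circumvention described above can be sketched as follows (hypothetical names, not OpenNebula code): the periodic probe only reports poweroff after a number of consecutive misses, while a sync-state path reports immediately, so raising the counter alone would not help.

```python
# Sketch of a times_missing counter and the sync-state path that bypasses it
# (hypothetical names, not OpenNebula code; 5 is the default from this thread).

TIMES_MISSING = 5

class MissingCounter:
    def __init__(self, threshold=TIMES_MISSING):
        self.threshold = threshold
        self.misses = 0

    def probe(self, vm_seen):
        """Periodic probe: report POWEROFF only after `threshold` misses."""
        self.misses = 0 if vm_seen else self.misses + 1
        return "POWEROFF" if self.misses >= self.threshold else None

    def sync_state(self, vm_seen):
        """Sync path as described above: reports at once, skipping the counter."""
        return None if vm_seen else "POWEROFF"

c = MissingCounter()
# Four missed probes are still tolerated by the counter...
assert all(c.probe(False) is None for _ in range(4))
# ...but a sync-state report fires immediately, defeating the delay
assert c.sync_state(False) == "POWEROFF"
```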

paczerny avatar Nov 15 '21 21:11 paczerny

The vm[:ignore] flag needs to be accounted for as well when checking missing VMs.

dann1 avatar Sep 08 '22 20:09 dann1

I want to work on this issue. Getting some heads up/approval would be helpful. I am a Masters student trying to work on this open source project.

Prashiksha avatar Sep 09 '22 15:09 Prashiksha

Increasing :times_missing: from the default 5 to 10 has been tested in multiple environments, and the error has not been triggered.

Since LXD is deprecated, this will do as a somewhat reliable band-aid:

  • Edit the file /var/lib/one/remotes/etc/im/lxd-probes.d/probe_db.conf and change the value as mentioned above
  • Run onehost sync --force to sync the changes to the LXD node.

dann1 avatar Oct 19 '22 14:10 dann1