
Support LVM storage pool unmount

Open tomponline opened this issue 3 years ago • 0 comments

Ever since we started unmounting storage pools on LXD shutdown (https://github.com/lxc/lxd/pull/9217), we have seen LVM errors on the subsequent startup when using LVM on a loop file. So far this only happens in the test suite and is not reproducible locally.

The errors we see are similar to:

EROR[09-22|23:53:40] Failed to start the daemon: Failed initializing storage pool "lxdtest-w7Q": Failed activating LVM thin pool volume "lxdtest-w7Q/LXDThinPool": Failed to run: lvchange --activate y --ignoreactivationskip lxdtest-w7Q/LXDThinPool: Activation of logical volume lxdtest-w7Q/LXDThinPool is prohibited while logical volume lxdtest-w7Q/LXDThinPool_tmeta is active. 

Searching online suggests that the LVM setup has become corrupted somehow.

The only way I have managed to get the loop device to release itself with SetAutoclearOnLoopDev() is by deactivating the thinpool volume with lvchange -an, or all volumes in the volume group with vgchange -an. I had thought that would be sufficient to ensure the volume group was deactivated cleanly.

So far, the approaches I have tried without success are:

During pool Unmount():

  • Switch away from releaseLoopDev() and try the async SetAutoclearOnLoopDev() instead, whilst waiting for the volume group to disappear.
  • Same as above, but instead of (or as well as) waiting for the volume group, monitor /sys/class/block/loopN/loop/backing_file and check that it is deleted, to indicate the loop device is released.
  • Only deactivate the thinpool volume (with lvchange -an), and not all of the volumes (with vgchange -an) as was previously happening.
  • In addition to the above, call sync before calling SetAutoclearOnLoopDev() to try to get the loop device to flush to the backing file.
  • Sleep 2 seconds at the end of Unmount() to avoid LVM subsystem races.
  • Using losetup rather than openLoopFile and SetAutoclearOnLoopDev.

During Mount():

  • Wait for volume group and thin pool to appear after activating the loop file.

This doesn't happen on my local system (amd64), even when running the test suite on tmpfs; it only happens on Jenkins.

One way I have found to reliably and quickly trigger the issue on Jenkins is to have the LVM Mount() function try to activate the LVM thinpool volume using lvchange --activate y --ignoreactivationskip. This should succeed even if the thinpool is already active, but it quickly exposes the problem with the storage pool.

Some earlier attempts:

  • https://github.com/lxc/lxd/pull/9276
  • https://github.com/lxc/lxd/pull/9274
  • https://github.com/lxc/lxd/pull/9267
  • https://github.com/lxc/lxd/pull/9258
  • https://github.com/lxc/lxd/pull/9253
  • https://github.com/lxc/lxd/pull/9254
  • https://github.com/lxc/lxd/pull/9247
  • https://github.com/lxc/lxd/pull/9245

tomponline • Sep 23 '21 09:09