
lvm: make SPM host check for VGs autoactivation field

aesteve-rh opened this issue 2 years ago · 5 comments

Add a check for the autoactivation attribute of the VGs. For each VG in the LVMCache (a minimal sketch follows the list):

  1. Obtain the autoactivation field of the VG.
     a. If this fails (LVM exception), raise VolumeGroupDoesNotExist.
  2. Check whether the retrieved autoactivation flag is enabled, and if so:
     a. Log a warning and try to disable it.
     b. If disabling the option fails, raise VolumeGroupDisableOptionError <-- new exception (will raising a new exception type cause problems on the engine side? @bennyz @barpavel)
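
For illustration, a minimal sketch of the per-VG check, assuming LVM >= 2.03.12 (where the autoactivation report field and the --setautoactivation option exist); run_lvm and the standalone exception classes are hypothetical stand-ins for vdsm's lvm module internals:

```python
import subprocess


class VolumeGroupDoesNotExist(Exception):
    pass


class VolumeGroupDisableOptionError(Exception):
    pass


def run_lvm(args):
    # Hypothetical helper; in vdsm this would go through the lvm
    # module's command runner rather than calling lvm directly.
    return subprocess.run(
        ["lvm"] + args, capture_output=True, text=True, check=True
    ).stdout


def fix_vg_autoactivation(vg_name, log):
    # 1. Obtain the autoactivation field of the VG.
    try:
        out = run_lvm(
            ["vgs", "--noheadings", "-o", "autoactivation", vg_name])
    except subprocess.CalledProcessError as e:
        raise VolumeGroupDoesNotExist(
            f"Cannot read autoactivation of VG {vg_name}: {e.stderr}")

    # 2. If autoactivation is enabled, log a warning and disable it.
    if out.strip() == "enabled":
        log.warning("VG %s has autoactivation enabled, disabling it",
                    vg_name)
        try:
            run_lvm(["vgchange", "--setautoactivation", "n", vg_name])
        except subprocess.CalledProcessError as e:
            raise VolumeGroupDisableOptionError(
                f"Cannot disable autoactivation of VG {vg_name}: "
                f"{e.stderr}")
```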

Changing VGs can be done only by the SPM host; other hosts cannot fix this issue. Add a flow on the SPM that checks the VG configuration and fixes it (see the sketch after the list):

  • When starting the SPM, all VGs available on the host are checked.
    • If an error occurs (e.g., an LVM error), log a warning but do not raise.
    • Since we check all VGs, the error may be unrelated (e.g., a stale LVMCache), and raising would break the whole SPM start flow.
  • When activating a storage domain, only the associated VG is checked.
    • If an error occurs, we raise.
    • In this case the engine fails to activate the storage domain and has to retry. Either way, if the VG is not available in LVM, the SD activation would fail regardless.
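
A rough sketch of how the two call sites could differ in error handling, reusing the hypothetical fix_vg_autoactivation() from the sketch above (the function names here are illustrative, not vdsm's actual API):

```python
def check_all_vgs_on_spm_start(vg_names, log):
    # SPM start: check every VG, log failures, never raise.
    # A failure may be unrelated to this VG (e.g. a stale LVMCache
    # entry), and raising would break the whole SPM start flow.
    for vg_name in vg_names:
        try:
            fix_vg_autoactivation(vg_name, log)
        except Exception:
            log.warning("Could not check autoactivation of VG %s",
                        vg_name, exc_info=True)


def check_vg_on_sd_activation(vg_name, log):
    # SD activation: check only the associated VG and let errors
    # propagate, so the engine sees the failure and can retry.
    fix_vg_autoactivation(vg_name, log)
```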

Does it affect detaching a storage domain and attaching it to an older system?

No real change is made to the SD, so it should not cause any trouble. The only thing to consider is the new exception that was added. But given that the new checks will not run on older systems, they will not trigger the exception, so to my understanding this remains backward compatible.

How should this change be tested?

Added lvm and blocksd tests to verify the new flow.
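
As an illustration only (not one of the actual lvm/blocksd tests), a pytest-style test could fake the LVM report and assert that the fix issues the expected vgchange; the vg_check module name is hypothetical:

```python
import logging

import vg_check  # hypothetical module holding fix_vg_autoactivation()


def test_disables_enabled_autoactivation(monkeypatch):
    calls = []

    def fake_run_lvm(args):
        calls.append(args)
        # Fake report: the autoactivation flag is set on this VG.
        return "  enabled\n" if args[0] == "vgs" else ""

    monkeypatch.setattr(vg_check, "run_lvm", fake_run_lvm)

    vg_check.fix_vg_autoactivation("vg-test", logging.getLogger())

    # Disabling must go through vgchange --setautoactivation n.
    assert ["vgchange", "--setautoactivation", "n", "vg-test"] in calls
```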

It may also be nice to verify it in a VM (not done yet); a snippet for the enabling steps follows the list:

  • Have a cluster with hosts in maintenance mode.
  • Manually enable autoactivation in one of the SDs on the host (will the VG be available in maintenance mode?).
  • Activate a host; it will start the SPM and log a warning.
  • Detach the SD.
  • Enable autoactivation for the detached SD.
  • Reattach the SD; it will succeed and log a warning.
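
For the "enable autoactivation" steps above, something like this could flip the flag on the host (a sketch; it assumes direct lvm access on the host, outside vdsm's control):

```python
import subprocess


def enable_autoactivation(vg_name):
    # Re-enable autoactivation on a VG to provoke the warning path.
    # Equivalent to running: lvm vgchange --setautoactivation y <vg>
    subprocess.run(
        ["lvm", "vgchange", "--setautoactivation", "y", vg_name],
        check=True)
```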

Depends on https://github.com/oVirt/vdsm/pull/234 (should be integrated first) -- DONE.
Fixes: https://github.com/oVirt/vdsm/issues/237
Signed-off-by: Albert Esteve [email protected]

aesteve-rh · Jun 09 '22 12:06

Add a flow on the SPM that checks VG configuration and fixes it:

  • when starting the SPM
  • when activating a storage domain

This makes sense, but we need to consider more aspects:

  • Logging a warning when finding a VG that needs to be fixed.
  • How are errors handled if fixing a VG fails? Do we fail the SPM start or the storage domain activation? This could bring down the entire system if our VG update code does not work under some condition.
  • Does it affect detaching a storage domain and attaching it to an older system?
  • How should this change be tested?

I did not look at the code yet, but let's make the intent of this change clearer.

nirs · Jun 28 '22 16:06

Regarding these two points:

  • Does it affect detaching a storage domain and attaching it to an older system?

I do not yet have enough insight to foresee the outcome of this scenario. Maybe we can discuss it together, or I can try to test it, enable all logs (now that I have just learnt how to do it), and see how it goes.

  • How should this change be tested?

The LVM test is the easy part. But for the SPM, I am not sure whether I need to verify it in a real setup with VMs, or whether I can reproduce the real scenarios by adding new tests to blocksd_test.py; I'll need to check.

aesteve-rh · Jun 29 '22 09:06

Updated the issue text with some considerations following @nirs's comment.

I will not mark the PR as ready yet, but maybe you (and also @bennyz and @barpavel) can take a look at the text (probably not the code yet) and let me know if there are any weak points or fishy parts that I need to review/rework.

aesteve-rh · Jul 05 '22 15:07

@aesteve-rh Code written very nice & clean :):):)

barpavel · Jul 06 '22 08:07

Regarding the new VolumeGroupDisableOptionError: I never had to add a new VDSM error, but looking at CannotCreateLogicalVolume as an example, I see it is declared in the org.ovirt.engine.ui.frontend.VdsmErrors class, in VdsmErrors.properties, and in a few other places, so I guess there might be some work on the engine side for it. @bennyz, am I right?

barpavel · Jul 06 '22 09:07

Dropping the PR because it is not relevant anymore.

aesteve-rh · Oct 31 '22 14:10