
LVM-activate: Improve return codes and handling of missing PVs

nrwahl2 opened this pull request · 6 comments

OCF_ERR_ARGS and OCF_ERR_CONFIGURED were used in inappropriate ways, which led to starts being disallowed in the wrong places. This pull request attempts to correct that.

lvm_status() and lvm_stop() weren't designed to handle missing VGs or LVs appropriately, due to the decision not to use any LVM commands in the monitor operation. This pull request attempts to address that by using dmsetup deps to indirectly check the status of the PVs behind a VG. It's unlikely to be a perfect solution, but neither is the current approach; this aims to be an improvement.

We could use some additional testing for the scenario where a VG contains (or the cluster manages) more specialized types of logical volumes (e.g., thin pools and mirrors).

Finally, the lvm_validate() function performs a partial activation check for a VG, but there's no special handling in case the resource specifies an LV. This adds some logic to handle LV partial activation.

I strongly suggest that we have one or more LVM/device-mapper experts review this. My biggest concern is that there may be edge cases in which my approach (running dmsetup deps, checking for device entries that contain a comma, and treating a comma as a sign of a missing PV because the device is then listed only by major/minor number) is not valid.
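
For illustration, here is a minimal sketch of the kind of check I mean (the helper name and exact parsing are mine, not code from the RA):

pvs_all_present() {
    # device-mapper name for VG/LV, e.g. "test_vg1-test_lv1"
    dm_name="$1"
    deps=$(dmsetup deps -o blkdevname "$dm_name" 2>/dev/null) || return 1
    # A healthy mapping lists real block device names, e.g. "(sdd)".
    # A missing PV shows up only as a "(8, 48)"-style major/minor pair,
    # so a comma in the output is treated as a missing device.
    case "$deps" in
        *,*) return 1 ;;
        *)   return 0 ;;
    esac
}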

Resolves: RHBZ#1905820 Resolves: RHBZ#1902433 Resolves: RHBZ#1905825

Signed-off-by: Reid Wahl [email protected]

nrwahl2 · Dec 14 '20 04:12

TL;DR, this seems to be fine. See the edit at the bottom.

There is one issue with this that is a quasi-regression, and I'm not sure whether there's any way to fix it. If a PV goes missing and then comes back, it may have a different blkdevname and major/minor number. The monitor operation will then fail even though the PV/VG has been recovered, because dmsetup deps -o blkdevname still prints the ($major, $minor) format instead of the "real" blkdevname.

[root@fastvm-rhel-8-0-24 ~]# dmsetup deps -o blkdevname | grep test_vg
test_vg1-test_lv1: 1 dependencies	: (sdd)
[root@fastvm-rhel-8-0-24 ~]# iscsiadm -m node --logoutall=all
[root@fastvm-rhel-8-0-24 ~]# iscsiadm -m node --loginall=all
[root@fastvm-rhel-8-0-24 ~]# dmsetup deps -o blkdevname | grep test_vg
test_vg1-test_lv1: 1 dependencies	: (8, 48)
[root@fastvm-rhel-8-0-24 ~]# lvs test_vg1
  LV       VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  test_lv1 test_vg1 -wi-a----- 296.00m  

However, on the bright side, recovery is straightforward for the LVM-activate resource. (Notwithstanding the fact that it would bring down other resources in a group.)

[root@fastvm-rhel-8-0-24 ~]# pcs resource debug-monitor test_vg1
Operation monitor for test_vg1 (ocf:heartbeat:LVM-activate) returned: 'error' (1)
Dec 14 02:02:46 ERROR: One or more PVs missing for test_vg1/test_lv1

[root@fastvm-rhel-8-0-24 ~]# pcs resource debug-stop test_vg1
Operation stop for test_vg1 (ocf:heartbeat:LVM-activate) returned: 'ok' (0)
Dec 14 02:02:48 ERROR: One or more PVs missing for test_vg1/test_lv1
Dec 14 02:02:48 WARNING: test_vg1/test_lv1: Unexpected error. Deactivating any remaining volumes and removing device-mapper entries.
Dec 14 02:02:49 INFO: test_vg1/test_lv1: deactivated successfully

[root@fastvm-rhel-8-0-24 ~]# pcs resource debug-start test_vg1
Operation start for test_vg1 (ocf:heartbeat:LVM-activate) returned: 'ok' (0)
...

[root@fastvm-rhel-8-0-24 ~]# pcs resource debug-monitor test_vg1
Operation monitor for test_vg1 (ocf:heartbeat:LVM-activate) returned: 'ok' (0)

The only way I know to prevent this is to use LVM commands as a fallback and recreate the device-mapper entries if the device can be found that way. I'm not sure yet how this might impact dependent resources like a filesystem.
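
For reference, the rough shape of that fallback would be something like the following. This is only a sketch using real LVM commands; it is not code from this PR:

vg=test_vg1
lv=test_lv1
# If dmsetup shows a "(major, minor)" dependency but LVM can still see
# the VG, reload the device-mapper tables from the current metadata.
if dmsetup deps -o blkdevname "${vg}-${lv}" 2>/dev/null | grep -q ','; then
    if vgs "$vg" >/dev/null 2>&1; then
        vgchange --refresh "$vg"
    fi
fi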


EDIT: This seems to be "a feature not a bug," as they say. We actually should recover the VG and everything that depends on it, if we lose the PV for even a moment. Apparently, I/O to that volume fails even after the PV is recovered. See demo below.

[root@fastvm-rhel-8-0-24 ~]# pcs resource debug-start test_vg1
Operation start for test_vg1 (ocf:heartbeat:LVM-activate) returned: 'ok' (0)
Dec 14 02:25:39 INFO: test_vg1/test_lv1: activated successfully.

[root@fastvm-rhel-8-0-24 ~]# mount /dev/mapper/test_vg1-test_lv1 /mnt

[root@fastvm-rhel-8-0-24 ~]# iscsiadm -m node --logoutall=all

[root@fastvm-rhel-8-0-24 ~]# ls /mnt
ls: reading directory '/mnt': Input/output error

[root@fastvm-rhel-8-0-24 ~]# iscsiadm -m node --loginall=all

[root@fastvm-rhel-8-0-24 ~]# lvs test_vg1/test_lv1
  LV       VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  test_lv1 test_vg1 -wi-ao---- 296.00m    

[root@fastvm-rhel-8-0-24 ~]# ls /mnt
ls: reading directory '/mnt': Input/output error

nrwahl2 · Dec 14 '20 10:12

EDIT: Regarding the below, I decided to drop all the partial_activation stuff from this PR. We can deal with it separately. There are already a lot of changes here.

I had added that stuff in the first place because the changes seemed simple and logical and meshed with the rest of the PR. But testing showed that the partial_activation behavior is a lot more complicated and ill-defined than I previously believed.

This PR is now ready for review as-is. It leaves partial_activation logic unchanged -- likely broken, but no worse off than before.

I'll keep the partial_activation commits locally in case we do anything with them later.


@oalbrigt: I've done quite a bit of testing with partial_activation=false (the default). With that setting, the RA seems to be behaving in the ways I think it should.

However, partial_activation=true has been giving me a lot of problems. The trouble is that this option is only vaguely specified, so it's not clear exactly how it should behave. There's still some work to be done on the following questions:

  • Under what circumstances should the monitor operation fail when partial_activation=true?
  • When partial_activation=true, how should the monitor checks differ when lvname is specified versus when only vgname is specified? The option description refers only to the VG.

There are no obvious answers to these questions. One (probably incomplete) approach is included in the current commits. Some of the behavior is questionable. The trickiest parts involve situations when all the PVs have been removed.

We could skip a lot of the PV checking code when partial_activation=true. But then we're going to see either some questionable monitor successes or some questionable stop failures (or both), depending on the choices we make.
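
Roughly, the shortcut (and its downside) looks like this. The variable handling is illustrative, not the RA's actual code:

vg=test_vg1
lv=test_lv1
partial="${OCF_RESKEY_partial_activation:-false}"
if [ "$partial" = "true" ]; then
    : # skip the PV checks entirely; monitor may succeed with PVs missing
elif dmsetup deps -o blkdevname "${vg}-${lv}" 2>/dev/null | grep -q ','; then
    echo "One or more PVs missing for ${vg}/${lv}" >&2
    exit 1    # would map to OCF_ERR_GENERIC in the RA
fi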

I'll spare you the details for now. I'm happy to discuss them either in the PR or over IRC.

Anyway, if we can reach an agreement on exactly what we're aiming for, then we can implement it :)

nrwahl2 · Jan 02 '21 10:01

Could @teigland take a look at this and see if the approach is sane from an LVM/device-mapper perspective?

The main purpose of this PR is to address a limitation that you're already aware of. The monitor operation is based on dmsetup info. If the VG has missing PVs, the device-mapper entries still exist, and so the current RA doesn't detect a problem. Two customers have recently opened Red Hat cases/BZs for this. The existing OCF_CHECK_LEVEL=10 approach partially addresses this, but it has a few drawbacks:

  • It requires a device read on each monitor operation.
  • It only checks the first LV in the VG.
  • It doesn't do direct I/O, so it can return a false success if the LV has a mounted FS.

The last two are easily fixable and are in fact addressed within this PR. But it would be nice to simply check whether all the PVs are present, without using any LVM commands.
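
For the direct I/O point in particular, the usual fix is to read a block with O_DIRECT so the page cache can't mask a dead device; something along these lines (illustrative, not the PR's exact command):

# Read 4 KiB straight from the device, bypassing the page cache, so a
# missing or failed PV produces an I/O error instead of a cached success.
dd if=/dev/test_vg1/test_lv1 of=/dev/null bs=4k count=1 iflag=direct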

The approach I took is to run dmsetup deps -o blkdevname <VG>-<LV> and look for a comma, which indicates that a device is printed as ($maj, $min) rather than as a real name. That appears to be a reliable indicator that the mapping no longer points to a valid target device, but I'm unsure whether the test holds in all situations.

It has worked reliably in my testing so far, and one of our customers has tested and is satisfied with it. That of course doesn't prove that this is a valid approach, but there haven't been any red flags yet.

The main commit that could use attention is 91518e3.

Thanks!

nrwahl2 · Jan 08 '21 08:01

In my opinion, this HA system as a whole has a significant design problem, or at least a serious feature gap: it seems to be missing an approach to dealing with storage devices. Issues related to device connection state, device errors, device availability, device response, etc. belong in other agents or HA subsystems, or should be represented through other layers that can report them. The state of a device can best be monitored through sysfs or driver-specific tools. dm-multipath has device monitoring capabilities which might be useful for building device-level HA.
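
For example, for a SCSI-backed device, something as simple as the following shows the driver-level state without involving LVM at all (the device name is just an example):

# "running" vs. "offline" comes straight from the SCSI layer via sysfs.
cat /sys/block/sdd/device/state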

Since the lvm agent is often the closest the HA system comes to looking at storage, it is often assumed that any storage issue should be handled in the lvm agent. I don't agree with this; I think the LVM-activate agent should have the narrow job of dealing with activation of LVs and monitoring the resulting dm state.

That's not to say that the lvm agent could never be part of a solution in any way; there might be some ways to enhance it that are consistent with what it's currently doing. For example, if dm had some way of reporting lower-level issues for a dm device, I could imagine checking that (something related to dm-multipath comes to mind).

I'd need to look much more closely at what dmsetup deps really tells us about the state of a device; I'm highly skeptical that the output format is an official indication of state that we could rely on. If dm in the kernel contains some key information that we need to know, then we need to identify exactly what that info is and find an official interface to get it.

teigland · Jan 11 '21 18:01

The new push is less modular. There's one status function that does pretty much everything in one monolithic block.

There are still three new getter functions (get_dm_devs, get_dm_dependencies, and get_random_file). These are brand-new functionality, not a replacement or refactor of anything else, so there's no downside to putting them in new functions instead of folding them into lvm_status. Using new functions actually introduces less code change in this case.

I regret the size of the "Better handling of transiently missing PVs" commit, but I have not found any simpler way to address the limitations of this RA. I hope that using dmsetup deps is even a valid approach (hence the request to teigland for review).

P.S. I relayed teigland's very valid concerns to the dev mailing list.

nrwahl2 · Jan 20 '21 02:01

I guess we can close this now.

oalbrigt · Jun 29 '22 11:06