Live migration failures after upgrading from 4.5.2 to 4.5.4
Everything was running on oVirt 4.5.2 on Rocky 8.6 with no issues for months. On Friday I updated our standalone engine to Rocky 8.7 and oVirt 4.5.4, decommissioned one of the hosts, then built a replacement on new hardware (same CPU) on 4.5.4. I was able to migrate VMs to and from the new host successfully and called it good, then upgraded the rest of my hosts in sequence to Rocky 8.7 and oVirt 4.5.4. The only issue I ran into was that I had to run dnf update --allowerasing manually on each host due to some ovs package changes. No issues were apparent in operation at this point.
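For reference, the only manual step in each host upgrade was:

    # the ovs package changes in 4.5.4 conflicted with what was already installed,
    # so a plain "dnf update" would not go through without allowing erasures
    dnf update --allowerasing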
On Monday morning, one VM appeared to have hung and was manually restarted. A couple of hours later a few VMs stopped running with the following error in the web UI: "VM
In vdsm.log on the migration source host I see:
2023-02-06 11:47:31,184-0800 ERROR (jsonrpc/2) [api] FINISH detach_volume error=Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpebl0nwmd/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 3275650\\nTraceback (most recent call last):\\n File "/usr/libexec/vdsm/managedvolume-helper", line 139, in <module>\\n sys.exit(main(sys.argv[1:]))\\n File "/usr/libexec/vdsm/managedvolume-helper", line 62, in main\\n args.command(args)\\n File "/usr/libexec/vdsm/managedvolume-helper", line 134, in detach\\n ignore_errors=False)\\n File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 110, in disconnect_volume\\n run_as_root=True)\\n File "/usr/lib/python3.6/site-packages/os_brick/executor.py", line 53, in _execute\\n result = self.__execute(*args, **kwargs)\\n File "/usr/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 172, in execute\\n return execute_root(*cmd, **kwargs)\\n File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 272, in _wrap\\n r_call_timeout)\\n File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 215, in remote_call\\n raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/vmstore/volume-9b11821b-a9dc-4641-9c6f-00bb326a4eda --conf /tmp/brickrbd_ukuiqq_t --id vmstore --mon_host 192.168.88.75:6789 --mon_host 192.168.8.76:6789 --mon_host 192.168.8.77:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',) (api:114)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 107, in method
ret = func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 1808, in detach_volume
return managedvolume.detach_volume(sd_id, vol_id)
File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 63, in wrapper
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 138, in detach_volume
run_helper("detach", vol_info)
File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 184, in run_helper
sub_cmd, cmd_input=cmd_input)
File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 38, in __call__
return callMethod()
File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 36, in <lambda>
**kwargs)
File "<string>", line 2, in managedvolume_run_helper
File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
raise convert_to_error(kind, result)
vdsm.storage.exception.ManagedVolumeHelperFailed: Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpebl0nwmd/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 3275650\\nTraceback (most recent call last):\\n File "/usr/libexec/vdsm/managedvolume-helper", line 139, in <module>\\n sys.exit(main(sys.argv[1:]))\\n File "/usr/libexec/vdsm/managedvolume-helper", line 62, in main\\n args.command(args)\\n File "/usr/libexec/vdsm/managedvolume-helper", line 134, in detach\\n ignore_errors=False)\\n File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 110, in disconnect_volume\\n run_as_root=True)\\n File "/usr/lib/python3.6/site-packages/os_brick/executor.py", line 53, in _execute\\n result = self.__execute(*args, **kwargs)\\n File "/usr/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 172, in execute\\n return execute_root(*cmd, **kwargs)\\n File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 272, in _wrap\\n r_call_timeout)\\n File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 215, in remote_call\\n raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/vmstore/volume-9b11821b-a9dc-4641-9c6f-00bb326a4eda --conf /tmp/brickrbd_ukuiqq_t --id vmstore --mon_host 192.168.8.75:6789 --mon_host 192.168.8.76:6789 --mon_host 192.168.8.77:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',)
And on the target migration host:
2023-02-06 18:09:37,891-0800 ERROR (jsonrpc/6) [api] FINISH detach_volume error=Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpezistr1k/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 2922225\\nTraceback (most recent call last):\\n File "/usr/libexec/vdsm/managedvolume-helper", line 139, in <module>\\n sys.exit(main(sys.argv[1:]))\\n File "/usr/libexec/vdsm/managedvolume-helper", line 62, in main\\n args.command(args)\\n File "/usr/libexec/vdsm/managedvolume-helper", line 134, in detach\\n ignore_errors=False)\\n File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 110, in disconnect_volume\\n run_as_root=True)\\n File "/usr/lib/python3.6/site-packages/os_brick/executor.py", line 53, in _execute\\n result = self.__execute(*args, **kwargs)\\n File "/usr/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 172, in execute\\n return execute_root(*cmd, **kwargs)\\n File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 272, in _wrap\\n r_call_timeout)\\n File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 215, in remote_call\\n raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/vmstore/volume-37800cff-ec62-4d84-9f52-6a179fde1472 --conf /tmp/brickrbd_kchs7qub --id vmstore --mon_host 192.168.8:6789 --mon_host 192.168.8.76:6789 --mon_host 192.168.8.77:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',) (api:114)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 107, in method
ret = func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 1808, in detach_volume
return managedvolume.detach_volume(sd_id, vol_id)
File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 63, in wrapper
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 138, in detach_volume
run_helper("detach", vol_info)
File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 184, in run_helper
sub_cmd, cmd_input=cmd_input)
File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 38, in __call__
return callMethod()
File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 36, in <lambda>
**kwargs)
File "<string>", line 2, in managedvolume_run_helper
File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
raise convert_to_error(kind, result)
vdsm.storage.exception.ManagedVolumeHelperFailed: Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpezistr1k/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 2922225\\nTraceback (most recent call last):\\n File "/usr/libexec/vdsm/managedvolume-helper", line 139, in <module>\\n sys.exit(main(sys.argv[1:]))\\n File "/usr/libexec/vdsm/managedvolume-helper", line 62, in main\\n args.command(args)\\n File "/usr/libexec/vdsm/managedvolume-helper", line 134, in detach\\n ignore_errors=False)\\n File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 110, in disconnect_volume\\n run_as_root=True)\\n File "/usr/lib/python3.6/site-packages/os_brick/executor.py", line 53, in _execute\\n result = self.__execute(*args, **kwargs)\\n File "/usr/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 172, in execute\\n return execute_root(*cmd, **kwargs)\\n File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 272, in _wrap\\n r_call_timeout)\\n File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 215, in remote_call\\n raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/vmstore/volume-37800cff-ec62-4d84-9f52-6a179fde1472 --conf /tmp/brickrbd_kchs7qub --id vmstore --mon_host 192.168.8.75:6789 --mon_host 192.168.8.76:6789 --mon_host 192.168.8:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',)
After this I believe the affected VMs failed and were manually restarted on another host, but several others were still running on the problem host. I was later able to migrate those off to another host successfully. I rebooted that host and was able to migrate other VMs to and from it without issue, so I initially thought that was the end of the problem and left it running with a small number of VMs to see if the issue recurs.
Later, while manually rebalancing some VMs across the other hosts, I found that some of them fail to migrate between certain hosts with the same error, but migrate to other hosts just fine. These are VMs that were never on the host that had the initial problem.
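The common thread is the "rbd unmap" call exiting with EBUSY (16), i.e. something on the host presumably still holds the device when vdsm tries to detach it. If it happens again I'll try to capture what that is before anything gets restarted, something along these lines on the affected host (volume id taken from the first error above):

    # which rbd images are currently mapped on this host
    rbd showmapped

    # resolve the udev symlink from the error to the real /dev/rbdN node
    DEV=$(readlink -f /dev/rbd/vmstore/volume-9b11821b-a9dc-4641-9c6f-00bb326a4eda)

    # any process (qemu, something leftover) that still has the device open
    lsof "$DEV"

    # any mount or device-mapper holder sitting on top of it
    findmnt "$DEV"
    ls /sys/block/$(basename "$DEV")/holders/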
Connectivity looks good. I see no disconnections on any network interface, and only a handful of packet drops on the host side of my storage network (not increasing, and likely related to the storage network being set up after the host booted).
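By "packet drops" I mean the per-interface counters on the hosts, checked roughly like this (<storage-nic> is a placeholder for the actual interface name):

    # kernel RX/TX statistics for the storage interface, including drops
    ip -s link show dev <storage-nic>

    # driver/NIC-level counters for the same interface
    ethtool -S <storage-nic> | grep -iE 'drop|discard|err'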
Googling the error turned up an earlier report of what looks like the same issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1755801
https://lists.ovirt.org/archives/list/[email protected]/thread/PVGTQPXCTEQI4LUUSXDRLSIH3GXXQC2N/?sort=date
That sounds like either a bug that was presumed fixed, or an issue caused by an external mount of the same volume. The latter is definitely not the case here: we use a dedicated Ceph RBD pool for oVirt, along with a dedicated CephFS volume for the primary storage domain (which still holds some of our VMs that we haven't yet migrated to RBD, since that is currently a long manual process involving VM downtime).
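If it would help to rule the external-mount theory out completely, the watcher list on one of the affected images should show that only the expected oVirt host has it mapped; from a node with Ceph admin credentials, something like (pool/image taken from the errors above):

    # lists the clients currently watching (i.e. having mapped) the image
    rbd status vmstore/volume-9b11821b-a9dc-4641-9c6f-00bb326a4eda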