
Support live migration of VMs with attached volumes

Open · benoitjpnet opened this issue 1 year ago · 12 comments

I have the following cluster:

root@mc10:~# lxc cluster ls
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME |            URL            |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc10 | https://192.168.1.10:8443 | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|      |                           | database        |              |                |             |        |                   |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc11 | https://192.168.1.11:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc12 | https://192.168.1.12:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
root@mc10:~# 

I start one VM:

lxc launch ubuntu:22.04 v1 --vm --target mc10

I move it:

root@mc10:~# lxc exec v1 -- uptime
 11:45:17 up 0 min,  0 users,  load average: 0.59, 0.13, 0.04
root@mc10:~# 

root@mc10:~# lxc move v1 --target mc11
Error: Instance move to destination failed: Error transferring instance data: Failed migration on target: Failed getting migration target filesystem connection: websocket: bad handshake
root@mc10:~# 

benoitjpnet · Dec 15 '23

Hi @benoitjpnet, it looks like live migration isn't yet enabled on your cluster. You can confirm by checking the LXD daemon's error logs using journalctl -u snap.lxd.daemon.

roosterfish · Jan 02 '24

The only error I see is:

Jan 02 08:44:47 mc10 lxd.daemon[2134]: time="2024-01-02T08:44:47Z" level=error msg="Failed migration on target" clusterMoveSourceName=builder err="Failed getting migration target filesystem connection: websocket: bad handshake" instance=builder live=true project=default push=false

The error message could be more explicit.

But thank you, I re-read the documentation and I missed:

Set migration.stateful to true on the instance.
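
For anyone else hitting this, the setting can be applied with something like the following (using the v1 instance from above; as far as I can tell the VM has to be restarted for it to take effect):

lxc config set v1 migration.stateful=true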

Then I run lxc move v1 --target mc10, but it gets stuck. I guess it is not related to MicroCloud though.

benoitjpnet · Jan 02 '24

Can you check the logs on both ends (source and target host)? One of them should indicate that migration has to be enabled in the config.

roosterfish · Jan 02 '24

Concerning the stuck part:

Jan 02 08:50:50 mc10 lxd.daemon[2134]: time="2024-01-02T08:50:50Z" level=warning msg="Unable to use virtio-fs for device, using 9p as a fallback" device=builder_var_lib_laminar driver=disk err="Stateful migration unsupported" instance=builder project=default
Jan 02 08:50:50 mc10 lxd.daemon[2134]: time="2024-01-02T08:50:50Z" level=warning msg="Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported" instance=builder instanceType=virtual-machine project=default
Jan 02 08:50:51 mc10 lxd.daemon[2134]: time="2024-01-02T08:50:51Z" level=warning msg="Failed reading from state connection" err="read tcp 192.168.1.10:57884->192.168.1.11:8443: use of closed network connection" instance=builder instanceType=virtual-machine project=default

I use Ceph RBD + CephFS and it seems CephFS is not supported for live migration :(
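
(For anyone checking their own setup: which pool backs an attached disk device, and which driver that pool uses, can be confirmed with something like the commands below; builder is the instance name from the log above.)

lxc config device show builder
lxc storage ls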

benoitjpnet · Jan 02 '24

Can you check the logs on both ends (source and target host)? One of them should indicate that migration has to be enabled in the config.

I was not able to find such logs/messages.

benoitjpnet · Jan 02 '24

I was able to reproduce the warnings including the hanging migration. I guess you have added a new CephFS storage pool to the MicroCloud cluster and attached one of its volumes to the v1 instance which you are trying to migrate?
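
Roughly, with an existing CephFS-backed pool (called cephfs-pool here; the pool, volume, and mount path names are assumptions on my part), something like this reproduces the hang and the warnings below:

lxc storage volume create cephfs-pool vol
lxc storage volume attach cephfs-pool vol v1 /mnt/vol
lxc move v1 --target m1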

@tomponline this looks to be an error on the LXD side when migrating VMs that have a CephFS volume attached. Should we block migration of VMs with attached volumes? At least the QEMU error below kind of indicates that this is not supported. Is that the reason why the DiskVMVirtiofsdStart function returns "Stateful migration unsupported"?

On the source host you can see the following log messages:

Jan 02 13:11:21 m2 lxd.daemon[7034]: time="2024-01-02T13:11:21Z" level=warning msg="Unable to use virtio-fs for device, using 9p as a fallback" device=vol driver=disk err="Stateful migration unsupported" instance=v1 project=default
Jan 02 13:11:21 m2 lxd.daemon[7034]: time="2024-01-02T13:11:21Z" level=warning msg="Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported" instance=v1 instanceType=virtual-machine project=default
...
Jan 02 13:11:50 m2 lxd.daemon[7034]: time="2024-01-02T13:11:50Z" level=error msg="Failed migration on source" clusterMoveSourceName=v1 err="Failed starting state transfer to target: Migration is disabled when VirtFS export path 'NULL' is mounted in the guest using mount_tag 'lxd_vol'" instance=v1 live=true project=default push=false

On the target side:

Jan 02 13:11:50 m1 lxd.daemon[4537]: time="2024-01-02T13:11:50Z" level=warning msg="Unable to use virtio-fs for device, using 9p as a fallback" device=vol driver=disk err="Stateful migration unsupported" instance=v1 project=default
Jan 02 13:11:50 m1 lxd.daemon[4537]: time="2024-01-02T13:11:50Z" level=warning msg="Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported" instance=v1 instanceType=virtual-machine project=default
Jan 02 13:11:50 m1 lxd.daemon[4537]: time="2024-01-02T13:11:50Z" level=warning msg="Failed reading from state connection" err="read tcp 10.171.103.8:38154->10.171.103.138:8443: use of closed network connection" instance=v1 instanceType=virtual-machine project=default

roosterfish · Jan 02 '24

I was able to reproduce the warnings including the hanging migration. I guess you have added a new CephFS storage pool to the MicroCloud cluster and attached one of its volumes to the v1 instance which you are trying to migrate?

Correct.

benoitjpnet · Jan 02 '24

Thanks @roosterfish @benoitjpnet, I have moved this to LXD for triaging.

@benoitjpnet can you confirm that live migration works if there is no volume attached?

tomponline · Jan 03 '24

Yes it works.

root@mc10:~# lxc launch ubuntu:22.04 v1 --vm --target mc10 -d root,size=10GiB -d root,size.state=4GiB -c limits.memory=4GiB -c limits.cpu=4 -c migration.stateful=true
Creating v1
Starting v1
root@mc10:~# lxc exec v1 -- uptime
 13:10:21 up 0 min,  0 users,  load average: 0.74, 0.19, 0.06
root@mc10:~# lxc move v1 --target mc11
root@mc10:~# lxc exec v1 -- uptime
 13:10:47 up 0 min,  0 users,  load average: 0.49, 0.17, 0.06
root@mc10:~# 

benoitjpnet · Jan 03 '24

@MusicDin please can you evaluate what happens when trying to migrate (in both live and non-live modes) a VM with custom volumes attached (filesystem and block types), and identify what does and doesn't work.

I suspect we will need quite a bit of work to add support for live-migrating custom block volumes on remote storage, and that live migration of VMs with custom local volumes isn't going to work either.

So we are likely going to need to land an improvement to detect incompatible scenarios and return a clear error message, and then potentially add a work item for a future roadmap to improve migration support of custom volumes.
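
For illustration, the two attachment scenarios mentioned above would look roughly like this (pool and volume names are placeholders):

# filesystem custom volume, mounted at a path in the guest
lxc storage volume create pool fsvol
lxc storage volume attach pool fsvol v1 /mnt/fsvol

# block custom volume, attached as an extra disk
lxc storage volume create pool blockvol --type=block
lxc storage volume attach pool blockvol v1

lxc move v1 --target mc11

Each case would need to be checked against both a local pool (e.g. ZFS) and a remote pool (Ceph RBD), per the local/remote distinction above.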

tomponline · Jan 08 '24

https://github.com/canonical/lxd/pull/12733 improves the error the user sees in this situation.

tomponline · Jan 18 '24

Seems relevant: https://github.com/lxc/incus/pull/686

tomponline · Jul 01 '24

Hi @boltmark, as you're working on some migration work with regard to https://github.com/canonical/lxd/pull/13695, I thought it would also be a good opportunity for you to take a look at fixing this issue, considering https://github.com/lxc/incus/pull/686.

tomponline · Jul 03 '24