LXC copy fails with "Error reading migration control source: websocket: close 1000 (normal)" when there is no storage space on the target
Required information
- Distribution: Ubuntu
- Distribution version: Ubuntu 22.04.3 LTS
- lxc info (client): /var/tmp/lxd/client/lxcinfoclient-sanitized.log
- lxc info (remote): /var/tmp/lxd/remote/lxcinforemote-sanitized.log
Issue description
I'm getting a websocket: close 1000 (normal) error for some containers that I'm trying to lxc copy from my laptop to a server. I've tried the following but they all fail with the same websocket error on the remote server:
lxc copy -d eth0,ipv4.address=10.11.12.13 cont1 remote:cont1
lxc copy --mode push -d eth0,ipv4.address=10.11.12.13 cont1 remote:cont1
lxc move -d eth0,ipv4.address=10.11.12.13 cont1 remote:cont1
Steps to reproduce
- lxc copy -d eth0,ipv4.address=10.11.12.13 cont1 remote:cont1
Information to attach
- [ ] Any relevant kernel output (dmesg)
- [ ] Container log (lxc info NAME --show-log)
- [X] Container configuration (lxc config show NAME --expanded)
- [X] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
- [X] Output of the client with --debug
- [X] Output of the daemon with --debug (alternatively output of lxc monitor while reproducing the issue)
- lxc config show cont1 --expanded: /var/tmp/lxd/client/lxcconfigshowcont1expanded-sanitized.log
- Debug logs on server during copy: /var/tmp/lxd/remote/lxccopyremotedaemon-sanitized.log
- lxc copy --debug -d eth0,ipv4.address=10.11.12.192 cont1 remote:cont1 2>&1 | tee ~/lxccopyclient.log: /var/tmp/lxd/client/lxccopyclient-sanitized.log
- lxc monitor 2>&1 | tee lxccopyremote.log: /var/tmp/lxd/remote/lxcmonitorremote-sanitized.log
I found myself with spare cycles and decided to take a stab at this again. This time around, I opted to use lxc export then lxc import. Interestingly, I found the root cause using this method: No space left on device!
After extending the LVM partition I tried lxc copy again. No more websocket errors.
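For anyone else who lands here, the kind of check and fix I mean looks roughly like this on the target host; the pool, volume group and logical volume names below are placeholders for my setup, so adjust them for yours:

```sh
# How much space does the target's LXD storage pool have left?
# ("default" is a placeholder pool name.)
lxc storage info default
df -h /var/snap/lxd/common/lxd/storage-pools/default

# If the pool lives on an LVM-backed filesystem and the volume group still has
# free extents, grow it (example VG/LV names; resize2fs applies to ext4).
sudo lvextend -L +20G /dev/vg0/lxd-data
sudo resize2fs /dev/vg0/lxd-data
```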
I believe there needs to be a check on the remote host to ensure it can hold the new container about to be copied across in its entirety. This was only a ~3GB container, so I was getting the websocket error after ~20 minutes. I'd hate to be the guy pushing a very large container only to get a websocket error after several hours. I'm on a 20 Mbps uplink.
I've read the comment by @stgraber here, and I understand that storage backends are going to do things differently. However, it would be a good idea to let the lazy sysadmin who isn't aware of impending remote capacity issues know that the copy might fail, e.g. with a prompt such as:
Remote Capacity: 2GB
LXC container size: 2.6GB
You might encounter issues - continue [y|N]:
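Until something like that exists, a rough manual preflight is possible; the paths below assume the snap install and a dir-backed pool named default (cont1 is my container), so treat them as an example only:

```sh
# Approximate size of the container on the source:
sudo du -sh /var/snap/lxd/common/lxd/storage-pools/default/containers/cont1

# Free space reported by the destination pool (lxc storage info should accept a
# remote: prefix; failing that, run it directly on the remote host):
lxc storage info remote:default
```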
Related to https://github.com/canonical/lxd/issues/11948
@tomponline Just ran into this error. The error is generic and doesn't show what the underlying cause is. There is plenty of storage on the target, so it is not storage related.
Both target and source are on the LXD 5.21/stable channel, running Ubuntu 22.04.
Jun 14 09:55:19 snail3 lxd.daemon[1210026]: time="2024-06-14T09:55:19Z" level=error msg="Failed migration on target" clusterMoveSourceName= err="Error reading migration control source: websocket: close 1000 (normal)" instance=coco live=false project=default push=false
https://github.com/canonical/lxd/issues/11948
I got this bug and it's severely affecting our workflow.
I don't understand what the underlying cause is: I have plenty of terabytes of space on my servers (source and destination), but I get this obscure error and don't know what to do.
This is on a simple "lxc copy". It's not a VM but a container.
Is there a workaround for this?
Without knowing the underlying issue, the error message alone doesn't tell you what the actual problem is.
Try using lxc monitor --pretty on the source and target machines before doing the lxc copy and see if it reveals the actual issue.
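For example, something along these lines (substitute your own container and remote names):

```sh
# On the source and on the target, in separate terminals, before starting the copy:
lxc monitor --pretty 2>&1 | tee /tmp/lxc-monitor-$(hostname).log

# Then, from the client:
lxc copy cont1 remote:cont1
```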
It seems it's a ZFS issue on the main server, i.e. "snapshot does not exist".
BTW, we are planning to abandon ZFS completely; it's a memory hog and kind of useless with fast NVMe storage. It was OK with slow/rusty HDDs, but makes little sense, imho, with (very) fast storage. BTRFS will be our next bet for snapshots and all.
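(In case it helps anyone else hitting the same "snapshot does not exist" failure: it can be checked directly on the source. The dataset path below just assumes the usual LXD layout of <pool>/containers/<name>; the pool and instance names are placeholders.)

```sh
# Snapshots as ZFS sees them for the container's dataset:
sudo zfs list -t snapshot -r default/containers/mycontainer

# Snapshots as LXD thinks they exist:
lxc info mycontainer
```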
@tomponline src/dst are not using ZFS (ext4), and even with the --pretty output I am still perplexed. Please check the logs:
This is dst doing a pull: dst is executing lxc copy src:vm
Dst Logs:
DEBUG [2024-06-18T15:40:47Z] Updated metadata for operation class=task description="Creating instance" operation=889f5a3d-9944-4fab-b1f2-12ce440cbe33 project=default
INFO [2024-06-18T15:40:48Z] ID: 889f5a3d-9944-4fab-b1f2-12ce440cbe33, Class: task, Description: Creating instance CreatedAt="2024-06-18 14:54:39.597460287 +0000 UTC" Err= Location=none MayCancel=false Metadata="map[fs_progress:jetbrain-teamcity2: 264.24GB (95.45MB/s)]" Resources="map[containers:[/1.0/instances/jetbrain-teamcity2] instances:[/1.0/instances/jetbrain-teamcity2]]" Status=Running StatusCode=Running UpdatedAt="2024-06-18 15:40:48.156862359 +0000 UTC"
DEBUG [2024-06-18T15:40:48Z] Updated metadata for operation class=task description="Creating instance" operation=889f5a3d-9944-4fab-b1f2-12ce440cbe33 project=default
DEBUG [2024-06-18T15:40:48Z] Websocket: Sending barrier message address="100.121.151.39:8443"
DEBUG [2024-06-18T15:40:48Z] Websocket: Got barrier message address="100.121.151.39:8443"
DEBUG [2024-06-18T15:40:48Z] Receiving filesystem volume stopped driver=dir path=/var/snap/lxd/common/lxd/storage-pools/intel-3.8t-two/containers/jetbrain-teamcity2/ pool=intel-3.8t-two volName=jetbrain-teamcity2
DEBUG [2024-06-18T15:40:48Z] Migrate receive control monitor finished instance=jetbrain-teamcity2 instanceType=container project=default
ERROR [2024-06-18T15:41:07Z] Failed migration on target clusterMoveSourceName= err="Error reading migration control source: websocket: close 1000 (normal)" instance=jetbrain-teamcity2 live=false project=default push=false
Src Logs:
DEBUG [2024-06-18T15:40:48Z] Migrate send transfer finished instance=jetbrain-teamcity2 instanceType=container project=default
DEBUG [2024-06-18T15:40:48Z] MigrateInstance finished args="&{IndexHeaderVersion:1 Name:jetbrain-teamcity2 Snapshots:[] MigrationType:{FSType:RSYNC Features:[xattrs delete bidirectional]} TrackProgress:true MultiSync:false FinalSync:false Data:<nil> ContentType: AllowInconsistent:false Refresh:false Info:0xc00019e0a0 VolumeOnly:false ClusterMove:false}" driver=dir instance=jetbrain-teamcity2 pool=samsung project=default
INFO [2024-06-18T15:40:48Z] Migration send stopped instance=jetbrain-teamcity2 instanceType=container project=default
DEBUG [2024-06-18T15:40:48Z] Migrate send control monitor finished instance=jetbrain-teamcity2 instanceType=container project=default
ERROR [2024-06-18T15:40:48Z] Failed migration on source clusterMoveSourceName= err="Rsync send failed: jetbrain-teamcity2, /var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/: [exit status 23 read unix @lxd/3e892375-278d-4c7e-8f15-bb9384f9954b->@: use of closed network connection] (rsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17601/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17607/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17608/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17601/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17607/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17608/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1338) [sender=3.2.7]\n)" instance=jetbrain-teamcity2 live=false project=default push=false
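The rsync exit status 23 above points at files on the source that cannot be read; a quick way to confirm that on the src host, using one of the paths straight from the error, would be something like:

```sh
# Try to read one of the files rsync complained about; an unreadable file should
# fail here too and usually leaves a trace in the kernel log.
sudo cat "/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17601/.teamcity/logs/buildLog.msg5" > /dev/null
sudo dmesg | grep -iE 'i/o error|ata|nvme' | tail
```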
@Qubitium does this happen with a newly created container or only this particular one?
@tomponline It only happened with this specific container. I copied over about 10 containers; only this one had an issue. In fact, it had a sister container where the LXD container configs are identical and the data/OS are also nearly identical, but the sister container copied over without issue. So this rules out an LXD config diff as the culprit. I also did not see any disk/permission/IO errors on the src host in dmesg/syslog.
@tomponline Also, based on the progress/logs, the error happened at the VERY END of the copy process, where it appeared it had completely copied over the 240GB container and then failed for whatever reason. It did not happen at the beginning or middle but at the very end, or near the end, of the copy, if that helps.
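For what it's worth, since the failure only shows up at the very end, one way to surface every unreadable file under the container before retrying the copy would be a full read pass over the source pool path from the logs above, e.g.:

```sh
# Read every file under the container's rootfs; any file with I/O errors will be
# reported on stderr by cat.
sudo find /var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs \
  -type f -exec cat {} + > /dev/null
```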