
LXC Copy fails with - Error reading migration control source: websocket: close 1000 (normal) when no storage space on target

Open ak2766 opened this issue 2 years ago • 11 comments

Required information

  • Distribution: Ubuntu
  • Distribution version: Ubuntu 22.04.3 LTS

lxc info :: client /var/tmp/lxd/client/lxcinfoclient-sanitized.log

lxc info :: remote /var/tmp/lxd/remote/lxcinforemote-sanitized.log

Issue description

I'm getting a websocket: close 1000 (normal) error for some containers that I'm trying to lxc copy from my laptop to a server. I've tried the following but they all fail with the same websocket error on the remote server:

lxc copy -d eth0,ipv4.address=10.11.12.13 cont1 remote:cont1
lxc copy --mode push -d eth0,ipv4.address=10.11.12.13 cont1 remote:cont1
lxc move -d eth0,ipv4.address=10.11.12.13 cont1 remote:cont1

Steps to reproduce

  1. lxc copy -d eth0,ipv4.address=10.11.12.13 cont1 remote:cont1

Information to attach

  • [ ] Any relevant kernel output (dmesg)
  • [ ] Container log (lxc info NAME --show-log)
  • [X] Container configuration (lxc config show NAME --expanded)
  • [X] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
  • [X] Output of the client with --debug
  • [X] Output of the daemon with --debug (alternatively output of lxc monitor while reproducing the issue)

lxc config show cont1 --expanded /var/tmp/lxd/client/lxcconfigshowcont1expanded-sanitized.log

debug logs on server during copy /var/tmp/lxd/remote/lxccopyremotedaemon-sanitized.log

lxc copy --debug -d eth0,ipv4.address=10.11.12.192 cont1 remote:cont1 2>&1 | tee ~/lxccopyclient.log /var/tmp/lxd/client/lxccopyclient-sanitized.log

lxc monitor 2>&1 | tee lxccopyremote.log /var/tmp/lxd/remote/lxcmonitorremote-sanitized.log

ak2766 avatar Aug 12 '23 04:08 ak2766

I found myself with spare cycles and decided to take a stab at this again. This time around, I opted to use lxc export then lxc import. Interestingly, I found the root cause using this method: No space left on device!

After extending the LVM partition I tried lxc copy again. No more websocket errors.

I believe there needs to be a check on the remote host to ensure it can host, in its entirety, the new container about to be copied across. This was only a ~3GB container, so I was getting the websocket error after ~20 minutes. I'd hate to be the guy pushing a very large container only to get a websocket error after several hours. I'm on a 20 Mbps uplink.

ak2766 avatar Aug 26 '23 05:08 ak2766

I've read the comment by @stgraber here and I understand that storage backends are going to do things differently. However, it would be a good idea to let the lazy sysadmin, who isn't aware of impending remote capacity issues, know that the copy might fail, with a message such as:

Remote Capacity: 2GB
LXC container size: 2.6GB

You might encounter issues - continue [y|N]:
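A minimal sketch of such a pre-flight check, in plain shell. Gathering the two byte counts (e.g. from lxc storage info on the remote) is left to the operator; the numbers below are just examples, and check_capacity is a hypothetical helper, not an existing LXD feature:

```shell
# Compare the container's size against the free space on the target pool
# before starting the transfer, and warn if it cannot fit.
check_capacity() {
    needed="$1"   # bytes required by the container (example value below)
    free="$2"     # bytes free on the target pool (example value below)
    if [ "$needed" -gt "$free" ]; then
        echo "warning: container needs $needed bytes but target has only $free free"
        return 1
    fi
}

check_capacity 2600000000 2000000000   # 2.6GB container vs 2GB free: warns
```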

ak2766 avatar Aug 26 '23 05:08 ak2766

Related to https://github.com/canonical/lxd/issues/11948

tomponline avatar Sep 14 '23 12:09 tomponline

@tomponline Just ran into this error. The error is generic and doesn't show what the underlying error is. There is plenty of storage on the target, so it's not storage related.

Both target and source are on the LXD 5.21/stable channel, on Ubuntu 22.04.

Jun 14 09:55:19 snail3 lxd.daemon[1210026]: time="2024-06-14T09:55:19Z" level=error msg="Failed migration on target" clusterMoveSourceName= err="Error reading migration control source: websocket: close 1000 (normal)" instance=coco live=false project=default push=false

https://github.com/canonical/lxd/issues/11948

Qubitium avatar Jun 14 '24 10:06 Qubitium

I got this bug and it's severely affecting our workflow.

I don't understand what the underlying cause is: I have many terabytes of space on my servers (source and destination), but I get this obscure error and don't know what to do.

This is on a simple "lxc copy". It's not a VM but a container.

Is there a workaround for this?

Wyk72 avatar Jun 18 '24 12:06 Wyk72

Without discovering the underlying issue, the error itself doesn't reveal the actual problem.

Try running lxc monitor --pretty on both the source and target machines before doing the lxc copy, and see if it reveals the actual issue.

tomponline avatar Jun 18 '24 12:06 tomponline

It seems it's a ZFS issue on the main server, i.e. "snapshot does not exist".

BTW, we are planning to abandon ZFS completely; it's a memory hog and kind of useless with fast NVMe storage. It was OK with slow/rusty HDDs, but makes little sense, imho, with (very) fast storage. BTRFS will be our next bet for snapshots and all.

Wyk72 avatar Jun 18 '24 15:06 Wyk72

@tomponline src/dst are not using ZFS (they're on ext4), and even with the pretty errors I am still perplexed. Please check the logs:

This is the dst doing a pull: the dst executes lxc copy src:vm

Dst Logs:

DEBUG  [2024-06-18T15:40:47Z] Updated metadata for operation                class=task description="Creating instance" operation=889f5a3d-9944-4fab-b1f2-12ce440cbe33 project=default
INFO   [2024-06-18T15:40:48Z] ID: 889f5a3d-9944-4fab-b1f2-12ce440cbe33, Class: task, Description: Creating instance  CreatedAt="2024-06-18 14:54:39.597460287 +0000 UTC" Err= Location=none MayCancel=false Metadata="map[fs_progress:jetbrain-teamcity2: 264.24GB (95.45MB/s)]" Resources="map[containers:[/1.0/instances/jetbrain-teamcity2] instances:[/1.0/instances/jetbrain-teamcity2]]" Status=Running StatusCode=Running UpdatedAt="2024-06-18 15:40:48.156862359 +0000 UTC"
DEBUG  [2024-06-18T15:40:48Z] Updated metadata for operation                class=task description="Creating instance" operation=889f5a3d-9944-4fab-b1f2-12ce440cbe33 project=default
DEBUG  [2024-06-18T15:40:48Z] Websocket: Sending barrier message            address="100.121.151.39:8443"
DEBUG  [2024-06-18T15:40:48Z] Websocket: Got barrier message                address="100.121.151.39:8443"
DEBUG  [2024-06-18T15:40:48Z] Receiving filesystem volume stopped           driver=dir path=/var/snap/lxd/common/lxd/storage-pools/intel-3.8t-two/containers/jetbrain-teamcity2/ pool=intel-3.8t-two volName=jetbrain-teamcity2
DEBUG  [2024-06-18T15:40:48Z] Migrate receive control monitor finished      instance=jetbrain-teamcity2 instanceType=container project=default
ERROR  [2024-06-18T15:41:07Z] Failed migration on target                    clusterMoveSourceName= err="Error reading migration control source: websocket: close 1000 (normal)" instance=jetbrain-teamcity2 live=false project=default push=false

Src Logs:

DEBUG  [2024-06-18T15:40:48Z] Migrate send transfer finished                instance=jetbrain-teamcity2 instanceType=container project=default
DEBUG  [2024-06-18T15:40:48Z] MigrateInstance finished                      args="&{IndexHeaderVersion:1 Name:jetbrain-teamcity2 Snapshots:[] MigrationType:{FSType:RSYNC Features:[xattrs delete bidirectional]} TrackProgress:true MultiSync:false FinalSync:false Data:<nil> ContentType: AllowInconsistent:false Refresh:false Info:0xc00019e0a0 VolumeOnly:false ClusterMove:false}" driver=dir instance=jetbrain-teamcity2 pool=samsung project=default
INFO   [2024-06-18T15:40:48Z] Migration send stopped                        instance=jetbrain-teamcity2 instanceType=container project=default
DEBUG  [2024-06-18T15:40:48Z] Migrate send control monitor finished         instance=jetbrain-teamcity2 instanceType=container project=default
ERROR  [2024-06-18T15:40:48Z] Failed migration on source                    clusterMoveSourceName= err="Rsync send failed: jetbrain-teamcity2, /var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/: [exit status 23 read unix @lxd/3e892375-278d-4c7e-8f15-bb9384f9954b->@: use of closed network connection] (rsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17601/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17607/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17608/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17601/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17607/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync: [sender] read errors mapping \"/var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs/raid0/BuildServer/system/artifacts/IOS/Archive (Mac)/17608/.teamcity/logs/buildLog.msg5\": Input/output error (5)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1338) [sender=3.2.7]\n)" instance=jetbrain-teamcity2 live=false project=default push=false
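The rsync "exit status 23" in the source log means a partial transfer: rsync could not read some source files ("Input/output error (5)"), which points at a storage problem on the source rather than at LXD itself. A hedged sketch for surfacing such files on the source host; the rootfs path is the one from the log, and scan_unreadable is a hypothetical helper, not an LXD tool:

```shell
# Force a full read of every regular file under a directory and print the
# paths that fail, mirroring the files rsync choked on in the log above.
scan_unreadable() {
    find "$1" -type f -exec sh -c 'cat "$1" > /dev/null 2>&1 || echo "unreadable: $1"' _ {} \;
}

# Run on the source host against the path from the error message:
# scan_unreadable /var/snap/lxd/common/lxd/storage-pools/samsung/containers/jetbrain-teamcity2/rootfs
```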

Qubitium avatar Jun 18 '24 16:06 Qubitium

@Qubitium does this happen with a newly created container or only this particular one?

tomponline avatar Jun 28 '24 08:06 tomponline

@tomponline It only happened with this specific container. I copied over about 10 containers and only this one had the issue. In fact, it had a sister container whose LXD container config is identical and whose data/OS are also nearly identical, but the sister container copied over without issue. So this ruled out an LXD config difference as the culprit. I also did not see any disk/permission/IO errors on the src host in dmesg/syslog.

Qubitium avatar Jun 28 '24 09:06 Qubitium

@tomponline Also, based on the progress/logs, the error happened at the VERY END of the copy process, where it appeared it had completely copied over the 240GB container and then failed for whatever reason. It did not happen at the beginning or middle but at, or near, the very end of the copy, if that helps.

Qubitium avatar Jun 28 '24 09:06 Qubitium