
[hyperkit] instances potentially corrupted on suspend

erny opened this issue 4 years ago • 12 comments

Describe the bug

A while after a suspend/wakeup cycle, after running multipass list; multipass shell, the machine freezes. Initially you can see hyperkit having high CPU usage, but after trying to gain access to the VM instance, the system freezes. After a reboot, multipass froze again, with no VM started.

To Reproduce

Still don't know.

Expected behavior

multipass shell should be able to access the instance, or multipass stop should stop the VM (even uncleanly) if the instance does not respond.

Logs

Additional info

  • OS: macOS 10.14.6
  • multipass version: multipass 1.5.0+mac, multipassd 1.5.0+mac
  • multipass info --all:
Name:           primary
State:          Stopped
IPv4:           --
Release:        --
Image hash:     da9122a1174f (Ubuntu 20.04 LTS)
Load:           --
Disk usage:     --
Memory usage:   --
Mounts:         /Users/erevilla => Home
                    UID map: 501:default
                    GID map: 20:default

Name:           test
State:          Stopped
IPv4:           --
Release:        --
Image hash:     156e28463db8 (Ubuntu 20.04 LTS)
Load:           --
Disk usage:     --
Memory usage:   --

Additional context

  • Docker Desktop installed and running, which also uses hyperkit.
  • This happened after midnight. I don't know if any special processes, like compaction, run then. I got several errors.
  • The multipass "primary" instance (which is the instance that has problems) uses lxc / lxd containers.
  • After trying to start again after the crash, hyperkit used 100% of the CPU. After a while I was able to stop the VM with multipass stop, and I got: Stopping primary [2021-01-18T09:23:54.422] [error] [primary] process error occurred Crashed. After a new multipass shell, the instance seems to work again, but on each shutdown I get: "process error occurred Crashed"

erny avatar Jan 18 '21 08:01 erny

Hi @erny, could you verify that, if you stop the instances prior to suspending, you can start them again after resume? I suspect hyperkit doesn't deal well with suspend, and we may need to stop the instances on suspend and start them on resume, provided macOS gives us the time to do it properly.

Worst case scenario the instance image may get corrupted.
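
Until that's automated, a manual workaround along those lines might look like this (a sketch using standard multipass commands; "primary" is just an example name):

# before suspending the Mac: stop all running instances cleanly
multipass stop --all

# after resume: start the ones you need again
multipass start primary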

This, from the hyperkit log, would seem to confirm it:

[2021-01-18T00:39:34.484] [error] [primary] hyperkit: [ERROR] write sector = 13823704 length = 102400: I/O deadline exceeded
[2021-01-18T00:39:34.484] [info] [primary] hyperkit: [INFO] (((description "write sector = 13823704 length = 102400")
[2021-01-18T00:39:34.484] [error] [primary] (locks
[2021-01-18T00:39:34.484] [error] [primary] (((description "cluster 100586") (mode Read) (released false))
[2021-01-18T00:39:34.485] [error] [primary] ((description "cluster 21024") (mode Read) (released false)))))
[2021-01-18T00:39:34.485] [error] [primary] ((description "Anonymous client 31073704")
[2021-01-18T00:39:34.485] [error] [primary] (locks (((description "cluster 515478") (mode Write) (released false)))))
[2021-01-18T00:39:34.485] [error] [primary] ((description "Anonymous client 31073703")
[2021-01-18T00:39:34.485] [error] [primary] (locks (((description "cluster 515587") (mode Read) (released false))))))
[2021-01-18T00:48:27.488] [debug] [sshfs-mounts] No mounts to stop for instance "test"

Saviq avatar Jan 18 '21 08:01 Saviq

@Saviq: Thanks for your quick response. OK, I'll do that next time. It would be nice if we could pause / hibernate the instance to avoid a complete reboot.

I'm not sure if the image (40GB, 4 CPU, 6GB RAM) will be usable anymore (no usable output in the primary log), but I still get new messages in multipassd.log.

erny avatar Jan 18 '21 08:01 erny

Indeed, the image seems to have gotten corrupted and without some manual recovery (I proposed a more friendly solution in #1910), it won't boot:

[60767.601955] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[60767.609428] rcu: 	0-...!: (5 GPs behind) idle=9fe/0/0x1 softirq=1136374/1136374 fqs=1 
[60767.610870] rcu: 	3-...!: (1 GPs behind) idle=7e6/0/0x1 softirq=1161464/1161464 fqs=1 
[60767.672761] rcu: rcu_sched kthread starved for 15305 jiffies! g4121653 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
[60767.674695] rcu: RCU grace-period kthread stack dump:
[60967.284803] ata1.00: revalidation failed (errno=-5)
[60977.789136] ata1.00: revalidation failed (errno=-5)
[61008.755356] ata1.00: revalidation failed (errno=-5)

Saviq avatar Jan 18 '21 08:01 Saviq

@Saviq: I don't know what happened after the reboot, but hyperkit was at 100% CPU usage for a while. After another stop/start cycle, the VM came back.

Anyway, is there a way to check the disk image? Thanks again!

erny avatar Jan 18 '21 08:01 erny

@erny qcow-tool has check and repair commands that could help:

COMMANDS
       check
           check the device for internal consistency

       info
           display general information about a qcow2

       repair
           Regenerate the refcount table in an image
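
For example, against a stopped instance (a sketch assuming Docker Desktop's bundled copy of qcow-tool and an instance named "ubuntu"; adjust paths to your setup):

# run as root: the images live under /var/root
cd "/var/root/Library/Application Support/multipassd/vault/instances/ubuntu"

# read-only consistency check
/Applications/Docker.app/Contents/MacOS/qcow-tool check ./ubuntu-20.04-server-cloudimg-amd64.img

# regenerate the refcount table if the check reports problems
/Applications/Docker.app/Contents/MacOS/qcow-tool repair ./ubuntu-20.04-server-cloudimg-amd64.img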

Saviq avatar Jan 18 '21 08:01 Saviq

@Saviq: Isn't the qemu-img tool that comes with multipass, in /Library/Application Support/com.canonical.multipass/bin/qemu-img, suitable?

I have /Applications/Docker.app/Contents/MacOS/qcow-tool from the latest available Docker release (3.0.4); the tool itself claims to be version 1.0.0. I'll try with this one.
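
For reference, a check with the bundled qemu-img would look like this (a sketch with an assumed image path; per the reply below, qemu-img may be the riskier choice for hyperkit-managed images):

# read-only consistency check with multipass's bundled qemu-img
"/Library/Application Support/com.canonical.multipass/bin/qemu-img" check \
  "/var/root/Library/Application Support/multipassd/vault/instances/primary/ubuntu-20.04-server-cloudimg-amd64.img"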

erny avatar Jan 18 '21 09:01 erny

@erny qcow-tool uses the same underlying QCOW2 library that hyperkit does, so it is a safer bet. We've seen incompatibilities between the two tools once an image had been touched by the hyperkit implementation.

Saviq avatar Jan 18 '21 09:01 Saviq

@Saviq: qcow-tool does not show any errors, but it doesn't seem to scan the whole image. I'm afraid that "repair" could make the image unusable; I read a thread about a bug in qemu when accessing sparse files on APFS (https://bugs.launchpad.net/qemu/+bug/1776920#15).

~Could there be a 32G max image size limit, even though multipass allows us to create bigger disks?~ (I already tried with a 40GB disk image, writing the whole disk with dd, and it seems to work OK.)

Would the VirtualBox backend be more stable? I already had to throw away a corrupt VM disk image due to the same behaviour.

Thanks again!

erny avatar Jan 18 '21 09:01 erny

VirtualBox may well be a bit more mature, since it's a separate entity managing the VM processes and they've had time to work out some of the issues. We default to hyperkit so as not to have external dependencies, and the default networking is nicer (with VBox you can't access the instance by IP).
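
If you want to try it, switching the backend is a single setting (a sketch; requires VirtualBox installed, and existing hyperkit instances won't carry over):

# switch multipass's backend driver to VirtualBox
sudo multipass set local.driver=virtualbox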

I'll repurpose this issue to track the potential suspend/resume issue.

Saviq avatar Jan 18 '21 10:01 Saviq

This may or may not be the issue I'm also having (#1929). How would I go about trying to use qcow-tool or any other tool to repair what appears to be a broken instance?

MattWilcox avatar Jan 19 '21 08:01 MattWilcox

I'm seeing the same issue (on macOS 11.4 Beta (20F5065a) with multipass 1.6.2-mac). I have an instance, created a number of weeks ago, that worked until a few days ago.

I haven't force-powered-off the machine as far as I can remember, though I have done some shutdowns/reboots, and only the internal SSD is in use. MacBook Pro 2016 15".

I tried to check with:

sh-3.2# pwd
/var/root/Library/Application Support/multipassd/vault/instances/ubuntu
sh-3.2# /Applications/Docker.app/Contents/MacOS/qcow-tool check ./ubuntu-20.04-server-cloudimg-amd64.img
qcow-tool: [INFO] image has 504 free sectors and 58946 used sectors
Qcow file seems intact.
Total free blocks: 504
Total used blocks: 58946
sh-3.2# ls
cloud-init-config.iso					ubuntu-20.04-server-cloudimg-amd64-initrd-generic	ubuntu-20.04-server-cloudimg-amd64.img
pty							ubuntu-20.04-server-cloudimg-amd64-vmlinuz-generic
sh-3.2# pwd
/var/root/Library/Application Support/multipassd/vault/instances/ubuntu
sh-3.2# /Applications/Docker.app/Contents/MacOS/qcow-tool repair ./ubuntu-20.04-server-cloudimg-amd64.img
qcow-tool: [INFO] Resized file to 58950 clusters (7545600 sectors)
qcow-tool: [INFO] image has 0 free sectors and 58946 used sectors
qcow-tool: [INFO] Zeroing existing refcount table
qcow-tool: [INFO] Incrementing refcount of the refcount table clusters
qcow-tool: [INFO] Incrementing refcount of the header
qcow-tool: [INFO] Incrementing refcount of the 1s L1 table clusters starting at 21333
qcow-tool: [INFO] Allocator: 58946 used; 0 junk; 0 erased; 0 available; 0 copies; 0 roots; 0 Copying; 0 Copied; 0 Flushed; 0 Referenced; max_cluster = 58949
qcow-tool: [INFO] Allocator: file contains cluster 0 .. 58949 will enlarge file to 0 .. 59461
qcow-tool: [INFO] resize: adding available clusters (Node ((x 58950) (y 59461) (l Empty) (r Empty) (h 1) (cardinal 512)))
qcow-tool: [INFO] Incrementing refcount of the data clusters

However, despite doing this, I get the same issue before and after, i.e., in the log:

$ multipass start ubuntu                                                                                                                       [9:49:57]
start failed: The following errors occurred:
ubuntu: timed out waiting for response
sh-3.2# pwd
/Library/Logs/Multipass
sh-3.2# tail ubuntu-hyperkit.log
[   16.657983] cloud-init[839]: Cloud-init v. 21.1-19-gbad84ad4-0ubuntu1~20.04.2 finished at Wed, 12 May 2021 08:25:21 +0000. Datasource DataSourceNoCloud [seed=/dev/sr0][dsmode=net].  Up 16.64 seconds
[  OK  ] Finished Execute cloud user/final scripts.
[  OK  ] Reached target Cloud-init target.

Ubuntu 20.04.2 LTS ubuntu ttyS0

[   61.897079] ata1.00: revalidation failed (errno=-5)
[   72.386019] ata1.00: revalidation failed (errno=-5)
[  103.638766] ata1.00: revalidation failed (errno=-5)
sh-3.2#

I can create and launch a new instance OK, though.

planetf1 avatar May 12 '21 08:05 planetf1

I'm experiencing this problem of images that seem to have gotten corrupted. Has anything changed? Is there a way to repair images?

mhucka avatar Jun 15 '22 00:06 mhucka

Since Hyperkit is deprecated and being removed in the next release, I'm closing this. Please feel free to reopen or open a new issue if this is still relevant to Multipass today. Thanks!

townsend2010 avatar Jul 19 '23 13:07 townsend2010