UTM
fix: Add workarounds to prevent filesystem corruption on Apple Virtualization VMs.
This is a workaround for #4840. Apple introduced a new VZNVMExpressControllerDeviceConfiguration in macOS 14 Sonoma's Virtualization framework, which emulates an NVMe disk for Linux guests. In my initial testing this emulated disk is slower than the virtio disk, but it does not have the filesystem corruption issue present on the default virtio interface in most Linux kernels. macOS guests cannot use this device configuration, but they do not experience the filesystem corruption issue in the first place. On macOS 12+ hosts that cannot use VZNVMExpressControllerDeviceConfiguration, the cache mode of virtio drives in Linux VMs is changed to .cached, because that is verified to also fix the filesystem corruption: https://github.com/utmapp/UTM/issues/4840#issuecomment-1824340975
This PR:
- Adds a configuration option in the UI that lets users select whether the disk should use the NVMe interface. It only affects Linux VMs; on macOS VMs this setting does nothing.
- Uses NVMe as the default for new Linux VMs on macOS 14+.
- Adds localization for Chinese (Simplified and Traditional) and Japanese.
- On macOS 12+ hosts, changes the caching mode of virtio drives in Linux VMs to .cached.
This successfully works around the file corruption problem. I haven't noticed any new issues with NVMe so far.
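For readers who want to see what the two approaches look like at the framework level, here is a minimal sketch using Apple's Virtualization framework APIs. The helper function and its parameters are illustrative, not the exact code in this PR:

```swift
import Virtualization

// Minimal sketch (not UTM's actual code): choose a storage device configuration
// for a Linux guest based on the host macOS version and a user setting.
func makeLinuxStorageDevice(imageURL: URL, useNVMe: Bool) throws -> VZStorageDeviceConfiguration {
    if #available(macOS 14, *), useNVMe {
        // NVMe interface introduced in macOS 14: avoids the virtio corruption,
        // at the cost of somewhat slower I/O in testing.
        let attachment = try VZDiskImageStorageDeviceAttachment(url: imageURL, readOnly: false)
        return VZNVMExpressControllerDeviceConfiguration(attachment: attachment)
    } else if #available(macOS 12, *) {
        // virtio with the cache mode forced to .cached, which is also verified
        // to avoid the corruption (see the linked issue comment above).
        let attachment = try VZDiskImageStorageDeviceAttachment(
            url: imageURL,
            readOnly: false,
            cachingMode: .cached,
            synchronizationMode: .full)
        return VZVirtioBlockDeviceConfiguration(attachment: attachment)
    } else {
        // macOS 11 does not expose a caching mode; fall back to the plain virtio attachment.
        let attachment = try VZDiskImageStorageDeviceAttachment(url: imageURL, readOnly: false)
        return VZVirtioBlockDeviceConfiguration(attachment: attachment)
    }
}
```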
Updated to include setting the caching mode to .cached for macOS 12 hosts. Now there is some uncertainty about whether to keep NVMe as the default for new VMs, since .cached virtio drives seem to address the filesystem problem and perform better than NVMe drives.
I'd say for Linux, we use uncached when NVMe is available (macOS >= 14) and virtio + cached otherwise. Your PR is spot on.
Edit: ⬆️ was about doing all of this behind the scenes or having good defaults for new users. Otherwise, if NVMe isn't enabled by default for macOS >= 14 and caching for older macOS, people will still experience corruption and will have to live with it (and blame Linux) or research what to do. In both cases they will waste their lives dealing with this.
My reasoning is that uncached seems to be the de facto default even on 32 GB RAM systems. I couldn't find any information about the logic behind automatic and when it switches between cached and uncached; it might depend on total RAM. It makes sense to use uncached so we don't waste host RAM caching the guest disk, which the guest already caches itself, and thus avoid double caching of guest disk access. I can imagine workloads that would benefit from host caching, but I think we should keep the defaults as close as possible to the underlying framework.
BTW, it might be a good idea to let the user decide which mode is best for their use case, but that would have to be an advanced setting with some explanation. The same applies to synchronization modes. But I think that should first be discussed with @osy, as UI changes are needed. Looking at the QEMU support, where one can adjust almost everything at a very low level, I'd say this would be consistent with the general look and feel of UTM.
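To make the idea concrete, here is a tiny, purely illustrative sketch of how such an advanced setting could map a user's choice onto the framework's caching and synchronization enums. UserCachePreference and attachmentOptions are made-up names, not anything that exists in UTM:

```swift
import Virtualization

// Hypothetical user-facing choice for an advanced disk setting (not part of this PR).
enum UserCachePreference: String {
    case automatic, cached, uncached
}

// Map the hypothetical preference onto the framework's real enums (macOS 12+).
@available(macOS 12, *)
func attachmentOptions(for preference: UserCachePreference)
    -> (caching: VZDiskImageCachingMode, sync: VZDiskImageSynchronizationMode) {
    switch preference {
    case .automatic:
        // Let the framework decide; the heuristic behind .automatic is undocumented.
        return (.automatic, .full)
    case .cached:
        return (.cached, .full)
    case .uncached:
        return (.uncached, .full)
    }
}
```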
I see that NVMe defaults to false. I think we should have good defaults for new users. Otherwise, if NVMe isn't enabled by default for macOS >= 14 and caching for older macOS, people will still experience corruption and will have to live with it (and blame Linux or UTM) or research what to do. In both cases they will waste their lives dealing with this.
Defaulting to NVMe isn't great because it might render some VMs unbootable. It might be a good idea to figure out how to add more info about NVMe and why it was introduced and pros/cons.
This PR does not change the options of existing disks. Instead, when a user creates a new VM, the disk created for it is set up as an NVMe drive on macOS 14+ hosts by the following lines in the wizard:
https://github.com/utmapp/UTM/blob/6c02a1a360ec2f528e1b967bb87071c3d209afc4/Platform/Shared/VMWizardState.swift#L256-L259
This is what I meant by "default": not magically using NVMe for every disk, but deciding what to use for newly created VMs.
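For context, a simplified and purely hypothetical sketch of what that kind of wizard default amounts to; the type and property names below are illustrative, and the real logic lives at the VMWizardState.swift lines linked above:

```swift
// Illustrative sketch only: when the wizard creates the first disk for a new VM,
// it can opt into the NVMe interface for Linux guests on hosts that support it.
enum GuestOS { case linux, macOS }

struct NewDriveSettings {
    var isNVMe = false  // hypothetical field; existing disks are never touched
}

func defaultDrive(for os: GuestOS) -> NewDriveSettings {
    var drive = NewDriveSettings()
    if #available(macOS 14, *), os == .linux {
        // Only newly created Linux VMs get NVMe by default on macOS 14+.
        drive.isNVMe = true
    }
    return drive
}
```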
Thanks, this makes sense.
I did some more testing, and for some workloads, like unit tests, I see about a 15-20% slowdown with the NVMe drive, which is getting significant for my use case. I'm going to run my VMs in virtio mode with host caching.
Now I'm reconsidering using NVMe as the default for new VMs. @wpiekutowski did you experience any issues with the forced cached mode on virtio yet? If virtio is good enough then we can probably default to that for better performance.
I haven't used UTM much in the last week, so I can't say one way or another. stress-ng and parallel cp work fine with caching + virtio, so that's probably a good compromise between stability and performance. We might want to add a note somewhere about the fact that host disk image caching is enabled for Linux guests when NVMe is off.
The NVMe disk works fine on Linux kernel 5.x (Ubuntu 22.04.3) and always fails on macOS Sonoma 14.2 with Debian 12.4.0 (Linux kernel 6.1), but it works with caching + virtio. This is a new problem.
When I tested with Fedora, it was using the upstream Fedora kernel (the 6.5 branch kernel) and did not exhibit this issue. It now appears that macOS 14.2 might have modified the NVMe device. Experiencing a straightforward NVMe I/O error is something we haven't encountered before. I'm re-testing in the same environment using stress-ng to see if I can reproduce it. It's possible that the NVMe disk also requires the caching mode to be set to cached now.
Strange, my pre-14.2 environment with a quick 10-minute stress-ng run cannot reproduce this. @wangqiang1588 does this always occur or does it happen randomly?
I installed Debian 12.4.0 twice, and it happened both times.
- install Debian from the ISO with 24 GB memory, 12 CPU cores, and a 500 GB disk
- send a 113 GB Android AOSP zip file over with the scp command (success)
- unzip the 113 GB AOSP file (success)
- repo init -b android-security-11.0.0_r72 (success)
- repo sync -j4 --fail-fast -l (locally syncs a lot of files, then fails here, both times)
It always fails on the "repo sync" command.
Then I changed it to caching + virtio and ran the same steps, and it succeeded.
With the same steps on Ubuntu 22.04.3 (Linux kernel 5.15) with an NVMe disk, the failure never happens (same AOSP zip file).
It looks like some NVMe driver change after the 5.15 kernel causes this failure.
On the NixOS unstable channel + Linux kernel 6.6.7 using uncached NVMe, all my tests are passing. No disk corruption during stress-ng, parallel cp, or a system upgrade.
Maybe the number of files triggers this bug; "repo sync" creates a lot of files in a very short time (.h, .c, .cpp, .java, ...).
A Fedora 39 test did not hit this bug. It looks like it only happens on Linux kernel 6.1; let me try to upgrade Debian's kernel.
Today I upgraded macOS Sonoma 14.2 to 14.2.1 and everything works again. Linux kernels 6.1 and 6.5 both work fine on Debian 12.4. Maybe this is a bug in macOS Sonoma 14.2.
Can confirm tart works perfectly for me (see the reference to cirruslabs/tart above). No disk corruption after two days. Tart uses NVMe disks. Host: Sonoma 14.2.1. Guest: Ubuntu 24.04 (Noble Numbat), kernel 6.6.7.
@gnattu Is this ready to be merged? Can you rebase from the main branch?
@osy I just rebased from main.
One change I've made is that I reverted NVMe as the default and now prefer virtio with cached mode, because there are new reports that NVMe mode still causes errors in some cases where virtio + cached works just fine.
🎉 Can't wait. Please include this important fix for Linux VMs in 4.5.
Any news on when this might get merged and released?
An early beta release with this fix would be much appreciated, as some other projects have done. Otherwise it is practically impossible to use a stable Linux VM with Apple Virtualization, no?
For reference, this seems to have been solved here some time ago: https://github.com/lima-vm/lima/pull/2026
And there seem to have been no corruption issues since.
@osy I apologize for pinging; do you think this could be merged? The UTM experience is just so buggy without this.
This will be included in the next release. If you need this now, you can follow the build instructions along with the prebuilt dependencies. Having multiple people ask about a release isn't going to speed it up. If you want to help maintain this project please let me know.
I'm already using a version I built myself with the patch, but I need to keep it up to date and it isn't signed.
As I said, I apologize for pinging; I just wanted to make sure this didn't accidentally slip your mind. I understand the effort an open source maintainer needs to put into their projects, so thank you! No claim or demand on my part.
If I can help with testing this (and reporting my test findings) I'll do it. But I don't think I have the bandwidth to become a maintainer for this project.
I'd say this is a fairly helpful issue: people have contributed information, a solution, and testing. I don't think there is any need to apologize, since this is a corruption issue, there is a fix, and there are sincere people trying to be helpful. Given that the fix is four months old and people tend to forget things among everything else going on, folks are just trying to leave a kind reminder.