UTM
fix: Add workarounds to prevent filesystem corruption on Apple Virtualization VMs.
This is a workaround for #4840. Apple introduced a new VZNVMExpressControllerDeviceConfiguration in macOS 14 Sonoma's Virtualization framework, which emulates an NVMe disk for Linux guests. In my initial testing this emulated disk is slower than the virtio disk, but it does not have the filesystem corruption issue present on the default virtio interface in most Linux kernels. macOS guests cannot use this device configuration, but they do not experience the filesystem corruption issue in the first place. On macOS 12+ hosts that cannot use VZNVMExpressControllerDeviceConfiguration, the cache mode of virtio drives in Linux VMs is changed to .cached, because that is verified to also fix the filesystem corruption: https://github.com/utmapp/UTM/issues/4840#issuecomment-1824340975
This PR:
- Adds a configuration option in the UI that lets users select whether the disk should use the NVMe interface. It only affects Linux VMs; on macOS VMs this setting does nothing.
- Uses NVMe as the default for new Linux VMs on macOS 14+.
- Adds localization for Chinese (Simplified and Traditional) and Japanese.
- On macOS 12+ hosts, changes the caching mode of virtio drives in Linux VMs to .cached.
This successfully works around the file corruption problem. I haven't noticed any new issues with NVMe so far.
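For readers who want to see what the two approaches look like at the framework level, here is a minimal sketch using Apple's Virtualization framework APIs. The helper function and its parameters are illustrative, not the exact code in this PR:

```swift
import Virtualization

// Minimal sketch (not UTM's actual code): choose a storage device configuration
// for a Linux guest based on the host macOS version and a user setting.
func makeLinuxStorageDevice(imageURL: URL, useNVMe: Bool) throws -> VZStorageDeviceConfiguration {
    if #available(macOS 14, *), useNVMe {
        // NVMe interface introduced in macOS 14: avoids the virtio corruption,
        // at the cost of somewhat slower I/O in testing.
        let attachment = try VZDiskImageStorageDeviceAttachment(url: imageURL, readOnly: false)
        return VZNVMExpressControllerDeviceConfiguration(attachment: attachment)
    } else if #available(macOS 12, *) {
        // virtio with the cache mode forced to .cached, which is also verified
        // to avoid the corruption (see the linked issue comment above).
        let attachment = try VZDiskImageStorageDeviceAttachment(
            url: imageURL,
            readOnly: false,
            cachingMode: .cached,
            synchronizationMode: .full)
        return VZVirtioBlockDeviceConfiguration(attachment: attachment)
    } else {
        // macOS 11 does not expose a caching mode; fall back to the plain virtio attachment.
        let attachment = try VZDiskImageStorageDeviceAttachment(url: imageURL, readOnly: false)
        return VZVirtioBlockDeviceConfiguration(attachment: attachment)
    }
}
```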
Updated to include setting the caching mode to .cached for macOS 12 hosts. Now there is some uncertainty about whether to keep NVMe as the default for new VMs, since .cached virtio drives seem to address the filesystem problem and perform better than NVMe drives.
I'd say for Linux, we use uncached when NVMe is available (macOS >= 14) and virtio + cached otherwise. Your PR is spot on.
Edit: ⬆️ was about doing all of this behind the scenes or having good defaults for new users. Otherwise, if NVMe isn't enabled by default for macOS >= 14 and caching for older macOS, people will still experience corruption and will have to live with it (and blame Linux) or research what to do. In both cases they will waste their lives dealing with this.
My reasoning is that uncached seems to be the de facto default even on 32 GB RAM systems. I couldn't find any information about the logic behind automatic and when it switches between cached and uncached; it might depend on total RAM. It makes sense to use uncached so we don't waste host RAM caching the guest disk, which the guest already caches itself, and thus avoid double caching of guest disk access. I can imagine workloads that would benefit from host caching, but I think we should keep the defaults as close as possible to the underlying framework.
BTW, it might be a good idea to let the user decide which mode is best for their use case, but that would have to be an advanced setting with some explanation. The same applies to synchronization modes. But I think that should first be discussed with @osy, as UI changes are needed. Looking at the QEMU support, where one can adjust almost everything at a very low level, I'd say this would be consistent with the general look and feel of UTM.
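To make the idea concrete, here is a tiny, purely illustrative sketch of how such an advanced setting could map a user's choice onto the framework's caching and synchronization enums. UserCachePreference and attachmentOptions are made-up names, not anything that exists in UTM:

```swift
import Virtualization

// Hypothetical user-facing choice for an advanced disk setting (not part of this PR).
enum UserCachePreference: String {
    case automatic, cached, uncached
}

// Map the hypothetical preference onto the framework's real enums (macOS 12+).
@available(macOS 12, *)
func attachmentOptions(for preference: UserCachePreference)
    -> (caching: VZDiskImageCachingMode, sync: VZDiskImageSynchronizationMode) {
    switch preference {
    case .automatic:
        // Let the framework decide; the heuristic behind .automatic is undocumented.
        return (.automatic, .full)
    case .cached:
        return (.cached, .full)
    case .uncached:
        return (.uncached, .full)
    }
}
```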
I see that NVMe defaults to false. I think we should have good defaults for new users. Otherwise, if NVMe isn't enabled by default for macOS >= 14 and caching for older macOS, people will still experience corruption and will have to live with it (and blame Linux or UTM) or research what to do. In both cases they will waste their lives dealing with this.
Defaulting to NVMe isn't great because it might render some VMs unbootable. It might be a good idea to figure out how to add more info about NVMe and why it was introduced and pros/cons.
This PR does not change the options of existing disks. Instead, when a user creates a new VM, the disk created for it is set up as an NVMe drive on macOS 14+ hosts by the following lines in the wizard:
https://github.com/utmapp/UTM/blob/6c02a1a360ec2f528e1b967bb87071c3d209afc4/Platform/Shared/VMWizardState.swift#L256-L259
This is what I meant by "default": not magically using NVMe for every disk, but deciding what to use for newly created VMs.
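For context, a simplified and purely hypothetical sketch of what that kind of wizard default amounts to; the type and property names below are illustrative, and the real logic lives at the VMWizardState.swift lines linked above:

```swift
// Illustrative sketch only: when the wizard creates the first disk for a new VM,
// it can opt into the NVMe interface for Linux guests on hosts that support it.
enum GuestOS { case linux, macOS }

struct NewDriveSettings {
    var isNVMe = false  // hypothetical field; existing disks are never touched
}

func defaultDrive(for os: GuestOS) -> NewDriveSettings {
    var drive = NewDriveSettings()
    if #available(macOS 14, *), os == .linux {
        // Only newly created Linux VMs get NVMe by default on macOS 14+.
        drive.isNVMe = true
    }
    return drive
}
```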
Thanks, this makes sense.
I did some more testing, and for some workloads, like unit tests, I see about a 15-20% slowdown with the NVMe drive, which is getting significant for my use case. I'm going to run my VMs in virtio mode with host caching.
Now I'm reconsidering using NVMe as the default for new VMs. @wpiekutowski did you experience any issues with the forced cached mode on virtio yet? If virtio is good enough then we can probably default to that for better performance.
I haven't used UTM much in the last week, so I can't say one way or another. stress-ng and parallel cp work fine with caching + virtio, so that's probably a good compromise between stability and performance. We might want to add a note somewhere about the fact that host disk image caching is enabled for Linux guests when NVMe is off.
The NVMe disk works fine on Linux kernel 5.x (Ubuntu 22.04.3) and always fails on macOS Sonoma 14.2 with Debian 12.4.0 (Linux kernel 6.1), but it works with caching + virtio. This is a new problem.
When I tested with Fedora, it was using the upstream Fedora kernel (the 6.5 branch kernel) and did not exhibit this issue. It now appears that macOS 14.2 might have modified the NVMe device. Experiencing a straightforward NVMe I/O error is something we haven't encountered before. I'm re-testing in the same environment using stress-ng to see if I can reproduce it. It's possible that the NVMe disk also requires the caching mode to be set to cached now.
Strange, my pre-14.2 environment with a quick 10-minute stress-ng run cannot reproduce this. @wangqiang1588 does this always occur or does it happen randomly?
I installed Debian 12.4.0 twice, and it happened both times.
- install Debian from the ISO with 24 GB memory, 12 CPU cores, and a 500 GB disk
- send a 113 GB Android AOSP zip file over with the scp command (success)
- unzip the 113 GB AOSP file (success)
- repo init -b android-security-11.0.0_r72 (success)
- repo sync -j4 --fail-fast -l (locally syncs a lot of files, then fails here, both times)
It always fails on the "repo sync" command.
Then I changed it to caching + virtio and ran the same steps, and it succeeded.
With the same steps on Ubuntu 22.04.3 (Linux kernel 5.15) with an NVMe disk, the failure never happens (same AOSP zip file).
It looks like some NVMe driver change after the 5.15 kernel causes this failure.
On the NixOS unstable channel + Linux kernel 6.6.7 using uncached NVMe, all my tests are passing. No disk corruption during stress-ng, parallel cp, or a system upgrade.
Maybe the number of files triggers this bug; "repo sync" creates a lot of files in a very short time (.h, .c, .cpp, .java, ...).
A Fedora 39 test did not hit this bug. It looks like it only happens on Linux kernel 6.1; let me try to upgrade Debian's kernel.
Today I upgraded macOS Sonoma 14.2 to 14.2.1 and everything works again. Linux kernels 6.1 and 6.5 both work fine on Debian 12.4. Maybe this is a bug in macOS Sonoma 14.2.
Can confirm tart works perfectly for me (see the reference to cirruslabs/tart above). No disk corruption after two days. Tart uses NVMe disks. Host: Sonoma 14.2.1. Guest: Ubuntu 24.04 (Noble Numbat), kernel 6.6.7.
@gnattu Is this ready to be merged? Can you rebase from the main branch?
@osy I just rebased from main.
One change I've made is that I reverted NVMe as the default and now prefer virtio with cached mode, because there are new reports that NVMe mode still causes errors in some cases where virtio + cached works just fine.
🎉 Can't wait. Please include this important fix for Linux VMs in 4.5.
Any news on when this might get merged and released?
An early beta release with this fix would be much appreciated, as some other projects have done. Otherwise it is practically impossible to use a stable Linux VM with Apple Virtualization, no?
For reference, this seems to have been solved here some time ago: https://github.com/lima-vm/lima/pull/2026
And there seem to have been no corruption issues since.
@osy I apologize for pinging; do you think this could be merged? The UTM experience is just so buggy without this.
This will be included in the next release. If you need this now, you can follow the build instructions along with the prebuilt dependencies. Having multiple people ask about a release isn't going to speed it up. If you want to help maintain this project please let me know.
I'm already using a version I built myself with the patch, but I need to keep it up to date and it isn't signed.
As I said, I apologize for pinging; I just wanted to make sure this didn't accidentally slip your mind. I understand the effort an open source maintainer needs to put into their projects, so thank you! No claim or demand on my part.
If I can help with testing this (and reporting my test findings) I'll do it. But I don't think I have the bandwidth to become a maintainer for this project.
I'd say this is a fairly helpful issue: people have contributed information, a solution, and testing. I don't think there is any need to apologize, since this is a corruption issue, there is a fix, and there are sincere people trying to be helpful. Given that the fix is four months old and people tend to forget things among everything else going on, folks are just trying to leave a kind reminder.