
Disk performance with VZ

Open jandubois opened this issue 4 months ago • 14 comments

Description

From https://github.com/ddev/ddev/issues/4466#issuecomment-1361261185:

@75th I believe the poor performance in vz is related to how persistent disks are attached in lima. I will update this in lima once https://github.com/Code-Hex/vz/pull/117 is merged.

Temporarily if you are interested you can do this workaround and then try importing.

colima ssh    # open a shell inside the VM
sudo -i
echo 'write through' > /sys/class/block/vda/queue/write_cache    # report the virtio disk cache as write-through so the guest stops issuing flushes

Check this for more https://github.com/docker/roadmap/issues/7#issuecomment-1044018513

@balajiv113 I don't find write_cache mentioned in Lima; has this been addressed in a different way, or is this still an open issue?

I've only copied the comment into a new issue here because the original comment was mentioned on Slack today.

jandubois avatar Aug 05 '25 17:08 jandubois

You might want to mention what part is about virtiofs host mounts, and what part is about virtio local disks.

I just mean that "disk performance" can be a little ambiguous; I think we have been mixing the two a bit.

afbjorklund avatar Aug 05 '25 17:08 afbjorklund

I only created the issue because I saw the discussion on Slack with the reference to a comment on a closed issue in a separate repo, and I didn't want it to get lost.

I don't have any further insight into the issue, and don't even know if it may have already been fixed in a different way that does not require writing to /sys/class/block/vda/queue/write_cache. Maybe there is a VZ API that allows it to be configured directly?

jandubois avatar Aug 05 '25 17:08 jandubois

Thanks for creating this issue.

Like I mentioned in Slack, I haven't noticed any performance improvement when enabling write through, but what I did notice is that on the macOS side the Virtual Machine Service for the limactl process was no longer writing gigabytes of data to disk over time. I believe this is because, with write through, the Virtualization.Framework appears to no longer issue fcntl(F_FULLFSYNC) on every flush, which the referenced post on the Docker repo implies it otherwise does.

msimkunas avatar Aug 06 '25 04:08 msimkunas

I think the virtualization framework is behaving correctly. When the guest calls fsync(), the right thing to do on the macOS side is fcntl(F_FULLFSYNC), since fsync() on macOS does not actually guarantee that the data reaches stable storage.

We should make the default settings safe, and provide options for improved performance for people that don't care about durability.

nirs avatar Aug 06 '25 14:08 nirs

I think the virtualization framework is behaving correctly. When the guest calls fsync(), the right thing to do on the macOS side is fcntl(F_FULLFSYNC), since fsync() on macOS does not actually guarantee that the data reaches stable storage.

We should make the default settings safe, and provide options for improved performance for people that don't care about durability.

The default behavior is actually much worse for the durability of the SSD because it causes a lot of writes. With larger directory trees, this can easily cause the Virtual Machine Service to write gigabytes of data.

Since Lima is used primarily for local development environments, I think it makes sense to assume that preventing such expensive F_FULLFSYNC operations is more important than ensuring crash safety, which is more of a concern in production environments.

Or, at the very least, it should be a documented option.

msimkunas avatar Aug 06 '25 14:08 msimkunas

I think the virtualization framework is behaving correctly. When the guest calls fsync(), the right thing to do on the macOS side is fcntl(F_FULLFSYNC), since fsync() on macOS does not actually guarantee that the data reaches stable storage. We should make the default settings safe, and provide options for improved performance for people that don't care about durability.

The default behavior is actually much worse for the durability of the SSD because it causes a lot of writes. With larger directory trees, this can easily cause the Virtual Machine Service to write gigabytes of data.

Since Lima is used primarily for local development environments, I think it makes sense to assume that preventing such expensive F_FULLFSYNC operations is more important than ensuring crash safety, which is more of a concern in production environments.

This is another aspect - for correctness, when the guest asks to sync, the hypervisor must sync the data to storage and wait until the data actually reaches it. Anything else is irresponsible and will lead to data loss.

If the sync is too expensive for your use case you should check why this happens. Maybe the guest is doing unnecessary syncs, or there is a bug in virtiofs in the virtualization framework causing unneeded syncs?

If you don't care about the data you should be able to configure this, but I don't think this is a reasonable default.

nirs avatar Aug 06 '25 19:08 nirs

I'm confused by this discussion. I don't know exactly how the hypervisor comes into this; isn't this a configuration of the guest OS?

From the anecdotal notes above it sounds like the guest OS is frequently doing a full sync unless the cache is configured as a write-through cache, in which case the full sync is not necessary because the disk is already known to be in sync (I assume).

Now, I would have thought that even the full sync would only flush changed data to disk, so I don't understand how this makes a difference on the host. But in either case the data should be safely written, shouldn't it?

@nirs Can you explain why enabling the write-through cache would make the data less safe? From my understanding it should have the opposite effect, in making sure all changes are always written directly to disk and not kept in cache.

jandubois avatar Aug 06 '25 19:08 jandubois

Related links:

  • fsync: https://pubs.opengroup.org/onlinepubs/9699919799/
  • qemu caching options: https://doc.opensuse.org/documentation/leap/virtualization/html/book-virtualization/cha-cachemodes.html
  • vz caching mode: https://developer.apple.com/documentation/virtualization/vzdiskimagecachingmode?language=objc
  • vz synchronization mode: https://developer.apple.com/documentation/virtualization/vzdiskimagesynchronizationmode?language=objc

nirs avatar Aug 06 '25 19:08 nirs

There are some comments in our source: https://github.com/lima-vm/lima/blob/cbc0895566c6c506793ca801e3d35838db986220/pkg/driver/vz/vm_darwin.go#L38-L43

I don't have time to crawl into this rabbit hole right now.
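
For reference, though, my understanding is that those comments boil down to the way the disk image attachment is created through the Code-Hex/vz bindings: the caching and synchronization modes are currently hard-coded rather than configurable. A minimal sketch of what that looks like, assuming the vz/v3 function and constant names used below are accurate (I have not double-checked the exact signatures):

package main

import (
	"log"

	"github.com/Code-Hex/vz/v3"
)

// buildDiskConfig attaches a raw disk image with an explicit caching and
// synchronization mode instead of relying on the framework defaults.
func buildDiskConfig(diskPath string) (*vz.VirtioBlockDeviceConfiguration, error) {
	attachment, err := vz.NewDiskImageStorageDeviceAttachmentWithCacheAndSync(
		diskPath,
		false, // read-write
		vz.DiskImageCachingModeAutomatic,
		// Fsync: a plain fsync() on guest flushes instead of the
		// stricter (and much slower) fcntl(F_FULLFSYNC) of the Full mode.
		vz.DiskImageSynchronizationModeFsync,
	)
	if err != nil {
		return nil, err
	}
	return vz.NewVirtioBlockDeviceConfiguration(attachment)
}

func main() {
	cfg, err := buildDiskConfig("diffdisk") // path is illustrative
	if err != nil {
		log.Fatal(err)
	}
	_ = cfg // would be appended to the VM configuration's storage devices
}

The interesting knob is the synchronization mode: per the Apple docs linked above, Full roughly corresponds to fcntl(F_FULLFSYNC) on every guest flush, Fsync to a plain fsync(), and None to no host-side synchronization at all.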

jandubois avatar Aug 06 '25 19:08 jandubois

I'm confused by this discussion. I don't know exactly how the hypervisor comes into this; isn't this a configuration of the guest OS?

The hypervisor emulates the block device. The OS assumes that the block device behaves as if it were a real block device. For example, in qemu you can use:

# from qemu(1)

              cache=cache
                     cache is "none", "writeback", "unsafe", "directsync"  or  "writethrough"
                     and  controls how the host cache is used to access block data. This is a
                     shortcut that sets the cache.direct and cache.no-flush  options  (as  in
                     -blockdev),  and  additionally cache.writeback, which provides a default
                     for the write-cache option of block guest devices (as in  -device).  The
                     modes correspond to the following settings:
                       ┌──────────────┬─────────────────┬──────────────┬────────────────┐
                       │              │ cache.writeback │ cache.direct │ cache.no-flush │
                       ├──────────────┼─────────────────┼──────────────┼────────────────┤
                       │ writeback    │ on              │ off          │ off            │
                       ├──────────────┼─────────────────┼──────────────┼────────────────┤
                       │ none         │ on              │ on           │ off            │
                       ├──────────────┼─────────────────┼──────────────┼────────────────┤
                       │ writethrough │ off             │ off          │ off            │
                       ├──────────────┼─────────────────┼──────────────┼────────────────┤
                       │ directsync   │ off             │ on           │ off            │
                       ├──────────────┼─────────────────┼──────────────┼────────────────┤
                       │ unsafe       │ on              │ off          │ on             │
                       └──────────────┴─────────────────┴──────────────┴────────────────┘

                     The default mode is cache=writeback.

The default writeback mode uses the host page cache (for good performance) and flushes data when the guest asks to flush (for correctness).

In qemu the writethrough option is extremely slow; it was enabled by mistake in some cases, and replacing it with the default was a great improvement. https://github.com/qemu/qemu/commit/09615257058a0ae87b837bb041f56f7312d9ead8

From the anecdotal notes above it sounds like the guest OS is frequently doing a full sync unless the cache is configured as a write-through cache, in which case the full sync is not necessary because the disk is already known to be in sync (I assume).

We don't have much data here on the actual issue. It may be a bug in the guest, or some issue with virtiofs. @msimkunas reported excessive writes when using virtiofs in the #lima channel.

@nirs Can you explain why enabling the write-through cache would make the data less safe? From my understanding it should have the opposite effect, in making sure all changes are always written directly to disk and not kept in cache.

Based on the linked Docker issue, configuring a write-through cache in the guest prevents syncs. I don't know why. If you disable syncs, the data in the host page cache will be lost if you lose power.

I'm not sure what the semantics of the guest write-through cache are. But the value of the cache setting comes from the hypervisor emulating the block device; if you change it, don't expect the hypervisor to do the right thing.

I think a safer way is to configure the hypervisor to use the right cache and synchronization mode you want.

nirs avatar Aug 06 '25 19:08 nirs

Are there any decent benchmarks posted somewhere? There was some related question in Apple Container:

  • https://github.com/apple/container/discussions/389

But I have no idea what to expect from VZ I/O with virtio or with virtiofs, compared to native performance?

Something like the comparisons done for virtfs: https://landley.net/kdocs/ols/2010/ols2010-pages-109-120.pdf
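
In the meantime, a first data point doesn't need a full fio setup: a trivial write+fsync loop, run inside the guest once against the virtio disk and once against a virtiofs mount, already shows how expensive each flush is under a given caching/synchronization configuration. A rough sketch (block size and iteration count are arbitrary):

// syncbench is a crude probe, not a real benchmark: it writes small
// blocks and fsyncs after each one, so the measured time is dominated
// by the cost of each guest flush under the host's caching and
// synchronization settings.
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	const (
		iterations = 1000
		blockSize  = 4096
	)
	f, err := os.CreateTemp(".", "syncbench-*")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, blockSize)
	start := time.Now()
	for i := 0; i < iterations; i++ {
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
		if err := f.Sync(); err != nil { // fsync() -> flush request to the host
			log.Fatal(err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("%d write+fsync cycles in %v (%.2f ms/op)\n",
		iterations, elapsed, float64(elapsed.Milliseconds())/iterations)
}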

afbjorklund avatar Aug 06 '25 20:08 afbjorklund

Are there any decent benchmarks posted somewhere?

It is hard to measure or to find such performance results.

This one is very useful and includes all the details (see the linked git repo):

  • https://developers.redhat.com/articles/2024/09/05/scaling-virtio-blk-disk-io-iothread-virtqueue-mapping#the_problem

I could reproduce the results and got up to a 3.75x speedup in fio benchmarks. I did not do any storage performance tests on macOS with qemu or vz.

nirs avatar Aug 06 '25 21:08 nirs

@afbjorklund this one is very old (2018) but contains tons of good content for qemu. https://events19.lfasiallc.com/wp-content/uploads/2017/11/Storage-Performance-Tuning-for-FAST-Virtual-Machines_Fam-Zheng.pdf

nirs avatar Aug 06 '25 21:08 nirs

@jandubois As far as I remember, once we had that DiskImageSynchronizationModeFsync support, it was almost better in terms of performance.

https://github.com/lima-vm/lima/pull/1268

That's why we didn't do the write_through, as I wasn't sure of the implications. We can still do both and compare to see which is better.
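
If it helps with that comparison, one low-effort way to A/B it without separate builds would be a switch like the sketch below. To be clear, this is purely hypothetical: the LIMA_VZ_SYNC_MODE variable does not exist in lima, and the constant names assume the Code-Hex/vz bindings mirror Apple's VZDiskImageSynchronizationMode values.

package main

import (
	"fmt"
	"os"

	"github.com/Code-Hex/vz/v3"
)

// syncModeFromEnv picks the disk image synchronization mode from a
// (hypothetical) environment variable so the same build can be tried
// with all three modes during benchmarking.
func syncModeFromEnv() vz.DiskImageSynchronizationMode {
	switch os.Getenv("LIMA_VZ_SYNC_MODE") {
	case "full":
		return vz.DiskImageSynchronizationModeFull // host does fcntl(F_FULLFSYNC): safest, slowest
	case "none":
		return vz.DiskImageSynchronizationModeNone // no host-side sync: fastest, unsafe on power loss
	default:
		return vz.DiskImageSynchronizationModeFsync // current lima behavior: plain fsync()
	}
}

func main() {
	fmt.Printf("selected synchronization mode: %v\n", syncModeFromEnv())
}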

balajiv113 avatar Sep 03 '25 05:09 balajiv113