Severe performance regression on virtual disk migration for qcow2 on ZFS, starting with ZFS 2.1.5
Distribution Name | Proxmox PVE
Distribution Version | 7.3
Kernel Version | 5.15.39-2-pve and later
Architecture | x86_64
OpenZFS Version | 2.1.5
Describe the problem you're observing
On recent Proxmox releases, which ship ZFS 2.1.5 as part of the kernel package, there is a significant slowdown when moving an empty qcow2 virtual disk file from an HDD-based pool to any other pool.
It seems this issue is related to atime updates.
The problem goes away when setting atime=off, or atime=on together with relatime=on.
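For reference, the workaround is just a dataset property change; a minimal sketch, assuming the qcow2 images live directly on the hddpool dataset shown in the iostat output below (substitute the actual dataset holding the images):

# check the current atime handling on the source dataset
zfs get atime,relatime hddpool
# either switch atime updates off completely ...
zfs set atime=off hddpool
# ... or keep atime but only update it lazily (relatime semantics)
zfs set atime=on hddpool
zfs set relatime=on hddpool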
Describe how to reproduce the problem
Update an older Proxmox installation running the 5.15.39-1-pve kernel to a recent Proxmox version (pve-no-subscription or pvetest repo), then move a qcow2 virtual disk from an HDD-based pool to another pool.
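The commands behind the timings below boil down to the following; a sketch assuming VM 100 has its scsi0 disk as a qcow2 file on a directory storage backed by hddpool and is being moved to the storage nvmepool:

# confirm kernel and ZFS versions after the upgrade
pveversion -v
# time the migration of the empty qcow2 disk from the HDD pool to the NVMe pool
time qm disk move 100 scsi0 nvmepool
# request-size histogram for the source pool
zpool iostat -r hddpool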
Include any warning/errors/backtraces from the system logs
See the discussion/analysis in this thread:
https://forum.proxmox.com/threads/weird-disk-write-i-o-pattern-on-source-disks-when-moving-virtual-disk.123639/post-538583
start of thread: https://forum.proxmox.com/threads/weird-disk-write-i-o-pattern-on-source-disks-when-moving-virtual-disk.123639/
I'm not sure which change in ZFS could cause this behaviour; maybe https://github.com/openzfs/zfs/pull/13338? @rincebrain?
5.15.39-1-pve with zfs 2.1.4:
time qm disk move 100 scsi0 nvmepool
real 0m19.974s
zpool iostat -r hddpool:
hddpool sync_read sync_write async_read async_write scrub trim
req_size ind agg ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 35 0 0 0 1 0 316 0 10 0 0 0
1K 307 0 144 0 54 0 329 5 41 0 0 0
2K 33 0 0 0 0 0 252 56 2 12 0 0
4K 9 0 0 0 0 2 335 77 0 12 0 0
8K 213 0 12 0 2 5 0 153 12 7 0 0
16K 0 0 0 0 0 5 0 71 0 20 0 0
32K 0 0 0 0 0 2 0 6 0 15 0 0
64K 16 0 32 0 0 0 0 0 0 2 0 0
128K 6 0 0 0 0 0 0 0 0 5 0 0
256K 0 0 0 0 0 0 0 0 0 3 0 0
512K 0 0 0 0 0 0 0 0 0 3 0 0
1M 0 0 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0 0 0
----------------------------------------------------------------------------------------------
5.15.39-2-pve with zfs 2.1.5:
time qm disk move 100 scsi0 nvmepool
real 1m10.214s
zpool iostat -r hddpool:
hddpool sync_read sync_write async_read async_write scrub trim
req_size ind agg ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 33 0 0 0 3 0 13.2K 0 7 0 0 0
1K 324 0 5.94K 0 61 0 15.9K 0 171 1 0 0
2K 47 0 0 0 0 5 10.0K 2.62K 11 9 0 0
4K 16 0 0 0 0 3 22.9K 2.94K 8 8 0 0
8K 216 0 12 0 2 4 20 3.73K 1 13 0 0
16K 2 0 0 0 0 7 238 3.04K 0 8 0 0
32K 0 0 0 0 0 0 0 111 0 8 0 0
64K 16 0 32 0 0 0 0 87 0 0 0 0
128K 1 0 0 0 0 0 217 7 0 2 0 0
256K 0 0 0 0 0 0 0 44 0 2 0 0
512K 0 0 0 0 0 0 0 0 0 4 0 0
1M 0 0 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0 0 0
----------------------------------------------------------------------------------------------
5.15.39-2-pve with zfs 2.1.5 and atime=on/relatime=on:
time qm disk move 100 scsi0 nvmepool
real 0m3.365s
# zpool iostat -r hddpool
hddpool sync_read sync_write async_read async_write scrub trim
req_size ind agg ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 38 0 0 0 0 0 169 0 5 0 0 0
1K 77 0 80 0 64 0 198 7 113 3 0 0
2K 17 0 0 0 0 3 133 13 4 20 0 0
4K 11 0 0 0 0 7 171 46 4 5 0 0
8K 216 0 12 0 2 5 0 52 5 6 0 0
16K 1 0 0 0 0 5 0 33 0 5 0 0
32K 1 0 0 0 0 1 0 0 0 11 0 0
64K 16 0 32 0 0 0 0 0 0 4 0 0
128K 1 0 0 0 0 0 0 0 0 3 0 0
256K 0 0 0 0 0 0 0 0 0 4 0 0
512K 0 0 0 0 0 0 0 0 0 4 0 0
1M 0 0 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0 0 0
----------------------------------------------------------------------------------------------
With the Linux 6.1 kernel and ZFS 2.1.9 it seems to be even slower than with 2.1.5 (1m27s vs. 3.18s with relatime=on).
Pathology in common with #14512 maybe?
(Replying here because I don't think it would be reasonable to comment on that bug that this one seems unrelated, after having linked it here.)
It might be unrelated, you'd have to see where you're burning your time, but my speculation went something like "disk images often contain large sparse areas" => "this is a known edge case where trying to manipulate sparse areas on things being regularly updated can cause problems, maybe it's causing problems here too".
You could try flipping the tunable zfs_dmu_offset_next_sync to 0 to see if the behavior improves. It's not dangerous; it just makes things behave as though there are no holes whenever a forced TXG sync would have been required to get a correct view of the holes, so you might end up reading/writing a bunch of zeroes instead of jumping over them if something was using SEEK_HOLE/SEEK_DATA.
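For reference, a minimal sketch of flipping that tunable at runtime, assuming the usual module parameter path under /sys/module/zfs (the modprobe.d file name is just an example):

# show the current value (1 on the affected releases)
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
# stop forcing a txg sync just to report holes on dirty files
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
# optionally make it persistent for the next module load / reboot
echo 'options zfs zfs_dmu_offset_next_sync=0' > /etc/modprobe.d/zfs-dmu-offset-next-sync.conf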
I straced the qemu-img process, but it did not reveal anything usable besides the fact that I can see a lot of lseek calls, with every seek apparently causing one or more atime updates.
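For what it's worth, the lseek pattern can be captured with something along these lines; the image path and the standalone convert command are only illustrative, not the exact command Proxmox runs:

# attach to the qemu-img process spawned by the disk move and trace only lseek
strace -f -tt -e trace=lseek -p "$(pgrep -n qemu-img)"
# or run a convert by hand and get a per-syscall summary
strace -f -c -e trace=lseek,pread64,pwrite64 \
  qemu-img convert -p -f qcow2 -O qcow2 /hddpool/images/100/vm-100-disk-0.qcow2 /tmp/out.qcow2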
this is a known edge case where trying to manipulate sparse areas on things being regularly updated can cause problems,
Did you notice that this is not related to WRITE access in any way, but only READ access? It also happens when the virtual machine is powered off, so there is no process writing to the file itself.
Apparently, simply reading the metadata-preallocated qcow2 file causes a massive amount of atime updates, and just how massive also seems to depend on the layout of the qcow2 file.
When the file is moved back and forth, the problem is gone; apparently "qemu-img convert" does not preserve the layout that "qemu-img create" set up initially.
I found this issue after copying a virtual machine from an older cluster to a newer one: moving that file with the Proxmox GUI (i.e. qemu-img) from HDD to SSD was slower than the copy via scp over the network.
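Since the behaviour seems to depend on the qcow2 layout, one way to compare the file before and after such a round trip is to look at its sparseness and cluster allocation; a sketch, with the image path only as an example:

# apparent size vs. blocks actually allocated in the filesystem
ls -ls /hddpool/images/100/vm-100-disk-0.qcow2
du -h --apparent-size /hddpool/images/100/vm-100-disk-0.qcow2
du -h /hddpool/images/100/vm-100-disk-0.qcow2
# qcow2 cluster allocation as seen by qemu
qemu-img map --output=human /hddpool/images/100/vm-100-disk-0.qcow2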
Hypothetically, it could be something like, you do something to access the VM image while it's idle (reading, not writing, just to be entirely clear), it dirties the file because of the needed atime update, and consequently you end up with the aforementioned feature triggering on SEEK_HOLE/SEEK_DATA and forcing a txg sync because it notices the file is dirty, nevermind in what way.
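One way to check whether the move really triggers a stream of forced txg syncs is to watch the pool's txg kstat while the migration runs; a sketch, assuming the source pool is hddpool:

# per-txg history (txg id, birth time, dirty bytes, sync times, ...)
cat /proc/spl/kstat/zfs/hddpool/txgs
# the newest txg id is on the last line; if it climbs rapidly during the move, syncs are being forced
watch -n 1 'tail -n 1 /proc/spl/kstat/zfs/hddpool/txgs'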
You could try flipping the tunable zfs_dmu_offset_next_sync to 0 to see if the behavior improves
It does! It restores performance to pre-2.1.5 behaviour! (20s instead of 1m20s)
Hypothetically, it could be something like, you do something to access the VM image while it's idle (reading, not writing, just to be entirely clear), it dirties the file because of the needed atime update, and consequently you end up with the aforementioned feature triggering on SEEK_HOLE/SEEK_DATA and forcing a txg sync because it notices the file is dirty, nevermind in what way.
After looking at https://github.com/openzfs/zfs/pull/12724/commits/bb8526e5c09218a27280df2adf90d7bd3dcf844c, I think it's exactly as you describe!
I have no clue how this can be resolved in a sane way.
Is there a way to check whether dirtying the file was "just" an atime update (which is not worth forcing a txg sync)?
Reading through https://man7.org/linux/man-pages/man2/open.2.html, I think O_NOATIME is not an option for open() in qemu-img.
for reference: https://github.com/openzfs/zfs/issues/4306
and
https://www.mail-archive.com/[email protected]/msg427170.html
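As an aside on the O_NOATIME point above: the flag is reachable from the shell via GNU dd's iflag=noatime, and it is subject to the open(2) restriction that the caller must own the file or have CAP_FOWNER. A rough way to see whether suppressing atime on reads changes the write activity on the source pool (the path is only an example):

# read the image with O_NOATIME, watching zpool iostat hddpool 1 in another terminal
dd if=/hddpool/images/100/vm-100-disk-0.qcow2 of=/dev/null bs=1M iflag=noatime
# and again with normal atime semantics for comparison
dd if=/hddpool/images/100/vm-100-disk-0.qcow2 of=/dev/null bs=1M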
9f6943504aec36f897f814fb7ae5987425436b11 is the commit in 2.1.x that changed our behavior. It was first in 2.1.5.
There's actually an open PR to optimize this, https://github.com/openzfs/zfs/pull/13368. There's just some outstanding feedback which needs to be addressed before it can be merged.
9f69435 is the commit in 2.1.x that changed our behavior. It was first in 2.1.5.
It goes back further and more wind-y than that, I think.
First, there was #11900, where you could get inconsistent hole data if you looked too fast, as I recall. Oopsie. So the logic was modified, though the tunable predates that, and we got #12724. But that had a flaw, so we got #12745, and #12746 because the argument was that if we effectively didn't report holes without that in most cases, the functionality was likely to bitrot and break strangely one day.
And now here we are, with not inconsistent data, but pathologically performing behavior.
I'm curious to see whether, in practice, #13368 will mean we can avoid this penalty in most cases or not.
Apparently the patch from Matthew Ahrens has been approved: https://github.com/openzfs/zfs/pull/13368#pullrequestreview-1339786787
Just for reference: https://github.com/openzfs/zfs/pull/14641
Apparently the patch from Matthew Ahrens has been approved: #13368 (review)
It has been merged. Is the problem still here?
Given that the fix (#13368) was reverted in https://github.com/openzfs/zfs/pull/14761, the question still seems relevant: is the degradation still there? Or did it go away with some other optimizations? Or is there another ticket that is going to address this problem?
Here is another issue with pathological slowness of qcow2 on top of a ZFS dataset, which apparently does not go away when disabling atime updates or dropping caches before running the performance test (I was wrong in my first post; it did not reliably fix that). Interestingly, using more modern parameters for creating the qcow2 seems to fix it.
https://bugzilla.proxmox.com/show_bug.cgi?id=7012
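The bug report above doesn't spell out the exact creation parameters, but purely as an illustration, recent qemu-img lets you choose cluster size, extended L2 entries (subcluster allocation) and metadata preallocation at create time; these particular options are assumptions, not necessarily the ones that made the difference there:

# illustrative only: create a metadata-preallocated qcow2 with 128k clusters and extended L2 entries
qemu-img create -f qcow2 -o cluster_size=128k,extended_l2=on,preallocation=metadata vm-disk.qcow2 32G
# inspect what was actually created
qemu-img info vm-disk.qcow2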
Apparently the patch from Matthew Ahrens has been approved: #13368 (review)
It has been merged. Is the problem still here?
Sorry, I missed your question @ryao. I have not seen the issue for a while. The reason may be that there was a similar problem with lseek in Samba, and that I haven't done many VM storage migrations lately.
Anyhow, I think my new finding is interesting, and I'm not sure we should only blame ZFS here.