Severe performance regression on virtual disk migration for qcow2 on ZFS, starting with ZFS 2.1.5
Distribution Name | Proxmox PVE
Distribution Version | 7.3
Kernel Version | 5.15.39-2-pve and later
Architecture | x86_64
OpenZFS Version | 2.1.5
Describe the problem you're observing
On recent Proxmox releases, which ship ZFS 2.1.5 as part of the kernel package, there is a significant slowdown when moving an empty qcow2 virtual disk file from an HDD-based pool to any other pool.
It seems this issue is related to atime updates.
The problem goes away when setting atime=off, or atime=on together with relatime=on.
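For reference, the workaround is just a dataset property change; a minimal sketch, assuming the qcow2 images live directly on the hddpool dataset shown in the iostat output below (substitute the actual dataset holding the images):

# check the current atime handling on the source dataset
zfs get atime,relatime hddpool
# either switch atime updates off completely ...
zfs set atime=off hddpool
# ... or keep atime but only update it lazily (relatime semantics)
zfs set atime=on hddpool
zfs set relatime=on hddpool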
Describe how to reproduce the problem
Update an older Proxmox installation running the 5.15.39-1-pve kernel to a recent Proxmox version (pve-no-subscription or pvetest repo), then move a qcow2 virtual disk from an HDD-based pool to another pool.
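The commands behind the timings below boil down to the following; a sketch assuming VM 100 has its scsi0 disk as a qcow2 file on a directory storage backed by hddpool and is being moved to the storage nvmepool:

# confirm kernel and ZFS versions after the upgrade
pveversion -v
# time the migration of the empty qcow2 disk from the HDD pool to the NVMe pool
time qm disk move 100 scsi0 nvmepool
# request-size histogram for the source pool
zpool iostat -r hddpool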
Include any warning/errors/backtraces from the system logs
See the discussion/analysis in this thread:
https://forum.proxmox.com/threads/weird-disk-write-i-o-pattern-on-source-disks-when-moving-virtual-disk.123639/post-538583
start of thread: https://forum.proxmox.com/threads/weird-disk-write-i-o-pattern-on-source-disks-when-moving-virtual-disk.123639/
I'm not sure which change in ZFS could cause this behaviour; maybe https://github.com/openzfs/zfs/pull/13338? @rincebrain?
5.15.39-1-pve with zfs 2.1.4:
time qm disk move 100 scsi0 nvmepool
real 0m19.974s
zpool iostat -r hddpool:
hddpool sync_read sync_write async_read async_write scrub trim
req_size ind agg ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 35 0 0 0 1 0 316 0 10 0 0 0
1K 307 0 144 0 54 0 329 5 41 0 0 0
2K 33 0 0 0 0 0 252 56 2 12 0 0
4K 9 0 0 0 0 2 335 77 0 12 0 0
8K 213 0 12 0 2 5 0 153 12 7 0 0
16K 0 0 0 0 0 5 0 71 0 20 0 0
32K 0 0 0 0 0 2 0 6 0 15 0 0
64K 16 0 32 0 0 0 0 0 0 2 0 0
128K 6 0 0 0 0 0 0 0 0 5 0 0
256K 0 0 0 0 0 0 0 0 0 3 0 0
512K 0 0 0 0 0 0 0 0 0 3 0 0
1M 0 0 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0 0 0
----------------------------------------------------------------------------------------------
5.15.39-2-pve with zfs 2.1.5:
time qm disk move 100 scsi0 nvmepool
real 1m10.214s
zpool iostat -r hddpool:
hddpool sync_read sync_write async_read async_write scrub trim
req_size ind agg ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 33 0 0 0 3 0 13.2K 0 7 0 0 0
1K 324 0 5.94K 0 61 0 15.9K 0 171 1 0 0
2K 47 0 0 0 0 5 10.0K 2.62K 11 9 0 0
4K 16 0 0 0 0 3 22.9K 2.94K 8 8 0 0
8K 216 0 12 0 2 4 20 3.73K 1 13 0 0
16K 2 0 0 0 0 7 238 3.04K 0 8 0 0
32K 0 0 0 0 0 0 0 111 0 8 0 0
64K 16 0 32 0 0 0 0 87 0 0 0 0
128K 1 0 0 0 0 0 217 7 0 2 0 0
256K 0 0 0 0 0 0 0 44 0 2 0 0
512K 0 0 0 0 0 0 0 0 0 4 0 0
1M 0 0 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0 0 0
----------------------------------------------------------------------------------------------
5.15.39-2-pve with zfs 2.1.5 and atime=on/relatime=on:
time qm disk move 100 scsi0 nvmepool
real 0m3.365s
# zpool iostat -r hddpool
hddpool sync_read sync_write async_read async_write scrub trim
req_size ind agg ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 38 0 0 0 0 0 169 0 5 0 0 0
1K 77 0 80 0 64 0 198 7 113 3 0 0
2K 17 0 0 0 0 3 133 13 4 20 0 0
4K 11 0 0 0 0 7 171 46 4 5 0 0
8K 216 0 12 0 2 5 0 52 5 6 0 0
16K 1 0 0 0 0 5 0 33 0 5 0 0
32K 1 0 0 0 0 1 0 0 0 11 0 0
64K 16 0 32 0 0 0 0 0 0 4 0 0
128K 1 0 0 0 0 0 0 0 0 3 0 0
256K 0 0 0 0 0 0 0 0 0 4 0 0
512K 0 0 0 0 0 0 0 0 0 4 0 0
1M 0 0 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0 0 0
----------------------------------------------------------------------------------------------
With the Linux 6.1 kernel and ZFS 2.1.9 it seems to be even slower than with 2.1.5 (1m27s vs. 3.18s with relatime=on).
Pathology in common with #14512 maybe?
(Replying here because I don't think it would be reasonable to comment on that bug that this one seems unrelated, after having linked it here.)
It might be unrelated, you'd have to see where you're burning your time, but my speculation went something like "disk images often contain large sparse areas" => "this is a known edge case where trying to manipulate sparse areas on things being regularly updated can cause problems, maybe it's causing problems here too".
You could try flipping the tunable zfs_dmu_offset_next_sync to 0 to see if the behavior improves. It's not dangerous; it just makes things behave as though there are no holes whenever a forced TXG sync would have been required to get a correct view of the holes, so you might end up reading/writing a bunch of zeroes instead of jumping over them if something was using SEEK_HOLE/SEEK_DATA.
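For reference, a minimal sketch of flipping that tunable at runtime, assuming the usual module parameter path under /sys/module/zfs (the modprobe.d file name is just an example):

# show the current value (1 on the affected releases)
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
# stop forcing a txg sync just to report holes on dirty files
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
# optionally make it persistent for the next module load / reboot
echo 'options zfs zfs_dmu_offset_next_sync=0' > /etc/modprobe.d/zfs-dmu-offset-next-sync.conf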
I straced the qemu-img process, but it did not reveal anything usable besides the fact that I can see a lot of lseek calls, with every seek apparently causing one or more atime updates.
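For what it's worth, the lseek pattern can be captured with something along these lines; the image path and the standalone convert command are only illustrative, not the exact command Proxmox runs:

# attach to the qemu-img process spawned by the disk move and trace only lseek
strace -f -tt -e trace=lseek -p "$(pgrep -n qemu-img)"
# or run a convert by hand and get a per-syscall summary
strace -f -c -e trace=lseek,pread64,pwrite64 \
  qemu-img convert -p -f qcow2 -O qcow2 /hddpool/images/100/vm-100-disk-0.qcow2 /tmp/out.qcow2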
this is a known edge case where trying to manipulate sparse areas on things being regularly updated can cause problems,
Did you notice that this is not related to WRITE access in any way, but only READ access? It also happens when the virtual machine is powered off, so there is no process writing to the file itself.
Apparently, simply reading the metadata-preallocated qcow2 file causes a massive amount of atime updates, and just how massive also seems to depend on the layout of the qcow2 file.
When the file is moved back and forth, the problem is gone; apparently "qemu-img convert" does not preserve the layout that "qemu-img create" set up initially.
I found this issue after copying a virtual machine from an older cluster to a newer one: moving that file with the Proxmox GUI (i.e. qemu-img) from HDD to SSD was slower than the copy via scp over the network.
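Since the behaviour seems to depend on the qcow2 layout, one way to compare the file before and after such a round trip is to look at its sparseness and cluster allocation; a sketch, with the image path only as an example:

# apparent size vs. blocks actually allocated in the filesystem
ls -ls /hddpool/images/100/vm-100-disk-0.qcow2
du -h --apparent-size /hddpool/images/100/vm-100-disk-0.qcow2
du -h /hddpool/images/100/vm-100-disk-0.qcow2
# qcow2 cluster allocation as seen by qemu
qemu-img map --output=human /hddpool/images/100/vm-100-disk-0.qcow2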
Hypothetically, it could be something like, you do something to access the VM image while it's idle (reading, not writing, just to be entirely clear), it dirties the file because of the needed atime update, and consequently you end up with the aforementioned feature triggering on SEEK_HOLE/SEEK_DATA and forcing a txg sync because it notices the file is dirty, nevermind in what way.
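One way to check whether the move really triggers a stream of forced txg syncs is to watch the pool's txg kstat while the migration runs; a sketch, assuming the source pool is hddpool:

# per-txg history (txg id, birth time, dirty bytes, sync times, ...)
cat /proc/spl/kstat/zfs/hddpool/txgs
# the newest txg id is on the last line; if it climbs rapidly during the move, syncs are being forced
watch -n 1 'tail -n 1 /proc/spl/kstat/zfs/hddpool/txgs'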
You could try flipping the tunable zfs_dmu_offset_next_sync to 0 to see if the behavior improves
It does! It restores performance to pre-2.1.5 behaviour! (20s instead of 1m20s)
Hypothetically, it could be something like, you do something to access the VM image while it's idle (reading, not writing, just to be entirely clear), it dirties the file because of the needed atime update, and consequently you end up with the aforementioned feature triggering on SEEK_HOLE/SEEK_DATA and forcing a txg sync because it notices the file is dirty, nevermind in what way.
After looking at https://github.com/openzfs/zfs/pull/12724/commits/bb8526e5c09218a27280df2adf90d7bd3dcf844c, I think it's exactly as you describe!
I have no clue how this can be resolved in a sane way.
Is there a way to check whether dirtying the file was "just" an atime update (which is not worth forcing a txg sync)?
Reading through https://man7.org/linux/man-pages/man2/open.2.html, I think O_NOATIME is not an option for open() in qemu-img.
for reference: https://github.com/openzfs/zfs/issues/4306
and
https://www.mail-archive.com/[email protected]/msg427170.html
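As an aside on the O_NOATIME point above: the flag is reachable from the shell via GNU dd's iflag=noatime, and it is subject to the open(2) restriction that the caller must own the file or have CAP_FOWNER. A rough way to see whether suppressing atime on reads changes the write activity on the source pool (the path is only an example):

# read the image with O_NOATIME, watching zpool iostat hddpool 1 in another terminal
dd if=/hddpool/images/100/vm-100-disk-0.qcow2 of=/dev/null bs=1M iflag=noatime
# and again with normal atime semantics for comparison
dd if=/hddpool/images/100/vm-100-disk-0.qcow2 of=/dev/null bs=1M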
9f6943504aec36f897f814fb7ae5987425436b11 is the commit in 2.1.x that changed our behavior. It was first in 2.1.5.
There's actually an open PR to optimize this, https://github.com/openzfs/zfs/pull/13368. There's just some outstanding feedback which needs to be addressed before it can be merged.
9f69435 is the commit in 2.1.x that changed our behavior. It was first in 2.1.5.
It goes back further and more wind-y than that, I think.
First, there was #11900, where you could get inconsistent hole data if you looked too fast, as I recall. Oopsie. So the logic was modified, though the tunable predates that, and we got #12724. But that had a flaw, so we got #12745, and #12746 because the argument was that if we effectively didn't report holes without that in most cases, the functionality was likely to bitrot and break strangely one day.
And now here we are, with not inconsistent data, but pathologically performing behavior.
I'm curious to see whether, in practice, #13368 will mean we can avoid this penalty in most cases or not.
Apparently the patch from Matthew Ahrens has been approved: https://github.com/openzfs/zfs/pull/13368#pullrequestreview-1339786787
Just for reference: https://github.com/openzfs/zfs/pull/14641
Apparently the patch from Matthew Ahrens has been approved: #13368 (review)
It has been merged. Is the problem still here?
Given that the fix (#13368) was reverted in https://github.com/openzfs/zfs/pull/14761, the question still seems relevant: is the degradation still there? Or did it go away with some other optimizations? Or is there another ticket that is going to address this problem?
Here is another issue with pathological slowness of qcow2 on top of a ZFS dataset, which apparently does not go away when disabling atime updates or dropping caches before running the performance test (I was wrong in my first post; it did not reliably fix that). Interestingly, using more modern parameters for creating the qcow2 seems to fix it.
https://bugzilla.proxmox.com/show_bug.cgi?id=7012
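The bug report above doesn't spell out the exact creation parameters, but purely as an illustration, recent qemu-img lets you choose cluster size, extended L2 entries (subcluster allocation) and metadata preallocation at create time; these particular options are assumptions, not necessarily the ones that made the difference there:

# illustrative only: create a metadata-preallocated qcow2 with 128k clusters and extended L2 entries
qemu-img create -f qcow2 -o cluster_size=128k,extended_l2=on,preallocation=metadata vm-disk.qcow2 32G
# inspect what was actually created
qemu-img info vm-disk.qcow2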
Apparently the patch from Matthew Ahrens has been approved: #13368 (review)
It has been merged. Is the problem still here?
Sorry, I missed your question @ryao. I have not seen the issue for a while. The reason may be that there was a similar problem with lseek in Samba, and that I haven't done many VM storage migrations lately.
Anyhow, I think my new finding is interesting, and I'm not sure we should only blame ZFS here.