zfs ZFS seems to wear out disk - tremedously poor performance and high 4k I/O mostly on two (of eight) disks

System information

Type	Version/Name
Distribution Name	linux openSuSE
Distribution Version	Leap 15.1
Kernel Version	4.12.14-lp151.28.91-default #1 SM
Architecture	amd64
OpenZFS Version	0.8.5-1

Describe the problem you're observing

I'm seeing absolutely horrendous performance (often below 100 KB/sec) on an encrypted raid-z3 (8 partitions/8 disks) with dedup. (Yes, I've read warnings on dedup and yes, I've probably chosen what is not recommended, but this seems quite several magnitudes worse than it probably should be).

Obviously VM runs very poorly but also file access (via SAMBA) starts to degrade extremely fast - writing a single small file of about 4 MB is ok, writing several brings performance down. A local rsync yielded 2,4 MB/sec on 180 GB of data.

Analysis seems to indicate a "4k access problem": Pool was created with ashift=12 and recordsize is 128k yet there are extremely high numbers of 4k accesses which cluster on only two disks and thus very often max them out. This is somewhat confirmed by

# zpool iostat -vr tank_01 2 | egrep "sync_read|req|^4K|^--" | awk '{ if ( $1 ~ "wwn-.*part1" ) { print gensub("wwn-.*part1", "wwn-disk-" ++nr "-part1", "g", $0) } else { print $0 } }'   
tank_01                             sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
4K                                 271M      0  7.12M      0  57.2M      0  1.15G      0    439      0      0      0
--------------------------------------------------------------------------------------------------------------------
raidz3                              sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
4K                                 271M      0  7.03M      0  57.2M      0  1.15G      0    439      0      0      0
--------------------------------------------------------------------------------------------------------------------
wwn-disk-1-part1                    sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
4K                                5.27M      0   900K      0   283K      0   139M      0     30      0      0      0
--------------------------------------------------------------------------------------------------------------------
wwn-disk-2-part1                    sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
4K                                4.87M      0   900K      0   185K      0   135M      0     18      0      0      0
--------------------------------------------------------------------------------------------------------------------
wwn-disk-3-part1                    sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
4K                                6.01M      0   900K      0  19.0M      0   153M      0     60      0      0      0
--------------------------------------------------------------------------------------------------------------------
wwn-disk-4-part1                    sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
4K                                 119M      0   900K      0  9.10M      0   140M      0    168      0      0      0
--------------------------------------------------------------------------------------------------------------------
wwn-disk-5-part1                    sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
4K                                5.21M      0   900K      0   373K      0   160M      0     15      0      0      0
--------------------------------------------------------------------------------------------------------------------
wwn-disk-6-part1                    sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
4K                                4.78M      0   900K      0   167K      0   157M      0     49      0      0      0
--------------------------------------------------------------------------------------------------------------------
wwn-disk-7-part1                    sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
4K                                6.91M      0   900K      0  19.0M      0   153M      0     70      0      0      0
--------------------------------------------------------------------------------------------------------------------
wwn-disk-8-part1                    sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
4K                                 120M      0   900K      0  9.14M      0   140M      0     29      0      0      0
--------------------------------------------------------------------------------------------------------------------

The output seems to confirm an extraordinary number of ~ 120M 4k reads on disks 4 and 8 vs. about 6 M reads on other drives. It also shows a majority of reads is 4k:

# zpool iostat -vr tank_01

tank_01                             sync_read    sync_write    async_read    async_write      scrub         trim    
req_size                            ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
--------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                   0      0      0      0      0      0      0      0      0      0      0      0
1K                                    0      0      0      0      0      0      0      0      0      0      0      0
2K                                    0      0      0      0      0      0      0      0      0      0      0      0
4K                                 272M      0  7.12M      0  57.2M      0  1.15G      0    439      0      0      0
8K                                19.0M   142K  3.34K      0  91.4M  1.50M  91.8M   318M      1    105      0      0
16K                               8.59M   179K  1.70M    971  52.3M  5.90M  49.9M   179M      1     85      0      0
32K                                   0   128K  30.6K  2.31K      0  7.70M      0   122M      0    102      0      0
64K                                  40  29.1K  47.5K  1.96K      5  6.28M      0  52.5M      0     69      0      0
128K                                 15  15.7K  1.15M  4.14K      0  4.35M  17.1M  11.1M      0     15      0      0
256K                                  0  4.83K      0  3.72K      0  2.10M      0  2.05M      0      0      0      0
512K                                  0  3.88K      0  3.30K      0  1.46M      0   403K      0      0      0      0
1M                                    0  1.01K      0      0      0   191K      0  33.3K      0      0      0      0
2M                                    0      0      0      0      0      0      0      0      0      0      0      0
4M                                    0      0      0      0      0      0      0      0      0      0      0      0
8M                                    0      0      0      0      0      0      0      0      0      0      0      0
16M                                   0      0      0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------------------------------------------

Describe how to reproduce the problem

The system had 16 GB RAM which have been increased to 48 GB (with no change what so ever). CPU was a Intel i3-8100 and is now i3-9100. It otherwise only runs one (kvm) VM (image is a (raw) file on tank_01).

As it has been about a year of try and error and also several zfs versions (encryption optimizations!) exact history is somewhat difficult to guarantee. I started out creating the pool along the line zpool create -o ashift=12 -o feature@encryption=enabled -O encryption=on -O keyformat=raw -O keylocation=file:///etc/zfs.key -m /media/zfs/tank_01 -o autoexpand=on -O dedup=sha256,verify -O checksum=sha256 -O recordsize=32k tank_01 raidz3 ...

Datasets have been created as clones to a "master" dataset (because of otherwise unusable dedup). Original dataset has been replaced with another (to change encryption to speedier aes-256-gcm and recordsize to 128k), but performance has not improved significantly.

Did add a mirror of two nvme as log device, but no significant change, either.

Include any warning/errors/backtraces from the system logs

Sep 06 '21 12:09 FredFirestone

Converting a 320 GB image via qemu-img took about 30h roughly an average of about 3 MB/sec.

/proc/spl/kstat/zfs/dbgmsg contains

1631126526   zio.c:1969:zio_deadman(): zio_wait waiting for hung I/O to pool 'tank_01'
1631126588   zio.c:1969:zio_deadman(): zio_wait waiting for hung I/O to pool 'tank_01'

Sep 08 '21 18:09 FredFirestone

May be related to #9130 as txg_sync can take long and will effectively stop all other io. Obviously when overwriting data I'm also hit by too many free tasks (#3976), but for me it seems a bit of a side issue. I may modify spa.c to limit this somewhat.

Sep 09 '21 09:09 FredFirestone

In another issue I found a convenient smart report which seems relevant:

ZPOOL_SCRIPTS_AS_ROOT=1 zpool iostat -c smart
                                    capacity     operations     bandwidth 
pool                              alloc   free   read  write   read  write  health  realloc  rep_ucor      cmd_to  temp  pend_sec  off_ucor  ata_err  nvme_err
--------------------------------  -----  -----  -----  -----  -----  -----  ------  -------  --------  ----------  ----  --------  --------  -------  --------
tank_01                           21.2T  33.1T      2     53  16.3K   617K
  raidz3                          21.2T  33.1T      2     53  16.3K   576K
    wwn-disk-1-part1                  -      -      0      7  1.03K  78.6K  PASSED        0         0           0    48         0         0        -         -
    wwn-disk-2-part1                  -      -      0      6    949  70.5K  PASSED        0         0           0    47         0         0        -         -
    wwn-disk-3-part1                  -      -      0      6    930  70.0K  PASSED        0         0           0    46         0         0        -         -
    wwn-disk-4-part1                  -      -      0      6  5.31K  69.7K  PASSED        0         0  4295032833    47         0         0        -         -
    wwn-disk-5-part1                  -      -      0      7  1.04K  78.6K  PASSED        0         0           0    48         0         0        -         -
    wwn-disk-6-part1                  -      -      0      6    971  69.9K  PASSED        0         0           0    48         0         0        -         -
    wwn-disk-7-part1                  -      -      0      6    954  69.5K  PASSED        0         0           0    48         0         0        -         -
    wwn-disk-8-part1                  -      -      0      6  5.24K  69.2K  PASSED        0         0  4295032833    48         0         0        -         -
logs                                  -      -      -      -      -      -
  mirror                           512K  15.5G      0      0      1  41.5K
    nvme0n1p5                         -      -      0      0      0  20.8K  PASSED        -         -           -    49         -         -        -         5
    nvme1n1p5                         -      -      0      0      0  20.8K  PASSED        -         -           -    45         -         -        -         2
--------------------------------  -----  -----  -----  -----  -----  -----  ------  -------  --------  ----------  ----  --------  --------  -------  --------

So it seems these two disks have an issue, whether as cause or effect I can't tell.

Sep 21 '21 20:09 FredFirestone

What made/model are those disks and are they all the same? (not the nvme). You can grab the same counters from smartctl.

Sep 27 '21 05:09 h1z1

Actually there are two different models (ST8.000NM0.105-1VS112, ST8.000VX0.022-2EJ112) involved, but disks 3 to 8 are the same (ST8.000VX0.022-2EJ112). In the list above the command timeout counter is extraordinary high just for disks 4 and 8. These are the very same disks which have about 30 times more 4k reads as the other disks. Now what I'd really like to get a grip on is why it had these massive amounts of 4k reads in the first place: Even if data is compressible it seems not likely to be compressible down to 4k. And it also seems unlikely deduplication kicking in and even if reads should again be larger than 4k. So what could be wrong? Could encryption cause these small reads? But why only on two disks? I just can't figure it out on my own, any help is most appreciated. Amendment: Obviously recordsized data still needs to be distributed over the disks (and parity generated). In my case it would be 128k / (8-3) disks ~ 26k per disk (and lower with compression) maybe even split in ashift (=4K) sized quantities if not aggregated within / beyond records.

Oct 01 '21 18:10 FredFirestone

Assuming the reported values from Smart are accurate I'd say you should start there. Confirm the state of those two drives or take them out of the pool entirely. You can try something like ioping against the device itself and a mounted path to compare at least if it's the drives but FWIW you shouldn't expect much performance from the Skyhawks given they are targeted at a very specific market - dvr. ie write heavy.

Oct 04 '21 01:10 h1z1

I've chosen SkyHawks because they are CMR, without helium, according to specs good for 7x24 and 180 TB/year. Even if they aren't top notch performance drives they should do better then 2,4 MB/sec writing.

I've exported the pool and copied all data from wwn-disk-4-part1 to a new disk. Copying went fine, well above 100 MB/s so sequential read is fine. I've removed wwn-disk-4-part1 and the pool accepted the "mirrored" disk without fuss. Now I have these mysterious reads during writing on the replacement disk. atop shows them as

...
MEM | tot    47.0G | free   42.7G  | cache 838.6M | buff    5.9M | slab  696.1M  | shmem  15.1M | vmbal   0.0M | hptot   0.0M  | hpuse   0.0M |
|
DSK |          sdg | busy     76%  | read     713 | write    269 | KiB/r      4  | KiB/w     28 | MBr/s    0.3 | MBw/s    0.7  | avio 7.78 ms |
DSK |          sdh | busy     75%  | read     705 | write    269 | KiB/r      4  | KiB/w     24 | MBr/s    0.3 | MBw/s    0.6  | avio 7.66 ms |
DSK |          sdb | busy     13%  | read       0 | write    269 | KiB/r      0  | KiB/w     24 | MBr/s    0.0 | MBw/s    0.6  | avio 4.67 ms |
DSK |          sdc | busy     12%  | read       0 | write    269 | KiB/r      0  | KiB/w     24 | MBr/s    0.0 | MBw/s    0.6  | avio 4.46 ms |
DSK |          sda | busy     12%  | read       0 | write    269 | KiB/r      0  | KiB/w     28 | MBr/s    0.0 | MBw/s    0.7  | avio 4.33 ms |
DSK |          sde | busy     12%  | read       0 | write    269 | KiB/r      0  | KiB/w     28 | MBr/s    0.0 | MBw/s    0.7  | avio 4.30 ms |
DSK |          sdf | busy     12%  | read       0 | write    269 | KiB/r      0  | KiB/w     28 | MBr/s    0.0 | MBw/s    0.7  | avio 4.28 ms |
DSK |          sdd | busy     11%  | read       0 | write    268 | KiB/r      0  | KiB/w     28 | MBr/s    0.0 | MBw/s    0.7  | avio 4.04 ms |

Disks generally don't do well on 4k (non sequential) reads, so I'm probably back to square one as I'd guess it's not the disks.

Jan 17 '22 00:01 FredFirestone

Issue #2885 also seems related: Larger amounts of data bring write performance down, which is what happens in my case due to the small reads, too.

I just found https://openzfs.org/w/images/8/8d/ZFS_dedup.pdf from Matt Ahrens. The mentioned git https://github.com/ahrens/illumos/tree/dedup seems older, but I'm not sure his suggested improvements made it to openzfs yet.

According to the slides every write generates to roughly 5 writes and may generate random reads if DDT got evicted. As the issue is much more pronounced with bigger data maybe the ddt gets evicted in my case thus leading to 4k random reads. That would at least be a plausible theory, even if it does not explain why certain disks seem to get more of these small reads.

Jan 22 '22 09:01 FredFirestone

I just deleted several GB and got again almost continuous "4k" reads on two disks: grafik

So my theory is

DDT got for some reason clustered on these two disks
DDT needs many small IO's
DDT is not in memory (currently arcstats displays arc_meta_used as ~ 5 GB (5432518456), DDT should be around 24 GB) and consequently DDT accesses lead to high numbers of small (i.e. 4K) IO on these disks.

Still it seems strange that only two disks should contain most all of DDT ...

Jan 31 '22 20:01 FredFirestone

While looking further on information about DDT I stumbled on special vdevs (for dedup) which I weren't aware of.

Also I found https://www.truenas.com/community/goto/post?id=604334 for heavy 4k reads (and hint to special vdev, too).

I didn't want to try adding a special vdev (may not be able to remove it and it won't help for already allocated DDT entries anyway) on the fly, so instead I'll try a 64 GB SSD l2arc with secondarycache=metadata as an "approximation" for testing.

Feb 01 '22 01:02 FredFirestone

Even with l2arc copying a larger amount of data is still slow, I'd guess around 10-20 MB/s (considerably higher as the initial 2,4 MB/s).

So contrary to #7971 it seems l2arc does help. I'm seeing mostly 60% and up to 75% l2arc hit rate (dstat --zfs-l2arc) when doing a cp copy (of course after loading l2arc). The two disks under surveillance do less reads (still ~ 100 fold more than the other disks, though, but that would coincide with "verify" of dedup-blocks) but end up being busy at about 80% (where the other disks are around 40%). I take it as a confirmation of the high reads beeing metadatat (probably ddt) related.

A quick glance into vdev_raidz.c didn't get me very far. It does seem however as if the case of metadata ending up always on same disks had been considered and partly taken into account, but seemingly only for raidz. It also seems as if parity were on the first "colums". With raidz3 this could be disks 1-3 and data on disk 4 and if repeated parity on 5-7 and data on 8. At least that would match the readings. But it might be entirely different as zfs is not trivial to understand and I might be looking wrongly in the first place.

It seems as if it would have helped tremedously if Ahrends idea of a log structured ddt were to come alive.

Feb 08 '22 21:02 FredFirestone

i can't believe you've been pushing on this poorly-built system since September. why can't you rebuild it correctly without dedup? raidz3 is also extremely slow. i don't see anything that can be done here for you.

Feb 09 '22 14:02 bghira

Well, I consider dedup a much neglected zfs feature and understand some may view its use as incorrect per se, which I would not agree too. I'm also not too fond of "it's complicated, don't use it".

Many times I read about memory implications of dedup and sometimes a recommendation to ssd's. Only little by little did I unravel more of the performance implications (like "dedup performance sucks" from Ahrends). I would say I've learned something along the way that is valuable to me and may be of help to others.

What is still left open is exactly what makes zfs use two disks significantly more than the other six. So far my theory is, this is DDT needing to be checked and updated for every (even non-deduplicatable) write and seeming to reside on mostly two disks. But why zfs would almost always select the same disks for DDT is not exactly clear to me. And I'd think this to be a strong enough issue to be called a bug.

That's why I opened the bug report and I'd really appreciate any help in getting a better understanding of the cause and would be really glad if a solution could be found - be it a variation in disk selection or a fresh stab at Ahrends suggestions.

Feb 13 '22 15:02 FredFirestone

As of Feb 2022 vdev_raidz_map_alloc in https://github.com/openzfs/zfs/blob/master/module/zfs/vdev_raidz.c states "Divides the IO evenly across all child vdevs" and " We therefore switch the parity every 1MB." being only implemented for raidz. Digging in history it seems zfs started with raidz and only later added raidz2 (https://github.com/illumos/illumos-gate/commit/99653d4ee642c6528e88224f12409a5f23060994#diff-5b2b3aae2d3cb50f4ea623f5773095d55fa1b12d140e65e9526527943a97d30e). The code to avoid "Hot parity disks" seems to have been deemed unnecessary and thus not been extended to raidz2 (nor later to raidz3). Probably because of backward compatibility with the on disk format for raidz it was left in place for raidz.

Currently rr->rr_firstdatacol is set to nparity and the switching takes place via if (rr->rr_firstdatacol == 1 && (zio->io_offset & (1ULL 20))).

Generalization beyond raidz could be done via selection of parity to be switched with data via (zio->io_offset >> 20) % (nparity+1). However this would change on-disk format. Therefore it would probably also require a new feature flag / pool version.

Not sure what to make of it: Fixing it would require changeing on-disk format which is at least difficult but probably close to impossible. But could be a corner case depending on number of disks (and dedup) and may be much better avoided by pursuing a log structured DDT as Ahrends suggested.

Sep 19 '22 22:09 FredFirestone

Neither of RAIDZ variants are good for 4KB writes, especially if you have ashift=12, independently of "Hot parity disks" "feature". You could consider dedicated special vdev for dedup data. Or for much better results you should have enough RAM to cache the whole dedup table at once in ARC and so avoid reading it from disk and so caring about what disks are parity and what not.

Sep 20 '22 16:09 amotin

The issue at hand is not a problem of small writes per se, like from a database. It rather is a problem of writes beeing transformed in a plethora of 4k IO which mostly hammer always the same two disks. This last element - not evenly spread load - was the very reason to create this issue.

My analysis in this issue led me to belive, it is caused by DDT-Entries read: They easily fit into one "ashift=12" sector forcing zfs to use three more for parity for a total of 4. This divides evenly for my setup of 8 disks resulting in high (read) IO to only two disks. Had I had 9 disks the issue may well have passed me by.

A dedicated vdev for dedup data certainly could be an option. I haven't known about it originally (see my comment from February 1st https://github.com/openzfs/zfs/issues/12538#issuecomment-1026384334) but it won't help for existing dedup entries and the dedicated vdev may never be removed again. To get it in a comparable redundancy as raidz3 wouldn't be simple in my build (e.g. lack of bays and ports), too.

It never seems as if the DDT (~ 25 GB) would stay in RAM nor in l2arc. It seems too sparsely used to not be evicted, but I may just have too little memory assigned. It would help if there were a way to load DDT explicitly. But then again, it would be even better if it were reduced in size (half seems well possible) and converted to a log structured DDT (https://openzfs.org/w/images/8/8d/ZFS_dedup.pdf).

Nov 07 '22 05:11 FredFirestone

I have some problem... NetData provides such an output...

Mar 22 '23 15:03 truefriend-cz