Extremely low performance
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Debian |
| Distribution Version | 12.5 |
| Kernel Version | 6.1.0 |
| Architecture | amd64 |
| OpenZFS Version | 2.1.11-1 |
Describe the problem you're observing
I have three pools: one holding the root system (SSD, rpool), another holding the home directories (SSD, ssdpool), and a third holding data I rarely need (HDD, hddpool). Running the command dd if=/dev/zero of=/path/to/fs/tmp.bin bs=1G count=1 oflag=dsync gives 64 MB/s on hddpool, but only 6.7 MB/s on ssdpool. About 50% of the memory (~4 GB) is available, and the system is otherwise idle. I have compared the dataset properties in both pools, and the only difference is that on ssdpool I enabled encryption:
ssdpool/home encryption aes-128-gcm -
ssdpool/home keylocation file:///dev/shm/enc2key local
ssdpool/home keyformat passphrase -
ssdpool/home pbkdf2iters 350000 -
ssdpool/home encryptionroot ssdpool/home -
ssdpool/home keystatus available -
I would expect the SSD pool (on a SanDisk_SDSSDP064G) to be way faster than the HDD pool (on a Seagate_ST3500412AS), and I would expect some speed reduction because of the encryption... but not to the point of getting only 10% of the HDD's performance.
Is this a configuration error on my side, a problem with my hardware, or an issue with ZFS? How can I debug this?
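For reference, a minimal sketch of how the locally-set properties of the two datasets can be diffed side by side (hddpool/data below is a placeholder for the actual hddpool dataset; ssdpool/home is the dataset shown above):

```sh
# Dump only locally-set properties, drop the dataset-name column, and diff them.
zfs get -H -s local all hddpool/data | cut -f2- > /tmp/hdd.props   # placeholder dataset name
zfs get -H -s local all ssdpool/home | cut -f2- > /tmp/ssd.props
diff /tmp/hdd.props /tmp/ssd.props
```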
UPDATE: I have created an unencrypted dataset on the same ssdpool and I am getting the same results:
zfs create -o encryption=off -o mountpoint=/mnt/containers ssdpool/containers
dd if=/dev/zero of=/mnt/containers/tmp.bin bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 179.823 s, 6.0 MB/s
So... it seems that encryption is not the issue. What else could it be?
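A sketch that might help separate the cost of the per-write sync forced by oflag=dsync from raw write throughput (reusing the unencrypted dataset above; oflag=dsync syncs after every write, conv=fdatasync syncs once at the end):

```sh
# Sync after every 1 MiB write:
dd if=/dev/zero of=/mnt/containers/tmp.bin bs=1M count=1024 oflag=dsync
rm /mnt/containers/tmp.bin
# Single sync at the end of the whole 1 GiB:
dd if=/dev/zero of=/mnt/containers/tmp.bin bs=1M count=1024 conv=fdatasync
rm /mnt/containers/tmp.bin
```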
Describe how to reproduce the problem
N/A
Include any warning/errors/backtraces from the system logs
There are none
Did you set ashift when creating the pool? You can check the current value with zdb.
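For example (assuming the pools are imported and the default cache file is in use):

```sh
# The cached pool configuration includes the per-vdev ashift for every pool:
zdb | grep ashift
# Or for a single pool:
zdb -C ssdpool | grep ashift
```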
@ErrorNoInternet my apologies: after discussing the matter with the folks in #openzfs on Libera, it seems that the problem is the disks themselves. I recently changed the motherboard and processor (the previous ones were 10 years old) without changing the disks... and I guess those have now become the bottleneck.
I mean, we didn't say that.
We suggested you test them with a non-ZFS filesystem on a single disk, or e.g. MD RAID, and see if that performs well, since public benchmarks of these disks show around 10-20 MB/s with random writes or mixed r/w, which suggests this might not be a ZFS-specific quandary. We also suggested it might be the SATA controller, since ASMedia has a less-than-sterling history with those, and that you could investigate further based on what you found from trying that.
If it used to run much faster with ZFS before your CPU/MB swap, then maybe it's not the disks.
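A rough sketch of that kind of non-ZFS baseline (destructive for whatever is on the scratch partition; /dev/sdc1 and the mount point are placeholders):

```sh
# Create a throwaway ext4 filesystem on a scratch partition and run a mixed
# random read/write fio job against it, for comparison with the ZFS numbers.
mkfs.ext4 /dev/sdc1              # WARNING: destroys existing data on /dev/sdc1
mkdir -p /mnt/scratch
mount /dev/sdc1 /mnt/scratch
fio --name=mixed --directory=/mnt/scratch --size=1G --bs=4k --rw=randrw \
    --rwmixread=70 --ioengine=libaio --iodepth=8 --direct=1 \
    --time_based --runtime=60s --group_reporting=1
umount /mnt/scratch
```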
@rincebrain Thank you for your answer! I have been posing some questions (I still am) to the fio mailing list, and these are the results I get when performing a read test on the block device:
fio --name=seqread --numjobs=1 --time_based --runtime=60s --ramp_time=2s --iodepth=8 --ioengine=libaio --direct=1 --verify=0 --group_reporting=1 --bs=1M --rw=read --size=1G --filename=/dev/sdc
seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=8
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=522MiB/s][r=521 IOPS][eta 00m:00s]
seqread: (groupid=0, jobs=1): err= 0: pid=1358233: Sat Apr 6 09:15:42 2024
read: IOPS=517, BW=518MiB/s (543MB/s)(30.3GiB/60015msec)
slat (usec): min=18, max=621, avg=60.01, stdev=28.85
clat (msec): min=10, max=1964, avg=15.39, stdev=19.25
lat (msec): min=10, max=1964, avg=15.45, stdev=19.25
clat percentiles (msec):
| 1.00th=[ 14], 5.00th=[ 14], 10.00th=[ 16], 20.00th=[ 16],
| 30.00th=[ 16], 40.00th=[ 16], 50.00th=[ 16], 60.00th=[ 16],
| 70.00th=[ 16], 80.00th=[ 16], 90.00th=[ 16], 95.00th=[ 16],
| 99.00th=[ 19], 99.50th=[ 25], 99.90th=[ 41], 99.95th=[ 45],
| 99.99th=[ 112]
bw ( KiB/s): min=475136, max=537651, per=100.00%, avg=530505.87, stdev=9794.61, samples=120
iops : min= 464, max= 525, avg=518.02, stdev= 9.66, samples=120
lat (msec) : 20=99.18%, 50=0.81%, 250=0.03%, 2000=0.01%
cpu : usr=0.40%, sys=2.38%, ctx=62144, majf=0, minf=37
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=31061,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
READ: bw=518MiB/s (543MB/s), 518MiB/s-518MiB/s (543MB/s-543MB/s), io=30.3GiB (32.6GB), run=60015-60015msec
Disk stats (read/write):
sdc: ios=32813/233, merge=1/0, ticks=890078/88344, in_queue=978658, util=99.91%
This shows results in line with other benchmarks reported for the same device. If I perform the test going through ZFS, this is what I get (note that I have removed the --direct=1 flag, as I was told keeping it would negatively impact the ZFS benchmark):
dd if=/dev/zero of=/var/cache/test bs=1M count=1024
echo 3 > /proc/sys/vm/drop_caches
fio --name=seqread --numjobs=1 --time_based --runtime=60s --ramp_time=2s --iodepth=8 --ioengine=libaio --verify=0 --group_reporting=1 --bs=1M --rw=read --size=1G --filename=/var/cache/test
seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=8
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][22.0%][eta 03m:54s]
seqread: (groupid=0, jobs=1): err= 0: pid=1383694: Sat Apr 6 09:36:01 2024
read: IOPS=3, BW=3684KiB/s (3772kB/s)(222MiB/61709msec)
slat (msec): min=8, max=4051, avg=299.97, stdev=781.41
clat (usec): min=21, max=12866k, avg=2047602.83, stdev=3001429.46
lat (msec): min=115, max=12946, avg=2343.20, stdev=3192.06
clat percentiles (msec):
| 1.00th=[ 96], 5.00th=[ 118], 10.00th=[ 132], 20.00th=[ 148],
| 30.00th=[ 174], 40.00th=[ 190], 50.00th=[ 211], 60.00th=[ 305],
| 70.00th=[ 1720], 80.00th=[ 5201], 90.00th=[ 6745], 95.00th=[ 8020],
| 99.00th=[10537], 99.50th=[11610], 99.90th=[12818], 99.95th=[12818],
| 99.99th=[12818]
bw ( KiB/s): min= 2048, max=41042, per=100.00%, avg=11539.68, stdev=13612.78, samples=38
iops : min= 2, max= 40, avg=11.26, stdev=13.29, samples=38
lat (usec) : 50=0.47%
lat (msec) : 100=0.93%, 250=57.48%, 500=7.94%, 750=0.93%, 1000=0.93%
lat (msec) : 2000=3.74%, >=2000=30.84%
cpu : usr=0.00%, sys=0.41%, ctx=710, majf=0, minf=37
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.5%, 8=0.5%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=214,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
READ: bw=3684KiB/s (3772kB/s), 3684KiB/s-3684KiB/s (3772kB/s-3772kB/s), io=222MiB (233MB), run=61709-61709msec
Considering the hardware and the kernel are the same in both cases, the only difference between the 540 MB/s and the 4 MB/s is ZFS. It might not be a bug but a configuration mistake... but I have been circling around this for days now and am not finding a way out. Is there anything you can think of that I could check?
@ErrorNoInternet Yes, I have set ashift to 12 in all the pools of that system.
Your disks are known for having problems with mixed read/write workloads, and atime defaults to on, so every read is causing a write.
Try turning that off.
You could also compare iostat (not zpool iostat) output while running the two tests, before any changes, to confirm that a nonzero number of writes is happening as a result, and whether that's the case for the non-ZFS test as well.
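For example (a sketch; sdc is assumed to be the disk backing the pool under test, as in the raw fio run above):

```sh
# While re-running the read test, before changing anything, watch per-device
# activity; a nonzero w/s column during a pure read test means something is
# writing to the disk at the same time.
iostat -x sdc 1

# Then turn atime off on the dataset holding the test file and re-test:
zfs set atime=off rpool/var/cache
```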
@rincebrain As soon as I get home I will confirm this, but I think I have switched atime off. Even in that case, for this test in which I am reading a single 1 GB file... would atime have such a big impact? Since it is a single file, this would introduce only one write, right?
EDIT: indeed, I am mounting these datasets as noatime:
root@kngnt:~# mount | grep cache
rpool/var/cache on /var/cache type zfs (rw,nosuid,nodev,noexec,noatime,xattr,posixacl)
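It may be worth confirming this at the property level as well, not only via the mount flags:

```sh
# Check the atime/relatime properties and where their values come from.
zfs get atime,relatime rpool/var/cache
```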
For reference, these are the feature flags I see on that rpool:
rpool feature@empty_bpobj active local
rpool feature@lz4_compress active local
rpool feature@multi_vdev_crash_dump enabled local
rpool feature@spacemap_histogram active local
rpool feature@enabled_txg active local
rpool feature@hole_birth active local
rpool feature@extensible_dataset active local
rpool feature@embedded_data active local
rpool feature@bookmarks enabled local
rpool feature@filesystem_limits enabled local
rpool feature@large_blocks enabled local
rpool feature@large_dnode enabled local
rpool feature@sha512 enabled local
rpool feature@skein enabled local
rpool feature@edonr enabled local
rpool feature@userobj_accounting active local
rpool feature@encryption enabled local
rpool feature@project_quota active local
rpool feature@device_removal enabled local
rpool feature@obsolete_counts enabled local
rpool feature@zpool_checkpoint enabled local
rpool feature@spacemap_v2 active local
rpool feature@allocation_classes enabled local
rpool feature@resilver_defer enabled local
rpool feature@bookmark_v2 enabled local
rpool feature@redaction_bookmarks enabled local
rpool feature@redacted_datasets enabled local
rpool feature@bookmark_written enabled local
rpool feature@log_spacemap active local
rpool feature@livelist active local
rpool feature@device_rebuild enabled local
rpool feature@zstd_compress enabled local
rpool feature@draid enabled local
Could zstd_compress and/or lz4_compress be the culprits of this poor performance? I have checked, and on the dataset in that rpool where I am running the tests, "compression" shows as "disabled".
No.
I would suggest you do what I said and look at what the IO on the disk looks like while you're measuring this, as I suspect you will find there are writes going on. As I said, mixing reads and writes can produce very different results from pure reads or pure writes, which is why it's a common benchmark pattern.
You are right: I see systemd-journald writing to that drive (it has written about 2 MB in the last 10 minutes). The published 4K random-access mixed r/w benchmarks for that drive are indeed ~5-13 MB/s. This means that during the read test, which lasts 1 minute, roughly 50 4K blocks were written (2 MB over 10 minutes is ~0.2 MB per minute), while ~262k 4K blocks were read.
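(For reference, the writes can be watched per process while the test runs; pidstat is part of sysstat, and iotop is a separate package:)

```sh
# Per-process disk read/write rates, refreshed every second.
pidstat -d 1
# Or interactively, showing only processes that are actually doing I/O:
iotop -o
```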
I have done another test: I have created a 3 GB file in that same pool, as well as another 3 GB file in the pool that has two striped SSDs. Then I have run a read test over both files with these parameters: fio --name=seqread --numjobs=1 --time_based --runtime=60s --ramp_time=2s --iodepth=8 --ioengine=libaio --verify=0 --group_reporting=1 --bs=1M --rw=read --size=3G
and the results I get are: pool with 1 SSD: 154 MB/s; pool with 2 SSDs: 30.5 MB/s.
I guess that the pool with 2 SSDs is showing this low performance because on that one I have some Docker containers dumping data (a few kB every 5-10 seconds), which triggers the behavior you explained before.
Then... the conclusion is that I have shitty SSDs that I should use only for backups, and that I should have them all replaced?
Update: this is getting weirder by the second: I have updated ZFS from 2.1.11 to 2.2.3, rebooted, and re-run the read test: pool with 1 SSD: 300 MB/s; pool with 2 SSDs: 600 MB/s.
Changing read to write in the test, the results remain consistent with previous measurements: pool with 1 SSD: 3.9 MB/s; pool with 2 SSDs: 4.6 MB/s.
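For completeness, the write variant of the test would look roughly like this (same parameters as the read job, with --rw switched; the exact command is assumed):

```sh
fio --name=seqwrite --numjobs=1 --time_based --runtime=60s --ramp_time=2s \
    --iodepth=8 --ioengine=libaio --verify=0 --group_reporting=1 \
    --bs=1M --rw=write --size=3G
```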
That would likely be my suggestion, yes.