Extremely low performance
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Debian |
| Distribution Version | 12.5 |
| Kernel Version | 6.1.0 |
| Architecture | amd64 |
| OpenZFS Version | 2.1.11-1 |
Describe the problem you're observing
I have three pools: one holding the root system (SSD, rpool), another holding the home directories (SSD, ssdpool), and a third holding data I rarely need (HDD, hddpool). Running the command dd if=/dev/zero of=/path/to/fs/tmp.bin bs=1G count=1 oflag=dsync gives 64 MB/s on hddpool, but only 6.7 MB/s on ssdpool. About 50% of the memory (~4 GB) is available, and the system is otherwise idle. I have compared the dataset properties in both pools, and the only difference is that on ssdpool I enabled encryption:
ssdpool/home encryption aes-128-gcm -
ssdpool/home keylocation file:///dev/shm/enc2key local
ssdpool/home keyformat passphrase -
ssdpool/home pbkdf2iters 350000 -
ssdpool/home encryptionroot ssdpool/home -
ssdpool/home keystatus available -
I would expect the SSD pool (on a SanDisk_SDSSDP064G) to be way faster than the HDD pool (on a Seagate_ST3500412AS), and I would expect some speed reduction because of the encryption... but not to the point of getting only 10% of the HDD's performance.
Is this a configuration error on my side, a problem with my hardware, or an issue with ZFS? How can I debug this?
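For reference, a minimal sketch of how the locally-set properties of the two datasets can be diffed side by side (hddpool/data below is a placeholder for the actual hddpool dataset; ssdpool/home is the dataset shown above):

```sh
# Dump only locally-set properties, drop the dataset-name column, and diff them.
zfs get -H -s local all hddpool/data | cut -f2- > /tmp/hdd.props   # placeholder dataset name
zfs get -H -s local all ssdpool/home | cut -f2- > /tmp/ssd.props
diff /tmp/hdd.props /tmp/ssd.props
```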
UPDATE: I have created an unencrypted dataset on the same ssdpool and I am getting the same results:
zfs create -o encryption=off -o mountpoint=/mnt/containers ssdpool/containers
dd if=/dev/zero of=/mnt/containers/tmp.bin bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 179.823 s, 6.0 MB/s
So... it seems that encryption is not the issue. What else could it be?
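A sketch that might help separate the cost of the per-write sync forced by oflag=dsync from raw write throughput (reusing the unencrypted dataset above; oflag=dsync syncs after every write, conv=fdatasync syncs once at the end):

```sh
# Sync after every 1 MiB write:
dd if=/dev/zero of=/mnt/containers/tmp.bin bs=1M count=1024 oflag=dsync
rm /mnt/containers/tmp.bin
# Single sync at the end of the whole 1 GiB:
dd if=/dev/zero of=/mnt/containers/tmp.bin bs=1M count=1024 conv=fdatasync
rm /mnt/containers/tmp.bin
```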
Describe how to reproduce the problem
N/A
Include any warning/errors/backtraces from the system logs
There are none
Did you set ashift when creating the pool? You can check the current value with zdb.
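For example (assuming the pools are imported and the default cache file is in use):

```sh
# The cached pool configuration includes the per-vdev ashift for every pool:
zdb | grep ashift
# Or for a single pool:
zdb -C ssdpool | grep ashift
```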
@ErrorNoInternet my apologies: after discussing the matter with the folks in #openzfs on Libera, it seems that the problem is the disks themselves. I recently changed the motherboard and processor (the previous ones were 10 years old) without changing the disks... and I guess those have now become the bottleneck.
I mean, we didn't say that.
We suggested you test them with a non-ZFS filesystem on a single disk, or e.g. MD RAID, and see if that performs well, since public benchmarks of these disks show around 10-20 MB/s with random writes or mixed r/w, which suggests this might not be a ZFS-specific quandary. We also suggested it might be the SATA controller, since ASMedia has a less-than-sterling history with those, and that you could investigate further based on what you found from trying that.
If it used to run much faster with ZFS before your CPU/MB swap, then maybe it's not the disks.
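A rough sketch of that kind of non-ZFS baseline (destructive for whatever is on the scratch partition; /dev/sdc1 and the mount point are placeholders):

```sh
# Create a throwaway ext4 filesystem on a scratch partition and run a mixed
# random read/write fio job against it, for comparison with the ZFS numbers.
mkfs.ext4 /dev/sdc1              # WARNING: destroys existing data on /dev/sdc1
mkdir -p /mnt/scratch
mount /dev/sdc1 /mnt/scratch
fio --name=mixed --directory=/mnt/scratch --size=1G --bs=4k --rw=randrw \
    --rwmixread=70 --ioengine=libaio --iodepth=8 --direct=1 \
    --time_based --runtime=60s --group_reporting=1
umount /mnt/scratch
```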
@rincebrain Thank you for your answer! I have been posing some questions (I still am) to the fio mailing list, and these are the results I get when performing a read test on the block device:
fio --name=seqread --numjobs=1 --time_based --runtime=60s --ramp_time=2s --iodepth=8 --ioengine=libaio --direct=1 --verify=0 --group_reporting=1 --bs=1M --rw=read --size=1G --filename=/dev/sdc
seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=8
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=522MiB/s][r=521 IOPS][eta 00m:00s]
seqread: (groupid=0, jobs=1): err= 0: pid=1358233: Sat Apr 6 09:15:42 2024
read: IOPS=517, BW=518MiB/s (543MB/s)(30.3GiB/60015msec)
slat (usec): min=18, max=621, avg=60.01, stdev=28.85
clat (msec): min=10, max=1964, avg=15.39, stdev=19.25
lat (msec): min=10, max=1964, avg=15.45, stdev=19.25
clat percentiles (msec):
| 1.00th=[ 14], 5.00th=[ 14], 10.00th=[ 16], 20.00th=[ 16],
| 30.00th=[ 16], 40.00th=[ 16], 50.00th=[ 16], 60.00th=[ 16],
| 70.00th=[ 16], 80.00th=[ 16], 90.00th=[ 16], 95.00th=[ 16],
| 99.00th=[ 19], 99.50th=[ 25], 99.90th=[ 41], 99.95th=[ 45],
| 99.99th=[ 112]
bw ( KiB/s): min=475136, max=537651, per=100.00%, avg=530505.87, stdev=9794.61, samples=120
iops : min= 464, max= 525, avg=518.02, stdev= 9.66, samples=120
lat (msec) : 20=99.18%, 50=0.81%, 250=0.03%, 2000=0.01%
cpu : usr=0.40%, sys=2.38%, ctx=62144, majf=0, minf=37
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=31061,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
READ: bw=518MiB/s (543MB/s), 518MiB/s-518MiB/s (543MB/s-543MB/s), io=30.3GiB (32.6GB), run=60015-60015msec
Disk stats (read/write):
sdc: ios=32813/233, merge=1/0, ticks=890078/88344, in_queue=978658, util=99.91%
This shows results in line with other benchmarks reported for the same device. If I perform the test going through ZFS, this is what I get (note that I have removed the --direct=1 flag, as I was told keeping it would negatively impact the ZFS benchmark):
dd if=/dev/zero of=/var/cache/test bs=1M count=1024
echo 3 > /proc/sys/vm/drop_caches
fio --name=seqread --numjobs=1 --time_based --runtime=60s --ramp_time=2s --iodepth=8 --ioengine=libaio --verify=0 --group_reporting=1 --bs=1M --rw=read --size=1G --filename=/var/cache/test
seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=8
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][22.0%][eta 03m:54s]
seqread: (groupid=0, jobs=1): err= 0: pid=1383694: Sat Apr 6 09:36:01 2024
read: IOPS=3, BW=3684KiB/s (3772kB/s)(222MiB/61709msec)
slat (msec): min=8, max=4051, avg=299.97, stdev=781.41
clat (usec): min=21, max=12866k, avg=2047602.83, stdev=3001429.46
lat (msec): min=115, max=12946, avg=2343.20, stdev=3192.06
clat percentiles (msec):
| 1.00th=[ 96], 5.00th=[ 118], 10.00th=[ 132], 20.00th=[ 148],
| 30.00th=[ 174], 40.00th=[ 190], 50.00th=[ 211], 60.00th=[ 305],
| 70.00th=[ 1720], 80.00th=[ 5201], 90.00th=[ 6745], 95.00th=[ 8020],
| 99.00th=[10537], 99.50th=[11610], 99.90th=[12818], 99.95th=[12818],
| 99.99th=[12818]
bw ( KiB/s): min= 2048, max=41042, per=100.00%, avg=11539.68, stdev=13612.78, samples=38
iops : min= 2, max= 40, avg=11.26, stdev=13.29, samples=38
lat (usec) : 50=0.47%
lat (msec) : 100=0.93%, 250=57.48%, 500=7.94%, 750=0.93%, 1000=0.93%
lat (msec) : 2000=3.74%, >=2000=30.84%
cpu : usr=0.00%, sys=0.41%, ctx=710, majf=0, minf=37
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.5%, 8=0.5%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=214,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
READ: bw=3684KiB/s (3772kB/s), 3684KiB/s-3684KiB/s (3772kB/s-3772kB/s), io=222MiB (233MB), run=61709-61709msec
Considering the hardware and the kernel are the same in both cases, the only difference between the 540 MB/s and the 4 MB/s is ZFS. It might not be a bug but a configuration mistake... but I have been circling around this for days now and am not finding a way out. Is there anything you can think of that I could check?
@ErrorNoInternet Yes, I have set ashift to 12 in all the pools of that system.
Your disks are known for having problems with mixed read/write workloads, and atime defaults to on, so every read is causing a write.
Try turning that off.
You could also compare iostat (not zpool iostat) output while running the two tests, before any changes, to confirm that a nonzero number of writes is happening as a result, and whether that's the case for the non-ZFS test as well.
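For example (a sketch; sdc is assumed to be the disk backing the pool under test, as in the raw fio run above):

```sh
# While re-running the read test, before changing anything, watch per-device
# activity; a nonzero w/s column during a pure read test means something is
# writing to the disk at the same time.
iostat -x sdc 1

# Then turn atime off on the dataset holding the test file and re-test:
zfs set atime=off rpool/var/cache
```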
@rincebrain As soon as I get home I will confirm this, but I think I have switched atime off. Even in that case, for this test in which I am reading a single 1 GB file... would atime have such a big impact? Since it is a single file, this would introduce only one write, right?
EDIT: indeed, I am mounting these datasets as noatime:
root@kngnt:~# mount | grep cache
rpool/var/cache on /var/cache type zfs (rw,nosuid,nodev,noexec,noatime,xattr,posixacl)
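It may be worth confirming this at the property level as well, not only via the mount flags:

```sh
# Check the atime/relatime properties and where their values come from.
zfs get atime,relatime rpool/var/cache
```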
For reference, these are the feature flags I see on that rpool:
rpool feature@empty_bpobj active local
rpool feature@lz4_compress active local
rpool feature@multi_vdev_crash_dump enabled local
rpool feature@spacemap_histogram active local
rpool feature@enabled_txg active local
rpool feature@hole_birth active local
rpool feature@extensible_dataset active local
rpool feature@embedded_data active local
rpool feature@bookmarks enabled local
rpool feature@filesystem_limits enabled local
rpool feature@large_blocks enabled local
rpool feature@large_dnode enabled local
rpool feature@sha512 enabled local
rpool feature@skein enabled local
rpool feature@edonr enabled local
rpool feature@userobj_accounting active local
rpool feature@encryption enabled local
rpool feature@project_quota active local
rpool feature@device_removal enabled local
rpool feature@obsolete_counts enabled local
rpool feature@zpool_checkpoint enabled local
rpool feature@spacemap_v2 active local
rpool feature@allocation_classes enabled local
rpool feature@resilver_defer enabled local
rpool feature@bookmark_v2 enabled local
rpool feature@redaction_bookmarks enabled local
rpool feature@redacted_datasets enabled local
rpool feature@bookmark_written enabled local
rpool feature@log_spacemap active local
rpool feature@livelist active local
rpool feature@device_rebuild enabled local
rpool feature@zstd_compress enabled local
rpool feature@draid enabled local
Could zstd_compress and/or lz4_compress be the culprits of this poor performance? I have checked, and on the dataset in that rpool where I am running the tests, "compression" shows as "disabled".
No.
I would suggest you do what I said and look at what the IO on the disk looks like while you're measuring this, as I suspect you will find there are writes going on. As I said, mixing reads and writes can produce very different results from pure reads or pure writes, which is why it's a common benchmark pattern.
You are right: I see systemd-journald writing to that drive (it has written about 2 MB in the last 10 minutes). The published 4K random-access mixed r/w benchmarks for that drive are indeed ~5-13 MB/s. This means that during the read test, which lasts 1 minute, roughly 50 4K blocks were written (2 MB over 10 minutes is ~0.2 MB per minute), while ~262k 4K blocks were read.
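(For reference, the writes can be watched per process while the test runs; pidstat is part of sysstat, and iotop is a separate package:)

```sh
# Per-process disk read/write rates, refreshed every second.
pidstat -d 1
# Or interactively, showing only processes that are actually doing I/O:
iotop -o
```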
I have done another test: I have created a 3 GB file in that same pool, as well as another 3 GB file in the pool that has two striped SSDs. Then I have run a read test over both files with these parameters: fio --name=seqread --numjobs=1 --time_based --runtime=60s --ramp_time=2s --iodepth=8 --ioengine=libaio --verify=0 --group_reporting=1 --bs=1M --rw=read --size=3G
and the results I get are: pool with 1 SSD: 154 MB/s; pool with 2 SSDs: 30.5 MB/s.
I guess that the pool with 2 SSDs is showing this low performance because on that one I have some Docker containers dumping data (a few kB every 5-10 seconds), which triggers the behavior you explained before.
Then... the conclusion is that I have shitty SSDs that I should use only for backups, and that I should have them all replaced?
Update: this is getting weirder by the second: I have updated ZFS from 2.1.11 to 2.2.3, rebooted, and re-run the read test: pool with 1 SSD: 300 MB/s; pool with 2 SSDs: 600 MB/s.
Changing read to write in the test, the results remain consistent with previous measurements: pool with 1 SSD: 3.9 MB/s; pool with 2 SSDs: 4.6 MB/s.
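For completeness, the write variant of the test would look roughly like this (same parameters as the read job, with --rw switched; the exact command is assumed):

```sh
fio --name=seqwrite --numjobs=1 --time_based --runtime=60s --ramp_time=2s \
    --iodepth=8 --ioengine=libaio --verify=0 --group_reporting=1 \
    --bs=1M --rw=write --size=3G
```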
That would likely be my suggestion, yes.