
Scrub makes a noticeable impact on normal workloads

Status: Open • rkojedzinszky opened this issue 5 months ago • 9 comments

System information

Type                  Version/Name
Distribution Name     TrueNAS CORE
Distribution Version  13.1
Kernel Version        13.1-RELEASE-p5
Architecture          amd64
OpenZFS Version       2.1.7

Describe the problem you're observing

We monitor our TrueNAS system's IOPS performance by issuing a single write operation each second and measuring its latency. During scrubs we see very high latencies, 5+ seconds. This heavily impacts production applications, and it happens only during scrubs.

[Image: write-latency graph from the monitoring system showing multi-second spikes during the scrub]
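
For reference, a minimal sketch of this kind of probe, not the reporter's actual monitoring tool: it assumes a Linux client (GNU dd and date) with the exported storage mounted at a hypothetical path /mnt/probe, and issues one small synchronous write per second, printing the latency in milliseconds.

# Hypothetical client-side write-latency probe (sketch).
while true; do
    start=$(date +%s%N)                                   # nanoseconds before the write
    dd if=/dev/zero of=/mnt/probe/latency-probe.bin \
        bs=4k count=1 oflag=dsync status=none             # one synchronous 4 KiB write
    end=$(date +%s%N)                                     # nanoseconds after the write
    echo "$(date -u +%FT%TZ) write_latency_ms=$(( (end - start) / 1000000 ))"
    sleep 1
done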

Describe how to reproduce the problem

In the lab I cannot reproduce it; however, on our production system it happens during scrub sessions.

As a workaround I lowered vfs.zfs.scan_vdev_limit from its default of 4M to 128K, which mitigates the problem. However, I suspect our system is not so big that it should need special tuning; the ZFS defaults ought to be suitable for us as well.
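
For context, on FreeBSD-based TrueNAS CORE this limit can be changed at runtime with sysctl. The persistence notes below are an assumption about a typical setup, since TrueNAS normally wants such settings added as tunables in its web UI rather than in hand-edited config files.

# Lower the per-vdev scan (scrub/resilver) I/O limit from the 4M default to 128 KiB:
sysctl vfs.zfs.scan_vdev_limit=131072

# Verify the current value:
sysctl vfs.zfs.scan_vdev_limit

# On plain FreeBSD this could be persisted in /etc/sysctl.conf as
#   vfs.zfs.scan_vdev_limit=131072
# On TrueNAS CORE, add it under System -> Tunables (type "sysctl") instead.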

Have there been updates to ZFS that change scrub impact, or do I need to stick with tunables?

rkojedzinszky • Jul 19 '25 08:07

Keep in mind what a scrub is and what it does: a scrub checks the integrity of the pool and, if required and possible, repairs bad data. For that, at the very least every block has to be read and its checksum verified; on layouts with parity, the parity has to be checked as well. A scrub therefore puts a high overall load on the drives, the controllers and the system; it is not a light task quietly chugging along in the background.

Also, your screenshots show iSCSI, so along with ZFS itself there is a network stack and iSCSI on top of it. Maybe that is why you cannot reproduce this on the lab machine: because you run locally there? Overall, there is more than just ZFS at play here, and high load during a scrub is typical for any RAID array, because it is an active check for, and repair of, defects.

If you have to rely on your setup, the best approach would be to figure out how long a scrub takes and check whether you have a time window it fits into. Otherwise you may have to rethink your storage implementation, design a better one and plan for a migration. I doubt there is much ZFS itself can do here, since this is a more general property of multi-drive arrays.
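
As a hedged illustration of the "time window" suggestion above: zpool itself reports scrub progress and can pause a scrub, so the duration can be measured and the scrub shifted into a quiet period. The pool name tank is a placeholder.

# Show scrub progress, speed and estimated completion time:
zpool status tank

# Pause a running scrub (it remembers its position), then resume it later:
zpool scrub -p tank     # pause
zpool scrub tank        # resume a paused scrub, or start a new one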

n0xena • Jul 21 '25 15:07

Although 2.1.7 is pretty old and by now out of support, it already seems to include my scrub and I/O scheduler optimization work to reduce the chance of I/O starvation. What I observed there is that the disks themselves prefer sequential scrub I/Os over random payload I/Os. That is out of ZFS's control, and even slowing the scrub I/O down may not always help, but you may look at scheduler settings like zfs_vdev_nia_delay and zfs_vdev_nia_credit.
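
For reference, a sketch of inspecting and adjusting those two knobs on FreeBSD-based TrueNAS CORE; the values shown are only examples (they match what the reporter ends up using below). On Linux the same parameters appear as zfs_vdev_nia_delay and zfs_vdev_nia_credit under /sys/module/zfs/parameters/.

# Current values of the non-interactive I/O scheduler knobs:
sysctl vfs.zfs.vdev.nia_delay vfs.zfs.vdev.nia_credit

# Example change: keep scrub/resilver throttled for longer after interactive
# I/O (nia_delay) and allow fewer scrub I/Os in flight while interactive I/O
# is outstanding (nia_credit):
sysctl vfs.zfs.vdev.nia_delay=20
sysctl vfs.zfs.vdev.nia_credit=1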

amotin • Jul 21 '25 18:07

@amotin thanks!

Now we have the following tunables set:

vfs.zfs.vdev.nia_credit=1
vfs.zfs.vdev.nia_delay=20
vfs.zfs.vdev.scrub_max_active=1
vfs.zfs.scan_vdev_limit=131072
vfs.zfs.vdev.async_read_max_active=1
vfs.zfs.vdev.async_write_max_active=1

With these, and only during scrubs, we still occasionally observe stalls. A few hours ago it was really noticeable: read and write operations were hanging for a few minutes:

[Image: monitoring graph for the period in question]

Without a running scrub we don't have any issues with our NAS. Such outages are actually rare, and I don't really know how I could catch one or what I could do to investigate. Are there dtrace scripts or something similar that might be helpful during such a period, if I eventually manage to catch one?
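
A sketch of something that could be left running to catch such a period on FreeBSD: a classic DTrace io-provider one-liner that histograms block-device I/O latency. This is a generic pattern rather than a TrueNAS-specific script; gstat gives a cruder live view.

# Histogram of block I/O completion latency in milliseconds (Ctrl-C to print):
dtrace -n '
io:::start { ts[arg0] = timestamp; }
io:::done /ts[arg0]/ {
    @lat_ms = quantize((timestamp - ts[arg0]) / 1000000);
    ts[arg0] = 0;
}'

# Simpler live per-disk view of queue depth, busy % and service times:
gstat -p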

rkojedzinszky • Jul 22 '25 13:07

I can't say anything without deeper stats. Have you looked at disk, CPU and memory stats during those times? Is the NAS responsive in general, and is the stats collection itself (as another indicator) working reliably?
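
For completeness, a few stock commands that would capture those stats during a stall; the pool name tank is a placeholder. zpool iostat -l in particular breaks request latency down by queue class, including scrub.

# Per-vdev latency breakdown every 5 seconds (total/disk/queue waits, scrub, trim):
zpool iostat -lv tank 5

# I/O latency histograms per vdev:
zpool iostat -w tank

# CPU and memory overview, including kernel threads:
top -SH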

amotin • Jul 22 '25 14:07

My fault, apologies: in this case our monitoring system was simply down, so metrics were not being collected. I was confused, sorry again. However, similar behavior has been observed many times in the past, and at first I thought this was the same symptom. Very sorry, again.

But then, @amotin, do you think the tunables above are conservative enough to reduce the impact of a scrub and prioritize normal I/O?

rkojedzinszky • Jul 22 '25 15:07

Conservative? It seems you've turned everything available "up to 11". Also, the "async" knobs are not about scrub: one is about read-ahead, the other about write-back. I would not touch those unless you want to give maximum priority to random reads.

amotin • Jul 22 '25 15:07

"Up to 11"? Those max values are higher by default; I've lowered them to 1. What am I understanding wrong? My vdev queue limits are now:

# sysctl vfs.zfs.vdev|grep -E "(min|max)_act"
vfs.zfs.vdev.rebuild_min_active: 1
vfs.zfs.vdev.rebuild_max_active: 3
vfs.zfs.vdev.trim_min_active: 1
vfs.zfs.vdev.trim_max_active: 2
vfs.zfs.vdev.sync_write_min_active: 10
vfs.zfs.vdev.sync_write_max_active: 10
vfs.zfs.vdev.sync_read_min_active: 10
vfs.zfs.vdev.sync_read_max_active: 10
vfs.zfs.vdev.scrub_min_active: 1
vfs.zfs.vdev.scrub_max_active: 1
vfs.zfs.vdev.removal_min_active: 1
vfs.zfs.vdev.removal_max_active: 2
vfs.zfs.vdev.initializing_min_active: 1
vfs.zfs.vdev.initializing_max_active: 1
vfs.zfs.vdev.async_write_min_active: 1
vfs.zfs.vdev.async_write_max_active: 1
vfs.zfs.vdev.async_read_min_active: 1
vfs.zfs.vdev.async_read_max_active: 1
vfs.zfs.vdev.max_active: 1000

I hope the async classes affect zfs sends; it does not matter if those get a bit slower.

rkojedzinszky • Jul 22 '25 16:07

I just meant that you've done as much as possible within the current design, IIRC. Not that it is bad.

amotin • Jul 22 '25 16:07

Thank you, these tunables helped me as well, although I left the async ones alone.

Ideally scrub and resilver should not share any tunables: the ideal behaviour would be slow scrubs that don't interfere with interactive I/O, but fast resilvers to get out of the degraded state as quickly as possible.
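
For what it's worth, part of that split already exists: the pacing knobs are separate per operation, while the scan size limit and the vdev queue class are shared. The sysctl names below follow the usual FreeBSD mapping of the Linux module parameters and should be treated as an assumption where they do not appear earlier in this thread.

# Separate pacing knobs for the two operations:
sysctl vfs.zfs.scrub_min_time_ms        # zfs_scrub_min_time_ms on Linux
sysctl vfs.zfs.resilver_min_time_ms     # zfs_resilver_min_time_ms on Linux

# Shared by scrub and (healing) resilver traffic:
sysctl vfs.zfs.scan_vdev_limit
sysctl vfs.zfs.vdev.scrub_max_active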

chrcoluk • Dec 02 '25 08:12