# ZFS doesn't respect Linux kernel CPU isolation mechanisms
### System information
| Type | Version/Name |
|---|---|
| Distribution Name | ArchLinux |
| Distribution Version | Rolling |
| Linux Kernel | 4.19.48 |
| Architecture | x86_64 |
| ZFS Version | 0.8.0 |
| SPL Version | 0.8.0 |
### Describe the problem you're observing
`module/spl/spl-taskq.c` contains this code:

```c
	tqt->tqt_thread = spl_kthread_create(taskq_thread, tqt,
	    "%s", tq->tq_name);
	if (tqt->tqt_thread == NULL) {
		kmem_free(tqt, sizeof (taskq_thread_t));
		return (NULL);
	}

	if (spl_taskq_thread_bind) {
		last_used_cpu = (last_used_cpu + 1) % num_online_cpus();
		kthread_bind(tqt->tqt_thread, last_used_cpu);
	}
```
Thus, kthreads spawn either with the default cpumask or, if `spl_taskq_thread_bind=1` is set at module load, are bound to CPUs without regard for their availability to the scheduler. This can be a substantial source of latency, which is not acceptable on the many systems that use the `isolcpus` boot parameter to isolate designated "real-time" cores.

While `spl_taskq_thread_bind=1` prevents latency from thread migration on and off the RT CPUs, it can make things substantially worse by locking threads to arbitrary cores in a way that can't be changed with `taskset`, leaving an RT CPU saddled with the kthread for its entire lifetime.
Ideally, the modulo-based CPU selection would be replaced with something that uses the kernel's housekeeping API in `include/linux/sched/isolation.h` to get the cpumask of non-isolated CPUs, then uses `kthread_create_on_cpu` in `spl_kthread_create` and/or ~~`kthread_bind_mask`~~ to schedule and bind threads across non-RT cores only. Note, however, that this is an incomplete solution, because the kernel's interface for obtaining an `isolcpus` cpumask has changed several times across the versions supported by ZFS.
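For illustration, here is a minimal sketch of how the bind logic shown above might consult the housekeeping mask instead. This is my sketch, not a tested patch; it assumes a kernel where `housekeeping_cpumask()` and `HK_FLAG_DOMAIN` exist, and a real fix would need configure-time detection of the shifting interface:

```c
#include <linux/cpumask.h>
#include <linux/sched/isolation.h>

	if (spl_taskq_thread_bind) {
		/*
		 * Round-robin over the housekeeping CPUs (those still
		 * attached to scheduler domains, i.e. not listed in
		 * isolcpus=) rather than over all online CPUs.
		 */
		const struct cpumask *hk = housekeeping_cpumask(HK_FLAG_DOMAIN);
		int cpu = cpumask_next(last_used_cpu, hk);

		if (cpu >= nr_cpu_ids)		/* wrap around */
			cpu = cpumask_first(hk);

		last_used_cpu = cpu;
		kthread_bind(tqt->tqt_thread, cpu);
	}
```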
Various hacks can be done to try to prevent unbound kthreads from using isolated cores, and threads not bound with spl_taskq_thread_bind can be moved, but these solutions are iffy and incomplete at best. It would be great if ZFS respected isolcpus from the start.
### Describe how to reproduce the problem
Boot with `isolcpus`, capture a trace of the RT CPUs with `perf sched record` or another tracing mechanism, and observe ZFS-spawned kthreads moving on and off the isolated cores. This is the primary remaining source of latency on my local system.
### Include any warning/errors/backtraces from the system logs
A minimal quick-and-dirty patch that appears to work for me: https://github.com/sjuxax/zfs/commit/7c2a8969b4216f00c5742d3656a201109fcc77b3
It looks like `kthread_bind_mask` isn't exported, so I'm using `cpumask_next_wrap` instead. This iterates through the CPUs but skips those excluded from the housekeeping API's `HK_FLAG_DOMAIN` mask (i.e., the isolated ones). I'm sure there are other places where the affinity needs to be set, but at a glance, this appears to quiet things down a bit. 🤷‍♂️
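For the threads that aren't hard-bound, a hypothetical helper along these lines could narrow the allowed mask after creation. `spl_restrict_to_housekeeping()` is an invented name, not part of the linked patch, and note that `set_cpus_allowed_ptr()` is `EXPORT_SYMBOL_GPL`, which may be part of the licensing friction mentioned later in this thread:

```c
#include <linux/sched.h>
#include <linux/sched/isolation.h>

/* Hypothetical: confine an unbound kthread to housekeeping CPUs. */
static void
spl_restrict_to_housekeeping(struct task_struct *tsk)
{
	/* Only narrow the mask when isolation is actually configured. */
	if (housekeeping_enabled(HK_FLAG_DOMAIN))
		set_cpus_allowed_ptr(tsk,
		    housekeeping_cpumask(HK_FLAG_DOMAIN));
}
```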
@sjuxax your observations are correct. The other place you would have to do this is in __thread_create() in module/spl/spl-thread.c. You can see a very primitive example here: https://github.com/gamanakis/zfs/commit/b9bad20df375bc0c4245284089ac9df35df2214a
@sjuxax would you mind opening a PR with the proposed fix for taskqs and dedicated threads? Then we can get you some better feedback and shouldn't lose track of this again.
@behlendorf Would it additionally be worth having a cpulist as an SPL module parameter that would bind those threads to user-defined CPUs?
Has there been any progress on fixing this defect, please?
The CPU hotplugging work changes the relevant code: https://github.com/openzfs/zfs/pull/11212. Any fix for this issue will have to implement these changes in a hotplug-aware way.
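As a rough, hypothetical sketch of the hotplug-aware direction (not code from that PR): the dynamic CPU-hotplug states let the module react as CPUs come online, and the housekeeping check could be applied at that point:

```c
#include <linux/cpuhotplug.h>
#include <linux/sched/isolation.h>

/*
 * Hypothetical CPUHP_AP_ONLINE_DYN callback: when a CPU comes online,
 * only consider it for taskq threads if it is a housekeeping CPU.
 */
static int
spl_taskq_cpu_online(unsigned int cpu)
{
	if (!housekeeping_test_cpu(cpu, HK_FLAG_DOMAIN))
		return (0);	/* isolated CPU: leave it alone */

	/* ... spread taskq threads onto the newly onlined CPU ... */
	return (0);
}

	/* Registration at module init (sketch): */
	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "spl_taskq:online",
	    spl_taskq_cpu_online, NULL);
```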
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Has this been fixed?
On my system (zfs-2.1.2-1), using `isolcpus` and `spl_taskq_thread_bind` with either 0 or 1 has no effect (ZFS still uses a CPU that is excluded by `isolcpus`).
Interesting,
I saw your reply via email and tried it myself to confirm. I am on Archlinux here using:
- Kernel 5.16.13
- zfs-2.1.2-1
- zfs-kmod-2.1.2-1
- AMD Ryzen 9 3900X
- 32G DDR4 @ 3600MHz (2x F4-3600C16-16GTZNC)
- 2TB Corsair MP600 NVMe to read from as a test
My boot arguments were:
```
zfs=myPool/myRoot rw iommu=pt iommu=1 quiet acpi_enforce_resources=lax hugepagesz=1G hugepages=12 isolcpus=3-11,15-23 rcu_nocbs=3-11,15-23 nohz_full=3-11,15-23 systemd.unified_cgroup_hierarchy=0 rcu_nocb_poll irqaffinity=0,1,2,12,13,14
```
I opened htop on one screen and could already see that only cores 0,1,2 + 12,13,14 were given work by my host.
At this point I ran `pv /data/somelargefile.dat > /dev/null` in another terminal, and ZFS read it out at ~1.9 GB/s.

I could see the z_rd_int_0 (and incrementing) threads giving CPU threads 0,1,2,12,13,14 the workload of their life, but the other cores were left 100% idle. This wasn't the case before.

I tried another pv read from data in an encrypted dataset, and while the read speed was expectedly slower, it still executed only on the six CPU threads that were not isolated. I don't know why your situation is behaving differently.
Hello, thanks for the quick and detailed reply!
I forgot to mention that I am running NixOS unstable.
I tried to adapt my system as far as possible to your kernel parameters, now I have the following cmdline (hashes and PCI IDs removed for readability):
```
BOOT_IMAGE=(hd0,gpt2)//kernels/[...]-linux-5.15.27-bzImage init=/nix/store/[...]/init vfio-pci.ids=[...] amd_iommu=on iommu=pt iommu=1 acpi_enforce_resources=lax isolcpus=7-15 rcu_nocbs=7-15 nohz_full=7-15 rcu_nocb_poll irqaffinity=0,1,2,3,4,5,6 spl_taskq_thread_bind=1 nohibernate zfs_force=1 systemd.unified_cgroup_hierarchy=0 loglevel=4
```
The `spl_taskq_thread_bind=1` parameter does not seem to have any effect; I tried booting with and without it, and it made no difference.
As soon as my system is booted, there is some (less than 2%) kernel activity on core #14, along with userspace activity on 0-6. All other cores are silent.
However, I took a look in htop, and ZFS is not the only kernel process using that core, so there is likely something else entirely wrong on my part.

So, this is clearly some sort of user error on my side. If you have any suggestions or ideas, I would of course be very thankful nonetheless. If I find a solution, I will try to post it here too. Thanks!
Okay, so I think I figured it out, although the reasons why it is the way it is are beyond my understanding.
To make a long story short: if I leave a CPU core between 8 and 15 for the kernel, it uses that core; otherwise it just assigns a random one at boot time and stays stuck with it. So if I isolated all CPUs except 0 and 1, CPUs 2-7 would have no kernel processes, while a random CPU in 8-15 would end up hosting them. Basically, I followed the example set by @ipaqmaster and left some CPUs free in the upper range as well; now I can use 2-7 and 10-15 exclusively and have no more issues. Everything is nice and isolated.
Thank you!
I concur with @Jauchi that with `isolcpus=0-7`, z_trim_int still managed to get scheduled onto CPU 0. I'll dig deeper if I have time.
Just FYI: `isolcpus` doesn't work, but it seems some of the other command-line args do. I have `nohz_full=1-7,16-23 rcu_nocbs=0-7,16-23 irqaffinity=8-15,24-31 rcu_nocb_poll`, which seems to keep CPU threads 1-7 and 16-23 ZFS-free.
I also use spl.spl_taskq_thread_bind=0 spl.spl_taskq_thread_priority=0 which may matter.
I just tested (on fully updated Ubuntu 24.04) with the config shared by @IvanVolosyuk (with the exception of `nohz_full=1-7,17-23`; note the ticks on the sibling hyperthread), and I found ZFS pinned all its threads to core 0, which was bad news. I tried removing `nohz_full` and it was the same; no difference with `spl_taskq_thread_bind` set to 0 or 1.
I had to give the box back, so I ran out of time to test any more; if I can get a box with a similar CPU, I'll have another go.
My objective was to only allow ZFS to use cores 8-15 and 24-31 (one of those AMD X3D CPUs with chunky L3 cache on only those cores)
Yeah, I also have a 7950X3D CPU. I've used this config for a while without issues. My first CCD runs qemu/kvm with realtime priority (except for CPU/thread 0, which is left alone, as Linux still schedules things on it). I don't use `isolcpus`, but I do pin vCPU and I/O threads in qemu, and steer interrupt lines in the Linux kernel away from that CCD. In `top` I can observe ZFS using only the second CCD when doing heavy zstd compression in ZFS.
If you plan on buying that CPU, I would advise against it if you plan to do kvm+vfio: https://www.reddit.com/r/VFIO/comments/194ndu7/anyone_experiencing_host_random_reboots_using/
How do we get this reopened and/or raise a separate bug? This bug is closed so presumably nobody is looking at it anymore.
@behlendorf can we reopen this, as the issue still exists? I understand that there are some technical/licensing issues in making that happen.