# ZFS doesn't respect Linux kernel CPU isolation mechanisms
### System information
| Type | Version/Name |
|---|---|
| Distribution Name | ArchLinux |
| Distribution Version | Rolling |
| Linux Kernel | 4.19.48 |
| Architecture | x86_64 |
| ZFS Version | 0.8.0 |
| SPL Version | 0.8.0 |
### Describe the problem you're observing
`module/spl/spl-taskq.c` contains this code:

```c
	tqt->tqt_thread = spl_kthread_create(taskq_thread, tqt,
	    "%s", tq->tq_name);
	if (tqt->tqt_thread == NULL) {
		kmem_free(tqt, sizeof (taskq_thread_t));
		return (NULL);
	}

	if (spl_taskq_thread_bind) {
		last_used_cpu = (last_used_cpu + 1) % num_online_cpus();
		kthread_bind(tqt->tqt_thread, last_used_cpu);
	}
```
Thus, kthreads spawn either with the default cpumask or, if `spl_taskq_thread_bind=1` is set at module load, are bound to CPUs without regard for their availability to the scheduler. This can be a substantial source of latency, which is not acceptable on the many systems that use the `isolcpus` boot parameter to isolate designated "real-time" cores.

While `spl_taskq_thread_bind=1` prevents latency from thread migration on and off the RT CPUs, it can make things substantially worse by locking threads to arbitrary cores in a way that can't be changed with `taskset`, leaving an RT CPU saddled with the kthread for its entire lifetime.
Ideally, the modulo-based CPU selection would be replaced with something that uses the kernel's housekeeping API in `include/linux/sched/isolation.h` to get the cpumask of non-isolated CPUs, then uses `kthread_create_on_cpu` in `spl_kthread_create` and/or ~~`kthread_bind_mask`~~ to schedule and bind threads across non-RT cores only. Note, however, that this is an incomplete solution, because the kernel's interface for obtaining an `isolcpus` cpumask has changed several times across the versions supported by ZFS.
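For illustration, here is a minimal sketch of how the bind logic shown above might consult the housekeeping mask instead. This is my sketch, not a tested patch; it assumes a kernel where `housekeeping_cpumask()` and `HK_FLAG_DOMAIN` exist, and a real fix would need configure-time detection of the shifting interface:

```c
#include <linux/cpumask.h>
#include <linux/sched/isolation.h>

	if (spl_taskq_thread_bind) {
		/*
		 * Round-robin over the housekeeping CPUs (those still
		 * attached to scheduler domains, i.e. not listed in
		 * isolcpus=) rather than over all online CPUs.
		 */
		const struct cpumask *hk = housekeeping_cpumask(HK_FLAG_DOMAIN);
		int cpu = cpumask_next(last_used_cpu, hk);

		if (cpu >= nr_cpu_ids)		/* wrap around */
			cpu = cpumask_first(hk);

		last_used_cpu = cpu;
		kthread_bind(tqt->tqt_thread, cpu);
	}
```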
Various hacks can be done to try to prevent unbound kthreads from using isolated cores, and threads not bound with spl_taskq_thread_bind can be moved, but these solutions are iffy and incomplete at best. It would be great if ZFS respected isolcpus from the start.
### Describe how to reproduce the problem
Boot with `isolcpus`, capture a trace of the RT CPUs with `perf sched record` or another tracing mechanism, and observe ZFS-spawned kthreads moving on and off the isolated cores. This is the primary remaining source of latency on my local system.
### Include any warning/errors/backtraces from the system logs
A minimal quick-and-dirty patch that appears to work for me: https://github.com/sjuxax/zfs/commit/7c2a8969b4216f00c5742d3656a201109fcc77b3
It looks like `kthread_bind_mask` isn't exported, so I'm using `cpumask_next_wrap` instead. This iterates through the CPUs but skips those excluded from the housekeeping API's `HK_FLAG_DOMAIN` mask (i.e., the isolated ones). I'm sure there are other places where the affinity needs to be set, but at a glance, this appears to quiet things down a bit. 🤷‍♂️
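For the threads that aren't hard-bound, a hypothetical helper along these lines could narrow the allowed mask after creation. `spl_restrict_to_housekeeping()` is an invented name, not part of the linked patch, and note that `set_cpus_allowed_ptr()` is `EXPORT_SYMBOL_GPL`, which may be part of the licensing friction mentioned later in this thread:

```c
#include <linux/sched.h>
#include <linux/sched/isolation.h>

/* Hypothetical: confine an unbound kthread to housekeeping CPUs. */
static void
spl_restrict_to_housekeeping(struct task_struct *tsk)
{
	/* Only narrow the mask when isolation is actually configured. */
	if (housekeeping_enabled(HK_FLAG_DOMAIN))
		set_cpus_allowed_ptr(tsk,
		    housekeeping_cpumask(HK_FLAG_DOMAIN));
}
```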
@sjuxax your observations are correct. The other place you would have to do this is in __thread_create() in module/spl/spl-thread.c. You can see a very primitive example here: https://github.com/gamanakis/zfs/commit/b9bad20df375bc0c4245284089ac9df35df2214a
@sjuxax would you mind opening a PR with the proposed fix for taskqs and dedicated threads? Then we can get you some better feedback and shouldn't lose track of this again.
@behlendorf Would it additionally be worth having a cpulist as an SPL module parameter that would bind those threads to user-defined CPUs?
Has there been any progress on fixing this defect, please?
The CPU hotplugging work changes the relevant code: https://github.com/openzfs/zfs/pull/11212. Any fix for this issue will have to implement these changes in a hotplug-aware way.
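As a rough, hypothetical sketch of the hotplug-aware direction (not code from that PR): the dynamic CPU-hotplug states let the module react as CPUs come online, and the housekeeping check could be applied at that point:

```c
#include <linux/cpuhotplug.h>
#include <linux/sched/isolation.h>

/*
 * Hypothetical CPUHP_AP_ONLINE_DYN callback: when a CPU comes online,
 * only consider it for taskq threads if it is a housekeeping CPU.
 */
static int
spl_taskq_cpu_online(unsigned int cpu)
{
	if (!housekeeping_test_cpu(cpu, HK_FLAG_DOMAIN))
		return (0);	/* isolated CPU: leave it alone */

	/* ... spread taskq threads onto the newly onlined CPU ... */
	return (0);
}

	/* Registration at module init (sketch): */
	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "spl_taskq:online",
	    spl_taskq_cpu_online, NULL);
```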
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Has this been fixed?
On my system (zfs-2.1.2-1), using `isolcpus` and `spl_taskq_thread_bind` with either 0 or 1 has no effect (ZFS still uses a CPU that is excluded by `isolcpus`).
Interesting,
I saw your reply via email and tried it myself to confirm. I am on Archlinux here using:
- Kernel 5.16.13
- zfs-2.1.2-1
- zfs-kmod-2.1.2-1
- AMD Ryzen 9 3900X
- 32G DDR4 @ 3600MHz (2x F4-3600C16-16GTZNC)
- 2TB Corsair MP600 NVMe to read from as a test
My boot arguments were:
```
zfs=myPool/myRoot rw iommu=pt iommu=1 quiet acpi_enforce_resources=lax hugepagesz=1G hugepages=12 isolcpus=3-11,15-23 rcu_nocbs=3-11,15-23 nohz_full=3-11,15-23 systemd.unified_cgroup_hierarchy=0 rcu_nocb_poll irqaffinity=0,1,2,12,13,14
```
I opened htop on one screen and could already see that only cores 0,1,2 + 12,13,14 were given work by my host.
At this point I ran `pv /data/somelargefile.dat > /dev/null` in another terminal, and ZFS read it out at ~1.9 GB/s.

I could see the z_rd_int_0 (and incrementing) threads giving CPU threads 0,1,2,12,13,14 the workload of their life, but the other cores were left 100% idle. This wasn't the case before.

I tried another pv read from data in an encrypted dataset, and while the read speed was expectedly slower, it still executed only on the six CPU threads that were not isolated. I don't know why your situation is behaving differently.
Hello, thanks for the quick and detailed reply!
I forgot to mention that I am running NixOS unstable.
I tried to adapt my system as far as possible to your kernel parameters, now I have the following cmdline (hashes and PCI IDs removed for readability):
```
BOOT_IMAGE=(hd0,gpt2)//kernels/[...]-linux-5.15.27-bzImage init=/nix/store/[...]/init vfio-pci.ids=[...] amd_iommu=on iommu=pt iommu=1 acpi_enforce_resources=lax isolcpus=7-15 rcu_nocbs=7-15 nohz_full=7-15 rcu_nocb_poll irqaffinity=0,1,2,3,4,5,6 spl_taskq_thread_bind=1 nohibernate zfs_force=1 systemd.unified_cgroup_hierarchy=0 loglevel=4
```
The `spl_taskq_thread_bind=1` parameter does not seem to have any effect; I tried booting with and without it, and it made no difference.
As soon as my system is booted, there is some (less than 2%) kernel activity on core #14, along with userspace activity on 0-6. All other cores are silent.
However, I took a look in htop, and ZFS is not the only kernel process using that core, so there is likely something else entirely wrong on my part.

So, this is clearly some sort of user error on my side. If you have any suggestions or ideas, I would of course be very thankful nonetheless. If I find a solution, I will try to post it here too. Thanks!
Okay, so I think I figured it out, although the reasons why it is the way it is are beyond my understanding.
To make a long story short: if I leave a CPU core between 8 and 15 for the kernel, it uses that core; otherwise it just assigns a random one at boot time and stays stuck with it. So if I isolated all CPUs except 0 and 1, CPUs 2-7 would have no kernel processes, while a random CPU in 8-15 would end up hosting them. Basically, I followed the example set by @ipaqmaster and left some CPUs free in the upper range as well; now I can use 2-7 and 10-15 exclusively and have no more issues. Everything is nice and isolated.
Thank you!
I concur with @Jauchi that with `isolcpus=0-7`, z_trim_int still managed to get scheduled onto CPU 0. I'll dig deeper if I have time.
Just FYI: `isolcpus` doesn't work, but it seems some of the other command-line args do. I have `nohz_full=1-7,16-23 rcu_nocbs=0-7,16-23 irqaffinity=8-15,24-31 rcu_nocb_poll`, which seems to keep CPU threads 1-7 and 16-23 ZFS-free.
I also use spl.spl_taskq_thread_bind=0 spl.spl_taskq_thread_priority=0 which may matter.
I just tested (on fully updated Ubuntu 24.04) with the config shared by @IvanVolosyuk (with the exception of `nohz_full=1-7,17-23`; note the ticks on the sibling hyperthread), and I found ZFS pinned all its threads to core 0, which was bad news. I tried removing `nohz_full` and it was the same; no difference with `spl_taskq_thread_bind` set to 0 or 1.
I had to give the box back, so I ran out of time to test any more; if I can get a box with a similar CPU, I'll have another go.
My objective was to only allow ZFS to use cores 8-15 and 24-31 (one of those AMD X3D CPUs with chunky L3 cache on only those cores)
Yeah, I also have a 7950X3D CPU. I've used this config for a while without issues. My first CCD runs qemu/kvm with realtime priority (except for CPU/thread 0, which is left alone, as Linux still schedules things on it). I don't use `isolcpus`, but I do pin vCPU and I/O threads in qemu, and steer interrupt lines in the Linux kernel away from that CCD. In `top` I can observe ZFS using only the second CCD when doing heavy zstd compression in ZFS.
If you plan on buying that CPU, I would advise against it if you plan to do kvm+vfio: https://www.reddit.com/r/VFIO/comments/194ndu7/anyone_experiencing_host_random_reboots_using/
How do we get this reopened and/or raise a separate bug? This bug is closed so presumably nobody is looking at it anymore.
@behlendorf can we reopen this, as the issue still exists? I understand that there are some technical/licensing issues in making that happen.