
Dynamically reducing zfetch_max_distance causes CPU soft lockup

Open foxx opened this issue 3 years ago • 7 comments

Dynamically reducing zfetch_max_distance results in a soft lockup. This was discovered through a typo during testing, when the first digit of the value was accidentally dropped. Increasing the value again did not clear the condition; a reboot was required.

How to reproduce:

$ invoke some sort of heavy read load (~500 MB/s)
$ echo 4147483648 > /sys/module/zfs/parameters/zfetch_max_distance
$ echo 147483648 > /sys/module/zfs/parameters/zfetch_max_distance
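
For a more concrete version of the above, a minimal sketch; the dd invocation and file path are only hypothetical stand-ins for the read load, and it is worth recording the current value before changing it:

$ cat /sys/module/zfs/parameters/zfetch_max_distance              # note the current value first
$ dd if=/tank/some-large-file of=/dev/null bs=1M &                # hypothetical sequential read load
$ echo 4147483648 > /sys/module/zfs/parameters/zfetch_max_distance
$ echo 147483648 > /sys/module/zfs/parameters/zfetch_max_distance # the reduction under load is what triggered the lockup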

Versions:

~# zfs -V
zfs-2.0.3-9~bpo10+1
zfs-kmod-2.0.3-9~bpo10+1

Debian 10 latest

Linux hostname-1 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux

Kernel log

kernel:[1667823.937275] watchdog: BUG: soft lockup - CPU#18 stuck for 23s! [migration/18:103]

[1667791.913918] watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [tar:35371]
[1667791.913974] Modules linked in: binfmt_misc macvlan bonding zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) ses enclosure ipmi_ssif intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel mgag200 joydev intel_cstate ttm drm_kms_helper hid_generic evdev intel_uncore usbhid pcc_cpufreq dcdbas drm intel_rapl_perf hid wdat_wdt pcspkr i2c_algo_bit sg mei_me mei ioatdma ipmi_si ipmi_devintf wmi ipmi_msghandler acpi_power_meter button ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb btrfs xor zstd_decompress zstd_compress xxhash raid6_pq libcrc32c crc32c_generic sd_mod crc32c_intel ahci libahci libata aesni_intel ehci_pci ehci_hcd mpt3sas aes_x86_64 crypto_simd
[1667791.914029]  megaraid_sas cryptd nvme raid_class glue_helper ixgbe usbcore scsi_transport_sas nvme_core tg3 lpc_ich scsi_mod mfd_core usb_common libphy dca mdio
[1667791.914043] CPU: 9 PID: 35371 Comm: tar Tainted: P           OEL    4.19.0-17-amd64 #1 Debian 4.19.194-3
[1667791.914044] Hardware name: Dell Inc. PowerEdge R720/0VRCY5, BIOS 2.9.0 12/06/2019
[1667791.914158] RIP: 0010:dbuf_prefetch+0xd/0x5a0 [zfs]
[1667791.914161] Code: e9 fc ff ff e9 41 8e 17 00 e8 bf 96 49 fb 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 <55> 53 48 81 ec d8 00 00 00 89 0c 24 44 89 44 24 0c 65 48 8b 04 25
[1667791.914163] RSP: 0018:ffffaa8d6e533c00 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[1667791.914165] RAX: 0000000000000000 RBX: 00000000ffff88cc RCX: 0000000000000002
[1667791.914166] RDX: 000000013cdb7340 RSI: 0000000000000001 RDI: ffff89b58eb74908
[1667791.914168] RBP: 000000013cdb7341 R08: 0000000000000020 R09: 0000000000000001
[1667791.914169] R10: 0000000000000051 R11: ffffffffffff8acd R12: 003fffffffffffe4
[1667791.914170] R13: ffff89b58eb74c88 R14: ffff89b58eb74ca8 R15: ffff89c12e262878
[1667791.914172] FS:  00007f3a93b51800(0000) GS:ffff89c20f700000(0000) knlGS:0000000000000000
[1667791.914173] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1667791.914174] CR2: 00007f7c59231000 CR3: 000000050c528003 CR4: 00000000001606e0
[1667791.914176] Call Trace:
[1667791.914241]  dmu_zfetch+0x4b3/0x5d0 [zfs]
[1667791.914301]  dmu_buf_hold_array_by_dnode+0x389/0x490 [zfs]
[1667791.914359]  dmu_read_uio_dnode+0x47/0xf0 [zfs]
[1667791.914438]  ? zfs_rangelock_enter_impl+0x25b/0x560 [zfs]
[1667791.914443]  ? copyin+0x25/0x30
[1667791.914499]  dmu_read_uio_dbuf+0x3f/0x60 [zfs]
[1667791.914576]  zfs_read+0x135/0x3b0 [zfs]
[1667791.914654]  zpl_iter_read+0xe0/0x180 [zfs]
[1667791.914661]  new_sync_read+0xf8/0x160
[1667791.914665]  vfs_read+0x91/0x140
[1667791.914668]  ksys_read+0x57/0xd0
[1667791.914673]  do_syscall_64+0x53/0x110
[1667791.914679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1667791.914682] RIP: 0033:0x7f3a93ce2461
[1667791.914685] Code: fe ff ff 50 48 8d 3d fe d0 09 00 e8 e9 03 02 00 66 0f 1f 84 00 00 00 00 00 48 8d 05 99 62 0d 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 c3 66 0f 1f 44 00 00 41 54 49 89 d4 55 48
[1667791.914686] RSP: 002b:00007ffcf5174f48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[1667791.914688] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007f3a93ce2461
[1667791.914689] RDX: 0000000000002800 RSI: 0000556504911000 RDI: 0000000000000003
[1667791.914690] RBP: 00005565049167c0 R08: 0000000000000000 R09: 0000556504916880
[1667791.914691] R10: 0000000000000124 R11: 0000000000000246 R12: 0000556504911000
[1667791.914693] R13: 0000000000000003 R14: 0000000000000030 R15: 0000000000002800
[1667795.937837] watchdog: BUG: soft lockup - CPU#18 stuck for 23s! [migration/18:103]

foxx avatar Nov 09 '21 00:11 foxx

This is a duplicate of #9411.

Unfortunately, developers don't seem to be interested in fixing it and it was automatically closed...

vstax avatar Nov 10 '21 18:11 vstax

@foxx @vstax would it be possible for you to try and reproduce this issue with the 2.1.1 release? The zfetch code was significantly reworked between the 2.0 and 2.1 releases; it would be helpful to get a backtrace from the newer code and to confirm whether the issue still exists.

behlendorf avatar Nov 10 '21 20:11 behlendorf

Thanks @behlendorf, I'll try to reproduce with our test server under 2.1.1. I'll take a look at this over the weekend and will come back next week.

foxx avatar Nov 11 '21 14:11 foxx

Cannot reproduce this with 2.1.99-1

h1z1 avatar Nov 14 '21 13:11 h1z1

Possibly related to this issue: when testing on two separate systems (both on zfs-2.1.5 | 5.15.39-2-pve) with zfetch_max_distance = 268435456, one system hit soft lockups while exporting an idle zpool, and the other hit them while issuing zpool scrub -s against a pool that was scrubbing. Logs showed a number of multipath timeouts during the same period (likely a consequence of the soft lock), with the failed paths being reinstated a few seconds later.

The high zfetch_max_distance came from tuning for high sequential throughput (file transfer / scrub) from a very wide raidz3 vdev. The performance difference for this case was significant.
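
For anyone applying the same tuning: values written under /sys/module/zfs/parameters/ do not persist across a reboot, so a persistent setting is normally done via a module option. A minimal sketch, assuming the conventional /etc/modprobe.d/zfs.conf location (the filename itself is just a convention):

# /etc/modprobe.d/zfs.conf -- applied when the zfs module loads
options zfs zfetch_max_distance=268435456

On Debian/Proxmox the initramfs may also need regenerating (update-initramfs -u) for the option to take effect at boot.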

malventano avatar Sep 12 '22 16:09 malventano

@malventano Do you have some reason to think it is related other than the set parameter?

BTW, 884364ea85423fdd431f6269c42228ab74c4846d (now in 2.1.6-staging) recently further improved the prefetcher logic to use a bigger prefetch when it is needed, including a bigger default distance, among other changes.

amotin avatar Sep 12 '22 22:09 amotin

@malventano Do you have some reason to think it is related other than the set parameter?

The thought was based on these two systems only seeing CPU soft lockups when manipulating zpools with this parameter set to a non-default value. A few other more common tweaks were already in place (zfs_dirty_data_max = 8GB and zfs_txg_timeout = 10s), but those never resulted in soft lockups until zfetch_max_distance was altered. The overlap with the OP's issue seemed sufficient to suspect a relation: although the OP was reducing the value while I had increased it, the same parameter was altered in both cases, and two different subsequent zpool operations on two separate systems triggered the same soft lockup behavior with the change present.

BTW, 884364e (now in 2.1.6-staging) recently further improved the prefetcher logic to use a bigger prefetch when it is needed, including a bigger default distance, among other changes.

That's a great addition, and it should remove the need for most folks to adjust zfetch_max_distance. In my testing, 64MB was already running into diminishing returns and got me most of the way to the peak sequential read throughput observed with it set to 256MB, so the choice of 64MB as the new default max is a good one, especially given that the prefetch will now build up to that value only when/as necessary. Thanks for pointing me to that.
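
When experimenting with these values, the prefetcher kstats give a rough idea of whether a larger distance is actually paying off. A minimal sketch; the exact counter names in /proc/spl/kstat/zfs/zfetchstats vary between releases, so the hits/misses fields here are illustrative:

$ cat /sys/module/zfs/parameters/zfetch_max_distance
$ grep -E 'hits|misses' /proc/spl/kstat/zfs/zfetchstats   # compare before and after a sequential-read run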

malventano avatar Sep 13 '22 01:09 malventano

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 16 '23 21:09 stale[bot]