ZTS: FreeBSD 13 panics on cp_stress.ksh (reproducible)

Open tonyhutter opened this issue 1 year ago • 4 comments

System information

Type Version/Name
Distribution Name FreeBSD
Distribution Version 13.2-RELEASE-p10
Kernel Version
Architecture x86-64
OpenZFS Version master (c98295eed2687cee704ef5f8f3218d3d44a6a1d8)

Describe the problem you're observing

You can easily panic FreeBSD 13 by running the cp_stress.ksh ZTS test.

Describe how to reproduce the problem

Run the cp_stress tests on FreeBSD 13:

 ./scripts/zfs-tests.sh -x -t `pwd`/tests/zfs-tests/tests/functional/cp_files/cp_stress.ksh

Sometimes it passes, but typically it will panic after 1-3 tries. I hit it running on a VM with 4 vCPUs.
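
For reference, a minimal way to keep it looping until it trips (same command as above; on an affected system the kernel usually panics within a few iterations, so expect the machine to reboot rather than the loop exiting cleanly):

 # Re-run cp_stress until it fails; on an affected system the box
 # usually panics (and reboots) after a few passes.
 while ./scripts/zfs-tests.sh -x -t \
     "$(pwd)/tests/zfs-tests/tests/functional/cp_files/cp_stress.ksh"; do
     echo "passed, trying again"
 done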

Include any warning/errors/backtraces from the system logs

panic: (link->list_next == NULL) is equivalent to (link->list_prev == NULL)
cpuid = 2
time = 1719250843
KDB: stack backtrace:
#0 0xffffffff80c53ff5 at kdb_backtrace+0x65
#1 0xffffffff80c06971 at vpanic+0x151
#2 0xffffffff824419ba at spl_panic+0x3a
#3 0xffffffff82440095 at list_link_active+0x55
#4 0xffffffff824ec3d3 at dnode_is_dirty+0x93
#5 0xffffffff824c6e87 at dmu_offset_next+0x57
#6 0xffffffff8264eb0d at zfs_holey+0x14d
#7 0xffffffff8246272f at zfs_freebsd_ioctl+0x4f
#8 0xffffffff80cf9474 at vn_ioctl+0x1a4
#9 0xffffffff80cf9dac at vn_seek+0x20c
#10 0xffffffff80cf289b at kern_lseek+0x6b
#11 0xffffffff810b289c at amd64_syscall+0x10c
#12 0xffffffff81089a8b at fast_syscall_common+0xf8
Uptime: 27m0s
Dumping 406 out of 4062 MB:..4%..12%..24%..32%..44%..52%..63%..71%..83%..91%

tonyhutter avatar Jun 24 '24 17:06 tonyhutter

Unable to reproduce on 13.2-RELEASE-p11, OpenZFS c98295e. Tried VMs with 2 and 4 cores, and 2G and 16G of RAM.

Typical run:

robn@freebsd13:~/zfs $ ./scripts/zfs-tests.sh -Dvxt cp_stress

--- Cleanup ---
Removing pool(s):
Removing loopback(s):
Removing files(s):

--- Configuration ---
Runfiles:        /var/tmp/zfs-tests.2773.run
STF_TOOLS:       /home/robn/zfs/tests/test-runner
STF_SUITE:       /home/robn/zfs/tests/zfs-tests
STF_PATH:        /home/robn/zfs/tests/zfs-tests/bin
FILEDIR:         /var/tmp
FILES:           /var/tmp/file-vdev0 /var/tmp/file-vdev1 /var/tmp/file-vdev2
LOOPBACKS:       md0 md1 md2
DISKS:           md0 md1 md2
NUM_DISKS:       3
FILESIZE:        4G
ITERATIONS:      1
TAGS:            functional
STACK_TRACER:    no
Keep pool(s):    rpool
Missing util(s): arc_summary arcstat zilstat dbufstat mount.zfs zed zgenhostid devname2devid file_fadvise getversion mmap_libaio randfree_file read_dos_attributes renameat2 user_ns_exec write_dos_attributes xattrtest zed_fd_spill-zedlet idmap_util fio net pamtester rsync

/home/robn/zfs/tests/test-runner/bin/test-runner.py  -D   -c "/var/tmp/zfs-tests.2773.run" -T "functional" -i "/home/robn/zfs/tests/zfs-tests" -I "1"
NOTE: begin default_setup_noexit
SUCCESS: zpool create -f testpool md0
SUCCESS: zfs create testpool/testfs
SUCCESS: zfs set mountpoint=/var/tmp/testdir testpool/testfs
Test: /home/robn/zfs/tests/zfs-tests/tests/functional/cp_files/setup (run as root) [00:00] [PASS]
ASSERTION: Run the 'seekflood' binary repeatedly to try to trigger #15526
SUCCESS: mkdir /testpool/cp_stress
SUCCESS: /home/robn/zfs/tests/zfs-tests/tests/functional/cp_files/seekflood 2000 4
SUCCESS: /home/robn/zfs/tests/zfs-tests/tests/functional/cp_files/seekflood 2000 4
SUCCESS: /home/robn/zfs/tests/zfs-tests/tests/functional/cp_files/seekflood 2000 4
No corruption detected
NOTE: Performing local cleanup via log_onexit (cleanup)
Test: /home/robn/zfs/tests/zfs-tests/tests/functional/cp_files/cp_stress.ksh (run as root) [00:15] [PASS]
SUCCESS: zpool destroy -f testpool
SUCCESS: rm -rf /var/tmp/testdir
Test: /home/robn/zfs/tests/zfs-tests/tests/functional/cp_files/cleanup (run as root) [00:00] [PASS]

Results Summary
PASS	  3

Running Time:	00:00:15
Percent passed:	100.0%
Log directory:	/var/tmp/test_results/20240625T132626

Tests with results other than PASS that are expected:

Tests with result of PASS that are unexpected:

Tests with results other than PASS that are unexpected:

Anything interesting in your ZTS config? Specifically, I'm wondering about whether or not you're using "real" disks or the default files in /var/tmp. If the latter, is /var/tmp itself backed by ZFS, or UFS?

I'll stick it in a loop for an hour, try a bit harder. If that doesn't turn up anything, I'll make a real pool and put seekflood on a long run.

robn avatar Jun 25 '24 03:06 robn

I ran the test over and over for a few hours (I forgot about it...); no dice. I set it up to run many thousands of files and threads for a good long while; no change there either. Finally, I reran all of that against the built-in OpenZFS in 13.2, which also refused to blow up.

robn@freebsd13:~ $ zfs version
zfs-2.1.9-FreeBSD_g92e0d9d18
zfs-kmod-2.1.9-FreeBSD_g92e0d9d18

robn@freebsd13:~ $ uname -a
FreeBSD freebsd13 13.2-RELEASE-p11 FreeBSD 13.2-RELEASE-p11 GENERIC amd64

So more info needed!

robn avatar Jun 26 '24 02:06 robn

Specifically, I'm wondering about whether or not you're using "real" disks or the default files in /var/tmp. If the latter, is /var/tmp itself backed by ZFS, or UFS?

Just UFS for all of /. I'm using the ZTS defaults, so I assume it's using the default file-backed disks in /var/tmp.

I can still hit this 100% reliably in ZTS, but when I run the same seekflood binary manually to re-create the test by hand, I'm unable to hit the panic. It's very weird.
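
For reference, a rough by-hand recreation of what the test does, per the ZTS log above (the pool/vdev setup mirrors the default ZTS config; the seekflood path is a placeholder for wherever your build tree is):

 # Pool setup as in the ZTS log; md0 is the ZTS file-backed vdev, any
 # scratch disk will do.
 zpool create -f testpool md0
 mkdir /testpool/cp_stress
 cd /testpool/cp_stress
 # seekflood <files> <threads>; the test runs it several times in a row.
 SEEKFLOOD=/path/to/zfs/tests/zfs-tests/tests/functional/cp_files/seekflood
 $SEEKFLOOD 2000 4
 $SEEKFLOOD 2000 4
 $SEEKFLOOD 2000 4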

tonyhutter avatar Jun 27 '24 00:06 tonyhutter

I can't reproduce the failing cp_stress tests on my latest FreeBSD runs:

  • https://github.com/mcmilk/zfs/actions/runs/10226432155
  • https://github.com/mcmilk/zfs/actions/runs/10226053058

mcmilk avatar Aug 04 '24 13:08 mcmilk

I can reproduce the same panic using devel/fsx instead of cp_stress, on FreeBSD 15.0-CURRENT. In a VM I created a pool backed by a single virtio disk, with recsize set to 64k and compression set to lz4. Then I used this fsx.toml file:

# Large enough that, given a 64k recsize, there will be at least one L2 block
flen=1073741824

[opsize]
max = 1048576

[weights]
mapread = 1
mapwrite = 1
invalidate = 0
truncate = 0
punch_hole = 1
sendfile = 1
copy_file_range = 1

Then I started four simultaneous instances, each operating on a different file in its own tmux pane, with fsx -vf fsx.toml fsx.1.bin (a rough sketch of the full setup follows the panic output below). I left it running for hours and came back to this panic:

panic: (link->list_next == NULL) is equivalent to (link->list_prev == NULL)
cpuid = 9
time = 1742750674
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0159d1a750
vpanic() at vpanic+0x136/frame 0xfffffe0159d1a880
spl_panic() at spl_panic+0x3a/frame 0xfffffe0159d1a8e0
list_link_active() at list_link_active+0x4f/frame 0xfffffe0159d1a900
dnode_is_dirty() at dnode_is_dirty+0x95/frame 0xfffffe0159d1a920
dmu_offset_next() at dmu_offset_next+0x52/frame 0xfffffe0159d1a960
zfs_holey() at zfs_holey+0xff/frame 0xfffffe0159d1a9c0
zfs_freebsd_ioctl() at zfs_freebsd_ioctl+0x76/frame 0xfffffe0159d1aa00
vn_generic_copy_file_range() at vn_generic_copy_file_range+0x522/frame 0xfffffe0159d1ac20
vn_copy_file_range() at vn_copy_file_range+0x28f/frame 0xfffffe0159d1ad00
kern_copy_file_range() at kern_copy_file_range+0x2b9/frame 0xfffffe0159d1adb0
sys_copy_file_range() at sys_copy_file_range+0x78/frame 0xfffffe0159d1ae00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe0159d1af30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0159d1af30
--- syscall (569, FreeBSD ELF64, copy_file_range), rip = 0x24530942055a, rsp = 0x245306e764c8, rbp = 0x40000000 ---
KDB: enter: panic
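
For reference, a rough sketch of that setup (disk and pool names here are placeholders; recordsize, compression, and the fsx invocation follow the description above, with the instances simply backgrounded instead of split across tmux panes):

 # Single-vdev pool with the properties described above.
 zpool create testpool vtbd1
 zfs set recordsize=64k compression=lz4 testpool
 # Four concurrent devel/fsx instances, each on its own file, driven by
 # the fsx.toml above.
 cd /testpool
 for i in 1 2 3 4; do
     fsx -vf fsx.toml fsx.$i.bin &
 done
 wait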

asomers avatar Mar 24 '25 14:03 asomers