panic on zfs receive
System information
Type | Version/Name |
---|---|
Distribution Name | Ubuntu |
Distribution Version | 20.04 |
Linux Kernel | 5.4.0-42-generic #46-Ubuntu |
Architecture | x86_64 |
ZFS Version | 0.8.4-1~20.04.york0 |
SPL Version | 0.8.4-1~20.04.york0 |
Describe the problem you're observing
I am encountering a ZFS PANIC during a zfs recv that hangs the pool until the system is restarted. I initially hit this problem with Ubuntu 20.04's official ZFS (0.8.3) and then upgraded to 0.8.4 to see whether the problem persists. The PANIC occurs in both versions. I used the pre-built packages from this PPA: https://launchpad.net/~jonathonf/+archive/ubuntu/zfs
Describe how to reproduce the problem
The dry-run (-n) output of the zfs send command:
# /sbin/zfs send -P -R -v -n rpool/scratch@2020-07-23-03-39-10
full rpool/scratch@2020-07-23-03-39-10 27955560808
size 27955560808
tosnap: '2020-07-23-03-39-10'
fss:
    0x56f7d2ee5ca60cf0:
        name: 'rpool/scratch'
        parentfromsnap: 0
        props:
            snapshot_limit: 18446744073709551614
            mountpoint: 'legacy'
        snaps:
            2020-07-23-03-39-10: 14988460586390450179
        snapprops:
            2020-07-23-03-39-10:
I can issue a zfs send to /dev/null that is successful:
# /sbin/zfs send -P -R rpool/scratch@2020-07-23-03-39-10 | pv -s 27955560808 -p -r -t -a >/dev/null
full rpool/scratch@2020-07-23-03-39-10 27955560808
size 27955560808
0:01:02 [ 430MiB/s] [ 430MiB/s] [==========================] 101%
The exact zfs send command used when the destination crashes:
# /sbin/zfs send -P -R -v rpool/scratch@2020-07-23-03-39-10
The zfs recv command that crashes:
# /sbin/zfs recv -svuF dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch
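For context, the send and receive above are presumably connected by a pipe between the two machines; a hypothetical end-to-end invocation (the ssh transport and the "backup-host" name are assumptions, not taken from the report) would look roughly like:

```sh
# Hypothetical transport; "backup-host" is a placeholder for the receiving machine.
/sbin/zfs send -P -R -v rpool/scratch@2020-07-23-03-39-10 | \
    ssh backup-host /sbin/zfs recv -svuF dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch
```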
Properties of the source dataset:
# zfs get all rpool/scratch@2020-07-23-03-39-10
NAME PROPERTY VALUE SOURCE
rpool/scratch@2020-07-23-03-39-10 type snapshot -
rpool/scratch@2020-07-23-03-39-10 creation Thu Jul 23 3:39 2020 -
rpool/scratch@2020-07-23-03-39-10 used 152M -
rpool/scratch@2020-07-23-03-39-10 referenced 22.1G -
rpool/scratch@2020-07-23-03-39-10 compressratio 1.21x -
rpool/scratch@2020-07-23-03-39-10 devices on default
rpool/scratch@2020-07-23-03-39-10 exec on default
rpool/scratch@2020-07-23-03-39-10 setuid on default
rpool/scratch@2020-07-23-03-39-10 createtxg 9180312 -
rpool/scratch@2020-07-23-03-39-10 xattr sa inherited from rpool
rpool/scratch@2020-07-23-03-39-10 version 5 -
rpool/scratch@2020-07-23-03-39-10 utf8only on -
rpool/scratch@2020-07-23-03-39-10 normalization formD -
rpool/scratch@2020-07-23-03-39-10 casesensitivity sensitive -
rpool/scratch@2020-07-23-03-39-10 nbmand off default
rpool/scratch@2020-07-23-03-39-10 guid 14988460586390450179 -
rpool/scratch@2020-07-23-03-39-10 primarycache all default
rpool/scratch@2020-07-23-03-39-10 secondarycache all default
rpool/scratch@2020-07-23-03-39-10 defer_destroy off -
rpool/scratch@2020-07-23-03-39-10 userrefs 1 -
rpool/scratch@2020-07-23-03-39-10 objsetid 49456 -
rpool/scratch@2020-07-23-03-39-10 mlslabel none default
rpool/scratch@2020-07-23-03-39-10 refcompressratio 1.21x -
rpool/scratch@2020-07-23-03-39-10 written 22.1G -
rpool/scratch@2020-07-23-03-39-10 clones -
rpool/scratch@2020-07-23-03-39-10 logicalreferenced 26.0G -
rpool/scratch@2020-07-23-03-39-10 acltype posixacl inherited from rpool
rpool/scratch@2020-07-23-03-39-10 context none default
rpool/scratch@2020-07-23-03-39-10 fscontext none default
rpool/scratch@2020-07-23-03-39-10 defcontext none default
rpool/scratch@2020-07-23-03-39-10 rootcontext none default
rpool/scratch@2020-07-23-03-39-10 encryption off default
Include any warning/errors/backtraces from the system logs
Relevant dmesg:
[Mon Aug 10 09:30:15 2020] VERIFY3(count >= 0) failed (-1 >= 0)
[Mon Aug 10 09:30:15 2020] PANIC at dsl_dir.c:925:dsl_fs_ss_count_adjust()
[Mon Aug 10 09:30:15 2020] Showing stack for process 10815
[Mon Aug 10 09:30:15 2020] CPU: 0 PID: 10815 Comm: txg_sync Tainted: P OE 5.4.0-42-generic #46-Ubuntu
[Mon Aug 10 09:30:15 2020] Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 06/26/2018
[Mon Aug 10 09:30:15 2020] Call Trace:
[Mon Aug 10 09:30:15 2020] dump_stack+0x6d/0x9a
[Mon Aug 10 09:30:15 2020] spl_dumpstack+0x29/0x2b [spl]
[Mon Aug 10 09:30:15 2020] spl_panic+0xd4/0xfc [spl]
[Mon Aug 10 09:30:15 2020] ? avl_find+0x5f/0x90 [zavl]
[Mon Aug 10 09:30:15 2020] ? dbuf_rele+0x3a/0x40 [zfs]
[Mon Aug 10 09:30:15 2020] ? dmu_buf_rele+0xe/0x10 [zfs]
[Mon Aug 10 09:30:15 2020] ? zap_unlockdir+0x3f/0x60 [zfs]
[Mon Aug 10 09:30:15 2020] ? zap_lookup_norm+0x95/0xc0 [zfs]
[Mon Aug 10 09:30:15 2020] dsl_fs_ss_count_adjust+0x1ad/0x1c0 [zfs]
[Mon Aug 10 09:30:15 2020] dsl_destroy_head_sync_impl+0x35c/0x980 [zfs]
[Mon Aug 10 09:30:15 2020] ? dbuf_rele+0x3a/0x40 [zfs]
[Mon Aug 10 09:30:15 2020] dmu_recv_end_sync+0x1ec/0x5a0 [zfs]
[Mon Aug 10 09:30:15 2020] dsl_sync_task_sync+0xb6/0x100 [zfs]
[Mon Aug 10 09:30:15 2020] dsl_pool_sync+0x3d6/0x4c0 [zfs]
[Mon Aug 10 09:30:15 2020] spa_sync+0x59b/0xf90 [zfs]
[Mon Aug 10 09:30:15 2020] ? spa_txg_history_init_io+0x106/0x110 [zfs]
[Mon Aug 10 09:30:15 2020] txg_sync_thread+0x2be/0x450 [zfs]
[Mon Aug 10 09:30:15 2020] ? txg_thread_exit.isra.0+0x60/0x60 [zfs]
[Mon Aug 10 09:30:15 2020] thread_generic_wrapper+0x79/0x90 [spl]
[Mon Aug 10 09:30:15 2020] kthread+0x104/0x140
[Mon Aug 10 09:30:15 2020] ? __thread_exit+0x20/0x20 [spl]
[Mon Aug 10 09:30:15 2020] ? kthread_park+0x90/0x90
[Mon Aug 10 09:30:15 2020] ret_from_fork+0x22/0x40
I've done plenty of sends and receives between these two systems, but only this particular dataset is causing this problem, so reproducing it elsewhere may be difficult. Is this definitely a problem on the destination side, or could it be a problem with the ZFS stream?
What other information can I provide that would help narrow the problem down? Should the version of ZFS on the source machine matter even though it's the receive that is crashing? The source machine is running kernel 5.4.0-40-generic with the official Ubuntu zfs/spl 0.8.3-1ubuntu12.2.
cc @pcd1193182 @jjelinek
It looks like you are doing a "blow away" receive, where we are receiving a full (non-incremental) stream on top of a filesystem that already exists (dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch).
In this case, we receive into a temporary %recv filesystem, which is deleted when the receive completes. For fs/ss accounting purposes, these filesystems should be ignored, but that does not seem to be happening here: we try to decrement the fs count but it's already zero. So we need to figure out why it isn't ignoring this filesystem.
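A minimal sketch of that receive pattern, with made-up pool and dataset names (not the reporter's exact setup), just to illustrate where the temporary %recv child comes from:

```sh
# A full (non-incremental) -R stream received with -F on top of an existing filesystem.
zfs create tank/src
zfs snapshot tank/src@snap1
zfs create -p backup/dst                         # destination filesystem already exists
zfs send -R tank/src@snap1 | zfs recv -svuF backup/dst
# While the receive runs, the data lands in a hidden temporary backup/dst/%recv child,
# which is clone-swapped into place and then destroyed when the receive completes;
# that final destroy is where the fs/ss count adjustment discussed here happens.
```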
Any chance you can provide a crash dump?
If you haven't configured the system to reboot when zfs panics, you might be able to get the output of cat /proc/spl/kstat/zfs/dbgmsg (at least the end of it, starting from when you do the receive), which might also be helpful.
> If you haven't configured the system to reboot when zfs panics, you might be able to get the output of cat /proc/spl/kstat/zfs/dbgmsg (at least the end of it, starting from when you do the receive), which might also be helpful.
This must generally be enabled first by toggling debug flags (on Linux) before reproducing the issue. They might also want to bump the maximum size of the log. @ahrens, can you suggest some debug flags they might want to enable before re-testing, to save time?
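For reference, a sketch of the knobs I believe are meant here (module parameter names as of 0.8.x/2.0; the exact zfs_flags mask worth enabling is left as an open question for the devs):

```sh
# Turn on the in-memory ZFS debug log and enlarge it before reproducing the issue.
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo $((64*1024*1024)) > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize   # default is only a few MiB
# Optional extra debug categories go in zfs_flags (bit values are defined in zfs_debug.h):
# echo <mask> > /sys/module/zfs/parameters/zfs_flags
echo 0 > /proc/spl/kstat/zfs/dbgmsg      # clear the log right before running the zfs recv
# ... reproduce the panic ...
tail -n 200 /proc/spl/kstat/zfs/dbgmsg
```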
I noticed that in the NOOP send, there's the line:
snapshot_limit: 18446744073709551614
The destination dataset has that value:
# zfs get snapshot_limit -p dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch
NAME PROPERTY VALUE SOURCE
dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch snapshot_limit 18446744073709551614 local
As does the source:
# /home/jeremy/documents/migrate-zfs-system # zfs get snapshot_limit -p rpool/scratch
NAME PROPERTY VALUE SOURCE
rpool/scratch snapshot_limit 18446744073709551614 local
I set that value on the source dataset a while back to the maximum value I could (2^64 - 2). So maybe there's an off-by-one bug or something...
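As a side note, 18446744073709551614 is 2^64 - 2, i.e. one below UINT64_MAX. If I understand the property handling correctly (this is an assumption), UINT64_MAX is the internal "none"/no-limit value, so any smaller value, including this one, activates the fs/ss count tracking on the dataset. Checking and clearing it is straightforward (no claim that this avoids the panic; lowering the limit did not help, as noted below):

```sh
# Show the raw value; 18446744073709551614 == 2^64 - 2 (a value of "none" would mean no limit).
zfs get -p snapshot_limit rpool/scratch
# Clear the locally-set limit back to the default "none":
zfs inherit snapshot_limit rpool/scratch
```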
Since the dmesg line mentions a function (dsl_fs_ss_count_adjust) that appears to be related to the snapshot count:
[Mon Aug 10 09:30:15 2020] PANIC at dsl_dir.c:925:dsl_fs_ss_count_adjust()
... it looks like this might be related to that property. The comment block in dsl_dir.c says:
The filesystem and snapshot limits are validated by dsl_fs_ss_limit_check() and updated by dsl_fs_ss_count_adjust(). A new limit value is setup in dsl_dir_activate_fs_ss_limit() and the counts are adjusted, if necessary, by dsl_dir_init_fs_ss_count().
Where it PANICKED:
    /*
     * If we hit an uninitialized node while recursing up the tree, we can
     * stop since we know the counts are not valid on this node and we
     * know we shouldn't touch this node's counts. An uninitialized count
     * on the node indicates that either the feature has not yet been
     * activated or there are no limits on this part of the tree.
     */
    if (!dsl_dir_is_zapified(dd) || (err = zap_lookup(os, dd->dd_object,
        prop, sizeof (count), 1, &count)) == ENOENT)
        return;
    VERIFY0(err);

    count += delta;
    /* Use a signed verify to make sure we're not neg. */
    VERIFY3S(count, >=, 0);    /* PANICKED HERE */
I bet it's something there...
I can still provide dbgmsg output if you like, but I'm out of gas for the moment, as I've been trying a few things to work around this. I do know that lowering the snapshot_limit values on the source and destination datasets before the send/recv did not make a difference.
Ahh, what the heck: Here's the only dbgmsg output from the recv command:
1597125475 spa_history.c:309:spa_history_log_sync(): txg 2828496 resume receive dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch/%recv (id 530023)
1597125475 spa_history.c:309:spa_history_log_sync(): txg 2828497 finish receiving dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch/%recv (id 530023) snap=2020-07-23-03-39-10
1597125475 spa_history.c:309:spa_history_log_sync(): txg 2828497 clone swap dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch/%recv (id 530023) parent=scratch
1597125475 spa_history.c:309:spa_history_log_sync(): txg 2828497 snapshot dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch@2020-07-23-03-39-10 (id 562365)
1597125475 spa_history.c:309:spa_history_log_sync(): txg 2828497 destroy dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch/%recv (id 530023)
This also crashes when sending to an entirely new dataset on a different pool, different server (5.4.0-40-generic / 0.8.3-1ubuntu12), even after reducing the snapshot_limit to 9999999 (but this time it sends all the data before crashing).
I think that the panic occurs when destroying the %recv dataset (the last line in the dbgmsg). In dsl_fs_ss_count_adjust() we have this code, which should cause it to bail out before the assertion that we hit:
    /*
     * When we receive an incremental stream into a filesystem that already
     * exists, a temporary clone is created. We don't count this temporary
     * clone, whose name begins with a '%'. We also ignore hidden ($FREE,
     * $MOS & $ORIGIN) objsets.
     */
    if ((dd->dd_myname[0] == '%' || dd->dd_myname[0] == '$') &&
        strcmp(prop, DD_FIELD_FILESYSTEM_COUNT) == 0)
        return;
Hmm. So the assertion fails either when dsl_fs_ss_count_adjust() was called for the %recv dataset with prop being DD_FIELD_SNAPSHOT_COUNT, or when it was called for some other dataset, right? Could it be that the snapshot count for %recv was somehow not incremented after the receive but before the call to dsl_fs_ss_count_adjust()?
What's the best way for me to get more details on what's happening? I can run bpftrace if someone can point me to the probes to monitor.
You could use bpftrace to print out the stack, dd_myname, and prop when dsl_fs_ss_count_adjust() is called.
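A rough bpftrace sketch of that (assumes bpftrace is installed and the zfs module's symbols are visible to kprobes; the argument order is taken from my reading of the 0.8/2.0 sources, and without the module's struct layout/BTF dd_myname can't easily be dereferenced, so this prints delta, prop, and the kernel stack instead):

```sh
sudo bpftrace -e '
kprobe:dsl_fs_ss_count_adjust
{
    /* assumed args: dsl_dir_t *dd, int64_t delta, const char *prop, dmu_tx_t *tx */
    printf("delta=%d prop=%s\n", arg1, str(arg2));
    printf("%s\n", kstack);
}'
```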
I ran into another problem (#10787) that's kept me from further troubleshooting this. Also, I couldn't find how to print the stack trace and those vars; can anyone help with exactly how to do that?
I was hoping this would have been resolved by some changes in the 2.0 release, but I am still encountering what appears to be the same PANIC when attempting to destroy a dataset. This is on a production system running zfs on root so when this PANIC happens, I have to reboot the entire server. I will try to grab more details and report back...
Type | Version/Name |
---|---|
Distribution Name | Ubuntu |
Distribution Version | 20.04 |
Linux Kernel | 5.4.0-64-generic #72-Ubuntu |
Architecture | x86_64 |
ZFS Version | 2.0.1-0york0~18.04 |
SPL Version | 2.0.1-0york0~18.04 |
root@myhost-mirror ~ # zfs destroy -r -v data/myhost-main/data/backupdata/veeam/repos/repo_01_old
will destroy data/myhost-main/data/backupdata/veeam/repos/repo_01_old/%recv
root@myhost-mirror ~ # tail /var/log/syslog
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.505901] VERIFY3(count >= 0) failed (-1 >= 0)
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.510578] PANIC at dsl_dir.c:929:dsl_fs_ss_count_adjust()
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516209] Showing stack for process 83858
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516212] CPU: 58 PID: 83858 Comm: txg_sync Tainted: P OE 5.4.0-64-generic #72-Ubuntu
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516213] Hardware name: iXsystems IXR4224-E2C-1280-IXN/X10DRH-CLN4, BIOS 3.2 11/21/2019
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516214] Call Trace:
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516234] dump_stack+0x6d/0x9a
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516245] spl_dumpstack+0x29/0x2b [spl]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516249] spl_panic+0xd4/0xfc [spl]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516259] ? avl_find+0x5f/0x90 [zavl]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516321] ? dbuf_rele+0x3d/0x50 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516370] ? dmu_buf_rele+0xe/0x10 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516431] ? zap_unlockdir+0x3f/0x60 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516491] ? zap_lookup_norm+0x95/0xc0 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516537] dsl_fs_ss_count_adjust+0x1a8/0x1c0 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516592] dsl_destroy_head_sync_impl+0x812/0xe00 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516634] dsl_destroy_head_sync+0x5d/0xc0 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516682] dsl_sync_task_sync+0xb6/0x100 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516725] dsl_pool_sync+0x3d4/0x4c0 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516784] spa_sync+0x5a3/0x1000 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516790] ? mutex_lock+0x13/0x40
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516852] ? spa_txg_history_init_io+0x106/0x110 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516905] txg_sync_thread+0x2c6/0x460 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516960] ? txg_thread_exit.isra.0+0x60/0x60 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516967] thread_generic_wrapper+0x79/0x90 [spl]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516972] kthread+0x104/0x140
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516976] ? __thread_exit+0x20/0x20 [spl]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516980] ? kthread_park+0x90/0x90
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516983] ret_from_fork+0x35/0x40
Dataset details:
root@myhost-mirror ~ # cat /sys/module/zfs/parameters/zfs_dbgmsg_enable
1
root@myhost-mirror ~ # zfs get all data/myhost-main/data/backupdata/veeam/repos/repo_01_old
NAME PROPERTY VALUE SOURCE
data/myhost-main/data/backupdata/veeam/repos/repo_01_old type volume -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old creation Mon Oct 5 22:29 2020 -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old used 10.8T -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old available 54.2T -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old referenced 9.52T -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old compressratio 1.00x -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old reservation 64T received
data/myhost-main/data/backupdata/veeam/repos/repo_01_old volsize 56T local
data/myhost-main/data/backupdata/veeam/repos/repo_01_old volblocksize 1M -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old checksum on default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old compression off default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old readonly off default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old createtxg 8608789 -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old copies 1 default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old refreservation none received
data/myhost-main/data/backupdata/veeam/repos/repo_01_old guid 16399445975178063768 -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old primarycache all default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old secondarycache all default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old usedbysnapshots 941G -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old usedbydataset 9.52T -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old usedbychildren 326G -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old usedbyrefreservation 0B -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old logbias latency default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old objsetid 1743950 -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old dedup off default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old mlslabel none default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old sync always received
data/myhost-main/data/backupdata/veeam/repos/repo_01_old refcompressratio 1.00x -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old written 0 -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old logicalused 11.3T -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old logicalreferenced 9.97T -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old volmode default default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old snapshot_limit none default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old snapshot_count 22 local
data/myhost-main/data/backupdata/veeam/repos/repo_01_old snapdev hidden default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old context none default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old fscontext none default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old defcontext none default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old rootcontext none default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old redundant_metadata all default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old receive_resume_token 1-17920e7c1f-168-789c636064000310a501c49c50360710a715e5e7a69766a630404164fcbe531785f79f5700b2d991d4e52765a5269740f82080219f96569c5a0296f9ffdf0422cf86249f5459925a0ca425b6bc730fc5a2bf241fe28ade9b7758d9e2dfd94520c97382e5f3127353191852124b12f5931293b34b0bc0ccb2d4d4c45cfda2d482fc6230196f60e850549a97979997ae6b646064a06b68a06b60a66b68a86b0a641882ede5654084474e62517a6a524e7e72767e36484c02ea1e987c6a6e526a0a500aa48f1b493c393fb7a028b5b818a20b02007d2a3ec7 -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old encryption off default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old keylocation none default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old keyformat none default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old pbkdf2iters 0 default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old zrepl:placeholder on inherited from data/myhost-main/data/backupdata
The PANIC does not output anything to /proc/spl/kstat/zfs/dbgmsg, although that file does have some output from other things beforehand, so it appears to be enabled.
The originally-reported issue should have been resolved by https://github.com/openzfs/zfs/pull/10791, which is fixed in 2.0. It's possible that if your filesystem (data/myhost-main/data/backupdata/veeam/repos/repo_01_old) was created on an earlier release, you could still hit this when destroying it.
I also tried aborting the send:
root@myhost-mirror ~ # zfs recv -A data/myhost-main/data/backupdata/veeam/repos/repo_01_old
Message from syslogd@myhost-mirror at Jan 26 16:01:26 ...
kernel:[ 534.938981] VERIFY3(count >= 0) failed (-1 >= 0)
Message from syslogd@myhost-mirror at Jan 26 16:01:26 ...
kernel:[ 534.943665] PANIC at dsl_dir.c:929:dsl_fs_ss_count_adjust()
From dmesg:
[Tue Jan 26 16:01:26 2021] VERIFY3(count >= 0) failed (-1 >= 0)
[Tue Jan 26 16:01:26 2021] PANIC at dsl_dir.c:929:dsl_fs_ss_count_adjust()
[Tue Jan 26 16:01:26 2021] Showing stack for process 80062
[Tue Jan 26 16:01:26 2021] CPU: 21 PID: 80062 Comm: txg_sync Tainted: P OE 5.4.0-64-generic #72-Ubuntu
[Tue Jan 26 16:01:26 2021] Hardware name: iXsystems IXR4224-E2C-1280-IXN/X10DRH-CLN4, BIOS 3.2 11/21/2019
[Tue Jan 26 16:01:26 2021] Call Trace:
[Tue Jan 26 16:01:26 2021] dump_stack+0x6d/0x9a
[Tue Jan 26 16:01:26 2021] spl_dumpstack+0x29/0x2b [spl]
[Tue Jan 26 16:01:26 2021] spl_panic+0xd4/0xfc [spl]
[Tue Jan 26 16:01:26 2021] ? avl_find+0x5f/0x90 [zavl]
[Tue Jan 26 16:01:26 2021] ? dbuf_rele+0x3d/0x50 [zfs]
[Tue Jan 26 16:01:26 2021] ? dmu_buf_rele+0xe/0x10 [zfs]
[Tue Jan 26 16:01:26 2021] ? zap_unlockdir+0x3f/0x60 [zfs]
[Tue Jan 26 16:01:26 2021] ? zap_lookup_norm+0x95/0xc0 [zfs]
[Tue Jan 26 16:01:26 2021] dsl_fs_ss_count_adjust+0x1a8/0x1c0 [zfs]
[Tue Jan 26 16:01:26 2021] dsl_destroy_head_sync_impl+0x812/0xe00 [zfs]
[Tue Jan 26 16:01:26 2021] dsl_destroy_head_sync+0x5d/0xc0 [zfs]
[Tue Jan 26 16:01:26 2021] dsl_sync_task_sync+0xb6/0x100 [zfs]
[Tue Jan 26 16:01:26 2021] dsl_pool_sync+0x3d4/0x4c0 [zfs]
[Tue Jan 26 16:01:26 2021] spa_sync+0x5a3/0x1000 [zfs]
[Tue Jan 26 16:01:26 2021] ? mutex_lock+0x13/0x40
[Tue Jan 26 16:01:26 2021] ? spa_txg_history_init_io+0x106/0x110 [zfs]
[Tue Jan 26 16:01:26 2021] txg_sync_thread+0x2c6/0x460 [zfs]
[Tue Jan 26 16:01:26 2021] ? txg_thread_exit.isra.0+0x60/0x60 [zfs]
[Tue Jan 26 16:01:26 2021] thread_generic_wrapper+0x79/0x90 [spl]
[Tue Jan 26 16:01:26 2021] kthread+0x104/0x140
[Tue Jan 26 16:01:26 2021] ? __thread_exit+0x20/0x20 [spl]
[Tue Jan 26 16:01:26 2021] ? kthread_park+0x90/0x90
[Tue Jan 26 16:01:26 2021] ret_from_fork+0x35/0x40
> The originally-reported issue should have been resolved by #10791, which is fixed in 2.0. It's possible that if your filesystem (data/myhost-main/data/backupdata/veeam/repos/repo_01_old) was created on an earlier release, you could still hit this when destroying it.
Yes, it was created in the 0.8.x release. Any idea how I can safely get rid of this dataset then?
If you can remove the VERIFY and compile from source, the destroy might "just work" and not introduce any more problems.
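In case it helps, a rough sketch of what that route looks like (standard OpenZFS from-source build steps; the tag to check out and the decision to disable the assertion are your call, and this is a workaround rather than a fix):

```sh
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.0.1                  # match the version you are currently running
# Edit module/zfs/dsl_dir.c (this code is part of the zfs kernel module, not the userland CLI)
# and comment out the failing VERIFY3S(count, >=, 0) in dsl_fs_ss_count_adjust().
sh autogen.sh
./configure
make -s -j"$(nproc)"
sudo make install && sudo depmod -a     # then reload the zfs modules or reboot into the new build
```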
I have yet to be able to try that, as the system encountering this problem is in production and it's critical that it stay up and running. But I have also now encountered this problem on another system that has hundreds of datasets from which I am unable to destroy snapshots (or the datasets themselves) due to the same problem. And since this server is continuously receiving snapshots from other systems, it is only a matter of time before this pool runs out of space. Yikes! I am now in a race against the clock to solve this, so I will likely be testing this shortly.
But let's say it does work... Shouldn't that change (or a more proper fix for the issue) be made to zfs (not just the manually compiled version on my server) to avoid this problem showing up again in the future for anyone else?
Also, I can't readily tell at the moment, but is that change to dsl_dir.c part of the kernel module or the zfs userland CLI program?
Thanks!
On a system running zfs 0.8.4 I have rebuilt zfs, commenting out the VERIFY statement in question. So far I have been able to destroy such problematic datasets (and snapshots on those datasets) without any visible sign of a problem.
How should we move forward with this information to fix this problem permanently for everyone? I don't want to have to rebuild zfs with this change for every new version and don't want to have to migrate data from problematic datasets to new datasets. Could there be a way to fix the affected datasets in-place?
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.