zfs-2.1.7 zfs recv panic in dsl_dataset_deactivate_feature_impl
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Fedora |
| Distribution Version | 36 |
| Kernel Version | 6.0.10-200.fc36.x86_64 |
| Architecture | Intel x86_64 |
| OpenZFS Version | zfs-2.1.7-1 |
Describe the problem you're observing
When using zfs send to make a backup on a remote machine, the receiver panics in dsl_dataset_deactivate_feature_impl() and the pool deadlocks.
Describe how to reproduce the problem
zfs send -RLec -I snapshot_name | ssh remote_host "zfs receive -duF remote_zvol"
Include any warning/errors/backtraces from the system logs
System Events
=-=-=-=-=-=-=
Dec 2 15:32:14 raid3 kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
Dec 2 15:32:14 raid3 kernel: PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
Dec 2 15:32:14 raid3 kernel: Showing stack for process 1604
Dec 2 15:32:14 raid3 kernel: CPU: 3 PID: 1604 Comm: txg_sync Tainted: P OE 6.0.10-200.fc36.x86_64 #1
Dec 2 15:32:14 raid3 kernel: Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD3H/Z87X-UD3H-CF, BIOS 10b 06/12/2014
Dec 2 15:32:14 raid3 kernel: Call Trace:
Dec 2 15:32:14 raid3 kernel: <TASK>
Dec 2 15:32:14 raid3 kernel: dump_stack_lvl+0x44/0x5c
Dec 2 15:32:14 raid3 kernel: spl_panic+0xf0/0x108 [spl]
Dec 2 15:32:14 raid3 kernel: dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs]
Dec 2 15:32:14 raid3 kernel: dsl_dataset_clone_swap_sync_impl+0x88e/0xa70 [zfs]
Dec 2 15:32:14 raid3 kernel: ? _raw_spin_unlock+0x15/0x30
Dec 2 15:32:14 raid3 kernel: ? dsl_dataset_hold_flags+0xb0/0x230 [zfs]
Dec 2 15:32:14 raid3 kernel: dmu_recv_end_sync+0x180/0x560 [zfs]
Dec 2 15:32:14 raid3 kernel: ? preempt_count_add+0x6a/0xa0
Dec 2 15:32:14 raid3 kernel: dsl_sync_task_sync+0xa5/0xf0 [zfs]
Dec 2 15:32:14 raid3 kernel: dsl_pool_sync+0x3d3/0x4e0 [zfs]
Dec 2 15:32:14 raid3 kernel: spa_sync+0x555/0xf60 [zfs]
Dec 2 15:32:14 raid3 kernel: ? _raw_spin_unlock+0x15/0x30
Dec 2 15:32:14 raid3 kernel: ? spa_txg_history_init_io+0xf3/0x110 [zfs]
Dec 2 15:32:14 raid3 kernel: txg_sync_thread+0x227/0x3e0 [zfs]
Dec 2 15:32:14 raid3 kernel: ? txg_fini+0x260/0x260 [zfs]
Dec 2 15:32:14 raid3 kernel: ? __thread_exit+0x20/0x20 [spl]
Dec 2 15:32:14 raid3 kernel: thread_generic_wrapper+0x57/0x70 [spl]
Dec 2 15:32:14 raid3 kernel: kthread+0xe6/0x110
Dec 2 15:32:14 raid3 kernel: ? kthread_complete_and_exit+0x20/0x20
Dec 2 15:32:14 raid3 kernel: ret_from_fork+0x1f/0x30
Dec 2 15:32:14 raid3 kernel: </TASK>
ZFS is trying to deactivate a feature that is not active (zap_remove() returns ENOENT, the 2 in the VERIFY3 failure) and it very much does not like that.
Do any variations of the send command without -L, -e or -c avoid the panic? Knowing if turning any of them off makes this work would help in debugging this.
The error message points at the large_blocks feature, although it is only enabled, not active, on both the sender and the receiver. The -L flag is probably responsible for triggering this. Could it be related to https://github.com/openzfs/zfs/pull/13782 ?
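For reference, the enabled-vs-active state can be checked on both sides roughly like this (raid2 is my sending pool and raid3 the receiving pool; substitute your own names):
zpool get feature@large_blocks raid2
zpool get feature@large_blocks raid3
Both currently report "enabled" rather than "active".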
I will not be able to try anything until tomorrow. The machine is at the office, it is currently stuck in this state, and I am unable to ssh in.
I can confirm that this also happens to me: zfs core dumps during send/receive.
It happens here when the sender has recordsize 128K and the receiver has recordsize 1M:
# zfs get recordsize zHome/home zstore/data/BACKUP/rakete_home/home
NAME PROPERTY VALUE SOURCE
zHome/home recordsize 128K inherited from zHome
zstore/data/BACKUP/rakete_home/home recordsize 1M inherited from zstore
syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zHome/home zstore/data/BACKUP/rakete_home/home
NEWEST SNAPSHOT: 2022-12-03--23:23
Sending incremental zHome/home@2022-11-29--08:05 ... 2022-12-03--23:23 (~ 34 KB):
0,00 B 0:00:00 [0,00 B/s] [> ] 0%
cannot receive: failed to read from stream
CRITICAL ERROR: zfs send -L -I 'zHome/home'@'2022-11-29--08:05' 'zHome/home'@'2022-12-03--23:23' | mbuffer -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 35264 | zfs receive -s -F 'zstore/data/BACKUP/rakete_home/home' 2>&1 failed: 256 at /usr/bin/syncoid line 817.
I tried various permutations of zfs send flags; here are the results:
zfs send -RLec -I snapshot_name : crashes
zfs send -Rec -I snapshot_name : crashes
zfs send -Rc -I snapshot_name : crashes
zfs send -R -I snapshot_name : works
I use zstd-1 compression on all my datasets; compression is inherited from the top-level dataset. I use the default recordsize for all datasets except for one that uses 8K.
Does zfs send -RLe -I snapshot_name work? What I am really wondering is which flags (just one, I hope) cause the crashes, to narrow the scope that needs to be debugged.
I have left the office now and won't be there for a while. Most of us work from home. Next time someone goes to the office I can try it for you.
I just got the same panic using only zfs send -R -I snapshot_name, so trying send -RLe will not shed any light on the issue. Here is the full stack trace:
[Sun Dec 4 16:52:10 2022] VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
[Sun Dec 4 16:52:10 2022] PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
[Sun Dec 4 16:52:10 2022] Showing stack for process 1627
[Sun Dec 4 16:52:10 2022] CPU: 7 PID: 1627 Comm: txg_sync Tainted: P OE 6.0.10-200.fc36.x86_64 #1
[Sun Dec 4 16:52:10 2022] Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD3H/Z87X-UD3H-CF, BIOS 10b 06/12/2014
[Sun Dec 4 16:52:10 2022] Call Trace:
[Sun Dec 4 16:52:10 2022] <TASK>
[Sun Dec 4 16:52:10 2022] dump_stack_lvl+0x44/0x5c
[Sun Dec 4 16:52:10 2022] spl_panic+0xf0/0x108 [spl]
[Sun Dec 4 16:52:10 2022] dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs]
[Sun Dec 4 16:52:10 2022] dsl_dataset_clone_swap_sync_impl+0x88e/0xa70 [zfs]
[Sun Dec 4 16:52:10 2022] ? _raw_spin_unlock+0x15/0x30
[Sun Dec 4 16:52:10 2022] ? dsl_dataset_hold_flags+0xb0/0x230 [zfs]
[Sun Dec 4 16:52:10 2022] dmu_recv_end_sync+0x180/0x560 [zfs]
[Sun Dec 4 16:52:10 2022] ? preempt_count_add+0x6a/0xa0
[Sun Dec 4 16:52:10 2022] dsl_sync_task_sync+0xa5/0xf0 [zfs]
[Sun Dec 4 16:52:10 2022] dsl_pool_sync+0x3d3/0x4e0 [zfs]
[Sun Dec 4 16:52:10 2022] spa_sync+0x555/0xf60 [zfs]
[Sun Dec 4 16:52:10 2022] ? _raw_spin_unlock+0x15/0x30
[Sun Dec 4 16:52:10 2022] ? spa_txg_history_init_io+0xf3/0x110 [zfs]
[Sun Dec 4 16:52:10 2022] txg_sync_thread+0x227/0x3e0 [zfs]
[Sun Dec 4 16:52:10 2022] ? txg_fini+0x260/0x260 [zfs]
[Sun Dec 4 16:52:10 2022] ? __thread_exit+0x20/0x20 [spl]
[Sun Dec 4 16:52:10 2022] thread_generic_wrapper+0x57/0x70 [spl]
[Sun Dec 4 16:52:10 2022] kthread+0xe6/0x110
[Sun Dec 4 16:52:10 2022] ? kthread_complete_and_exit+0x20/0x20
[Sun Dec 4 16:52:10 2022] ret_from_fork+0x1f/0x30
[Sun Dec 4 16:52:10 2022] </TASK>
I was afraid that might happen. Thanks for confirming it.
On Sunday, 04.12.2022 at 05:38 -0800, George Amanakis wrote:
As far as I can see #13782 is not included in 2.1.7. Is it possible to retry with current master or cherry-pick that commit onto 2.1.7 and try again?
This patch is already part of zfs 2.1.7
See at the bottom of this page: https://github.com/openzfs/zfs/compare/zfs-2.1.6...zfs-2.1.7
Is this reproducible on 2.1.6?
No, I've been running 2.1.6 with zfs send -RLec since its release. The script has run unchanged for about 5 years and I have never experienced this issue. It happened immediately after I updated to zfs 2.1.7.
Could you try reverting c8d2ab05e1e6bf21185723fabdebbe5bb3374381 to see if that makes it stop? I do not see anything else that changed in 2.1.7 that could possibly be related to this.
Thanks, I will try reverting https://github.com/openzfs/zfs/commit/c8d2ab05e1e6bf21185723fabdebbe5bb3374381. It might be a couple of days before I can try it.
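Roughly what I have in mind for the revert, as a sketch only (it assumes an in-tree build from the zfs-2.1.7 tag rather than my packaged install):
git clone https://github.com/openzfs/zfs.git && cd zfs
git checkout zfs-2.1.7
git revert c8d2ab05e1e6bf21185723fabdebbe5bb3374381
sh autogen.sh && ./configure && make -s -j$(nproc)
and then load the rebuilt modules in place of the packaged ones before re-running the send.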
I've reverted https://github.com/openzfs/zfs/commit/c8d2ab05e1e6bf21185723fabdebbe5bb3374381 and tried the zfs send again. Unfortunately my pool seems to be in a weird state and the send fails with lots of errors about snapshots not existing. After the failure, on the receiving end the target dataset had been renamed to xxxrecv-xxxxx-1. I tried renaming the dataset back to raid3/raid2, but each time I ran the zfs send it was renamed to raid3/raid2recv-xxxxx-1 again.
I am running a pool scrub and will try again once it has finished.
The sending pool is raid2 and it is received into raid3/raid2 on the receiving end.
All the snapshots mentioned in the errors below exist on the receiving end; they've just been renamed to raid3/raid2recv-65018-1.
Here is an extract of some of the errors.
cannot open 'raid3/raid2/[email protected]': dataset does not exist
cannot open 'raid3/raid2/git-migration': dataset does not exist
cannot receive: specified fs (raid3/raid2) does not exist
attempting destroy raid3/raid2
failed - trying rename raid3/raid2 to raid3/raid2recv-65018-1
local fs raid3/raid2/home does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/git does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/databases does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/products does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
I did a full test again, but reverting the patch did not help. At least, from what I found, reverting this patch prevents a complete freeze of the pool, but I still see core dumps during send/receive.
This is the patch I am reverting:
From 1af2632e439156f5edd404b35e5c9e444b607733 Mon Sep 17 00:00:00 2001
From: George Amanakis <[email protected]>
Date: Tue, 30 Aug 2022 22:15:56 +0200
Subject: [PATCH] Fix setting the large_block feature after receiving a
snapshot
We are not allowed to dirty a filesystem when done receiving
a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
not be set on that filesystem since the filesystem is not on
dp_dirty_datasets, and a subsequent encrypted raw send will fail.
Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature
needs to be activated and do so if appropriate.
Signed-off-by: George Amanakis <[email protected]>
---
module/zfs/dsl_dataset.c | 15 ++++
tests/runfiles/common.run | 2 +-
tests/zfs-tests/tests/Makefile.am | 1 +
.../rsend/send_raw_large_blocks.ksh | 78 +++++++++++++++++++
4 files changed, 95 insertions(+), 1 deletion(-)
create mode 100755 tests/zfs-tests/tests/functional/rsend/send_raw_large_blocks.ksh
diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index c7577fc584a..4da4effca60 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1760,6 +1760,21 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname,
}
}
+ /*
+ * We are not allowed to dirty a filesystem when done receiving
+ * a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
+ * not be set and a subsequent encrypted raw send will fail. Hence
+ * activate this feature if needed here.
+ */
+ for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
+ if (zfeature_active(f, ds->ds_feature_activation[f]) &&
+ !(zfeature_active(f, ds->ds_feature[f]))) {
+ dsl_dataset_activate_feature(dsobj, f,
+ ds->ds_feature_activation[f], tx);
+ ds->ds_feature[f] = ds->ds_feature_activation[f];
+ }
+ }
+
ASSERT3U(ds->ds_prev != 0, ==,
dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);
if (ds->ds_prev) {
With stock zfs 2.1.7 (incl. this patch) my pool completely freezes when doing send / receive:
syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zHome/home zstore/data/BACKUP/rakete_home/home
NEWEST SNAPSHOT: 2022-12-09--10:47
Sending incremental zHome/home@2022-12-09--10:32 ... 2022-12-09--10:47 (~ 4 KB):
0,00 B 0:00:00 [0,00 B/s] [> ] 0%
cannot receive: failed to read from stream
CRITICAL ERROR: zfs send -L -I 'zHome/home'@'2022-12-09--10:32' 'zHome/home'@'2022-12-09--10:47' | mbuffer -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 4720 | zfs receive -s -F 'zstore/data/BACKUP/rakete_home/home' 2>&1 failed: 256 at /usr/bin/syncoid line 817.
gives the following in the log:
Dez 09 10:35:47 rakete kernel: INFO: task txg_sync:3044 blocked for more than 122 seconds.
Dez 09 10:35:47 rakete kernel: Tainted: P OE 6.0.11-zen1-1-zen #1
Dez 09 10:35:47 rakete kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dez 09 10:35:47 rakete kernel: task:txg_sync state:D stack: 0 pid: 3044 ppid: 2 flags:0x00004000
Dez 09 10:35:47 rakete kernel: Call Trace:
Dez 09 10:35:47 rakete kernel: <TASK>
Dez 09 10:35:47 rakete kernel: __schedule+0xb43/0x1350
Dez 09 10:35:47 rakete kernel: schedule+0x5e/0xd0
Dez 09 10:35:47 rakete kernel: spl_panic+0x10a/0x10c [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel: dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_dataset_clone_swap_sync_impl+0x90b/0xe30 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_dataset_rollback_sync+0x109/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_sync_task_sync+0xac/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_pool_sync+0x40d/0x5c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: spa_sync+0x56c/0xf90 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: ? spa_txg_history_init_io+0x193/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: txg_sync_thread+0x22b/0x3f0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: ? txg_wait_open+0xf0/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: ? __thread_exit+0x20/0x20 [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel: thread_generic_wrapper+0x5e/0x70 [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel: kthread+0xde/0x110
Dez 09 10:35:47 rakete kernel: ? kthread_complete_and_exit+0x20/0x20
Dez 09 10:35:47 rakete kernel: ret_from_fork+0x22/0x30
Dez 09 10:35:47 rakete kernel: </TASK>
I see multiple of those blocked task messages for various tasks. The pool is then frozen. Reboot takes ages and I had to force reboot with REISUB.
When I reverted the patch I did not experience freezes, but I saw the same error and core dumps. Unfortunately I cannot provide a core dump, because my system says "Resource limits disable core dumping for process 82335 (zfs)" and I do not know why that is.
Also interesting: this error and core dump do not happen every time. Approximately 1 out of 4 send/receive runs completes fine.
And just to be clear: this is a new bug in 2.1.7. I have been using this backup process with syncoid for many years without any problem. Reverting back to zfs 2.1.6 gives me a stable system again.
Could you do a git bisect with good being 2.1.6 and bad 2.1.7?
I could do that, but I need some help. I am not a programmer or git expert, so I need guidance on this.
I have a couple of questions to start with:
- Do I need to do the bisect for zfs-dkms and zfs-utils, or just for zfs-utils? (I am on EndeavourOS)
- How could I incorporate the git bisect into the PKGBUILD process so that I get a new package for each bisect step? Or should I just compile the source separately and then only use the resulting binaries?
- What would be the last good commit to use?
The straightforward way would be to compile manually (a shell sketch of the loop follows the steps):
1. git bisect start
2. git bisect good zfs-2.1.6
3. git bisect bad zfs-2.1.7
4. compile
5. load the modules with "./scripts/zfs.sh -v" and set the PATH to the main OpenZFS source directory
6. test, then enter "git bisect good" or "git bisect bad" accordingly
7. go to step 4
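As a shell sketch of that loop (assuming you build in the source tree and test against a scratch pool rather than production data, and unload the modules between iterations):
git bisect start
git bisect good zfs-2.1.6
git bisect bad zfs-2.1.7
# for each revision git bisect checks out:
sh autogen.sh && ./configure && make -s -j$(nproc)
./scripts/zfs.sh -v                 # step 5: load the freshly built modules
export PATH="$(pwd):$PATH"          # step 5: put the in-tree zfs/zpool first
# run the failing send/receive, then record the result:
git bisect good                     # or: git bisect bad, if it panicked
# unload with ./scripts/zfs.sh -vu and repeat until git bisect names the first bad commit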
Step 5 does not work for me. I have the module loaded and I have active pools. How do I reload the zfs module in a running live environment? And what do you mean by "set the PATH to the main OpenZFS source directory"?
./scripts/zfs.sh -vu should unload the modules. Make sure you have exported all pools.
I have done some comprehensive testing after reverting https://github.com/openzfs/zfs/commit/c8d2ab05e1e6bf21185723fabdebbe5bb3374381 and can report that I am no longer getting the zfs recv panic in dsl_dataset_deactivate_feature_impl. I tried lots of incremental sends using the full zfs send -RLec and also a full send. That definitely fixed it. @mabod your core dump issue is probably related to another commit.
@peterska Would you be willing to try a possible fix?
@gamanakis Let me review the patch and if it looks ok to me I will test it. Will not be able to test it until Monday Sydney time.
./scripts/zfs.sh -vu should unload the modules. Make sure you have exported all pools.
I exported all pools. I rebooted. I verified with the 'zpool status' command that no pools are imported. But when I execute "./scripts/zfs.sh -vu" it fails, saying "module is still in use". Why is that?
@mabod your coredump issue is probably related to another commit.
Yes, it must be. I checked with the reverted patch again: the core dump definitely still happens. Maybe it has to do with encryption; the datasets are encrypted.
@peterska I am thinking that in your case the code inserted in c8d2ab0 runs in open context, which shouldn't be the case. I am trying to reproduce this on 2.1.7 with no success. This is my latest iteration:
log_must zpool create -f pool_lb $TESTDIR/vdev_a

log_must zfs create pool_lb/fs
log_must dd if=/dev/urandom of=$file bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap1

log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap2
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap3
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap4
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap5
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap6
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap7
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap8
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap9

log_must eval "zfs send -RLec pool_lb/fs@snap5 > $backup"
log_must eval "zfs recv pool_lb/testfs5 < $backup"

log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
log_must zfs snapshot pool_lb/testfs5@snap10
log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
log_must zfs snapshot pool_lb/testfs5@snap20
log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
log_must zfs snapshot pool_lb/testfs5@snap30
log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
log_must zfs snapshot pool_lb/testfs5@snap40
log_must eval "zfs send -RLec -I @snap5 pool_lb/fs@snap9 > $backup"
log_must eval "zfs recv -F pool_lb/testfs5 < $backup"

log_must eval "zfs send -RLec pool_lb/testfs5@snap9 > $raw_backup"
The patch I was thinking about is the following:
diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index 71d876cae..da49ca94b 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1753,13 +1753,17 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname,
* not be set and a subsequent encrypted raw send will fail. Hence
* activate this feature if needed here.
*/
- for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
- if (zfeature_active(f, ds->ds_feature_activation[f]) &&
- !(zfeature_active(f, ds->ds_feature[f]))) {
- dsl_dataset_activate_feature(dsobj, f,
- ds->ds_feature_activation[f], tx);
- ds->ds_feature[f] = ds->ds_feature_activation[f];
+ if (dmu_tx_is_syncing(tx)) {
+ for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
+ if (zfeature_active(f, ds->ds_feature_activation[f]) &&
+ !(zfeature_active(f, ds->ds_feature[f]))) {
+ dsl_dataset_activate_feature(dsobj, f,
+ ds->ds_feature_activation[f], tx);
+ ds->ds_feature[f] = ds->ds_feature_activation[f];
+ }
}
+ } else {
+ cmn_err(CE_NOTE, "not syncing context!");
}
ASSERT3U(ds->ds_prev != 0, ==,
This runs the code inserted in c8d2ab0 only in syncing context; otherwise it dumps a note in dmesg. It would help if you could come up with a script reproducing the bug in a pool I have access to, or if you could run the patch above.
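If you want to try it, the diff above can be applied to a 2.1.7 checkout along these lines (the patch file name is arbitrary, and the whitespace may need fixing up if the web rendering mangled it):
git checkout zfs-2.1.7
git apply --check syncing-context.patch    # dry run first
git apply syncing-context.patch
then rebuild and reload the modules as before.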
@gamanakis the patch looks ok to me. I will try it later today and let you know the results.
To help you reproduce the panic, here are the non-default settings of my pool and dataset:
pool:
NAME PROPERTY VALUE SOURCE
raid2 ashift 12 local
raid2 autotrim on local
raid2 feature@async_destroy enabled local
raid2 feature@empty_bpobj active local
raid2 feature@lz4_compress active local
raid2 feature@multi_vdev_crash_dump enabled local
raid2 feature@spacemap_histogram active local
raid2 feature@enabled_txg active local
raid2 feature@hole_birth active local
raid2 feature@extensible_dataset active local
raid2 feature@embedded_data active local
raid2 feature@bookmarks enabled local
raid2 feature@filesystem_limits enabled local
raid2 feature@large_blocks enabled local
raid2 feature@large_dnode enabled local
raid2 feature@sha512 enabled local
raid2 feature@skein enabled local
raid2 feature@edonr enabled local
raid2 feature@userobj_accounting active local
raid2 feature@encryption enabled local
raid2 feature@project_quota active local
raid2 feature@device_removal enabled local
raid2 feature@obsolete_counts enabled local
raid2 feature@zpool_checkpoint enabled local
raid2 feature@spacemap_v2 active local
raid2 feature@allocation_classes enabled local
raid2 feature@resilver_defer enabled local
raid2 feature@bookmark_v2 enabled local
raid2 feature@redaction_bookmarks enabled local
raid2 feature@redacted_datasets enabled local
raid2 feature@bookmark_written enabled local
raid2 feature@log_spacemap active local
raid2 feature@livelist enabled local
raid2 feature@device_rebuild enabled local
raid2 feature@zstd_compress active local
raid2 feature@draid enabled local
dataset:
NAME PROPERTY VALUE SOURCE
raid2 compression zstd-1 local
raid2 atime off local
raid2 xattr sa local
raid2 relatime on local
@gamanakis I finally got a chance to test out your patch today. I could not apply it directly, I guess GitHub changed some spaces/tabs, so I patched it in manually. The bottom of this comment shows the surrounding code so I can convince you I applied it correctly.
I ran about 8 zfs send -RLec commands and there were no lockups/panics. I have left the new code running on my server so we can detect any issues as it goes about its daily backup tasks.
    for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
        if (zfeature_active(f, ds->ds_feature[f])) {
            dsl_dataset_activate_feature(dsobj, f,
                ds->ds_feature[f], tx);
        }
    }
    /*
     * We are not allowed to dirty a filesystem when done receiving
     * a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
     * not be set and a subsequent encrypted raw send will fail. Hence
     * activate this feature if needed here.
     */
    if (dmu_tx_is_syncing(tx)) {
        for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
            if (zfeature_active(f, ds->ds_feature_activation[f]) &&
                !(zfeature_active(f, ds->ds_feature[f]))) {
                dsl_dataset_activate_feature(dsobj, f,
                    ds->ds_feature_activation[f], tx);
                ds->ds_feature[f] = ds->ds_feature_activation[f];
            }
        }
    } else {
        cmn_err(CE_NOTE, "not syncing context!");
    }
    ASSERT3U(ds->ds_prev != 0, ==,
        dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);
I am back to testing this. I finally managed to get it working by reverting the commit https://github.com/openzfs/zfs/commit/c8d2ab05e1e6bf21185723fabdebbe5bb3374381. I don't know what went wrong during my first tests, but now I succeeded and I have a working 2.1.7 installation. I tested more than a dozen send/receives and had no hangs or core dumps.
But the new patch from @gamanakis does not work; with it, my zfs freezes again.
Dez 13 07:03:28 rakete kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
Dez 13 07:03:28 rakete kernel: PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
Dez 13 07:03:28 rakete kernel: Showing stack for process 2971
Dez 13 07:03:28 rakete kernel: CPU: 15 PID: 2971 Comm: txg_sync Tainted: P W OE 6.1.0-zen1-1-zen #1 7b46ab8a09476e11acb0e7c950>
Dez 13 07:03:28 rakete kernel: Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ULTRA/X570 AORUS ULTRA, BIOS F36d 07/20/2022
Dez 13 07:03:28 rakete kernel: Call Trace:
Dez 13 07:03:28 rakete kernel: <TASK>
Dez 13 07:03:28 rakete kernel: dump_stack_lvl+0x48/0x60
Dez 13 07:03:28 rakete kernel: spl_panic+0xf4/0x10c [spl 44dc46be6c1ef3141abc1ca53c67e8b523954926]
Dez 13 07:03:28 rakete kernel: dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: dsl_dataset_clone_swap_sync_impl+0x8d4/0xe30 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: dsl_dataset_rollback_sync+0x109/0x1c0 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: dsl_sync_task_sync+0xac/0xf0 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: dsl_pool_sync+0x40d/0x5c0 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: spa_sync+0x56c/0xf90 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: ? spa_txg_history_init_io+0x193/0x1c0 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: txg_sync_thread+0x22b/0x3f0 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: ? cpumask_weight.constprop.0+0x20/0x20 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: ? __thread_exit+0x20/0x20 [spl 44dc46be6c1ef3141abc1ca53c67e8b523954926]
Dez 13 07:03:28 rakete kernel: thread_generic_wrapper+0x5e/0x70 [spl 44dc46be6c1ef3141abc1ca53c67e8b523954926]
Dez 13 07:03:28 rakete kernel: kthread+0xde/0x110
Dez 13 07:03:28 rakete kernel: ? kthread_complete_and_exit+0x20/0x20
Dez 13 07:03:28 rakete kernel: ret_from_fork+0x22/0x30
Dez 13 07:03:28 rakete kernel: </TASK>
I am using syncoid to do the send/receive and the only send options are "-L -I". The only receive options are "-s -F", like here:
zfs send -L -I 'zHome/home'@'2022-12-09--10:32' 'zHome/home'@'2022-12-09--10:47' | mbuffer -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 4720 | zfs receive -s -F 'zstore/data/BACKUP/rakete_home/home'
EDIT: All involved datasets are encrypted with a keyfile
It has gone pretty quiet around this issue. Will a fix be considered for the 2.1.8 release?