zfs send with large blocks is unreliable / data corruption
### System information
| Type | Version/Name |
|---|---|
| Distribution Name | Gentoo |
| Distribution Version | Live |
| Kernel Version | 6.12.51-gentoo |
| Architecture | x86_64 |
| OpenZFS Version | 2.3.4 |
| ECC RAM | Yes |
### Describe the problem you're observing
Confusing error messages from ZFS and send streams that fail zstreamdump verification.
### Describe how to reproduce the problem
I don't know how reproducible it is:
I copied all data from one SSD pool to another using zfs send -L -R -e -c ssd@move | zfs recv -F new.
I then destroyed the original pool, renamed new to ssd, and created 'new' again with zpool create new -o ashift=12 -o compatibility=openzfs-2.0-linux raidz .... Note that the compatibility is lower; I had wanted to set that on the intermediate pool as well, but forgot, so it used all the features from ZFS 2.3.4.
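Roughly, the full sequence looked like this (the snapshot step and the rename via export/import are reconstructed from memory, so the exact commands may differ slightly):

```sh
zfs snapshot -r ssd@move
zfs send -L -R -e -c ssd@move | zfs recv -F new
zpool destroy ssd
zpool export new
zpool import new ssd    # rename the intermediate pool back to 'ssd'
# vdev list elided, as above
zpool create new -o ashift=12 -o compatibility=openzfs-2.0-linux raidz ...
```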
When running the same migration command, zfs send -L -R -e -c ssd@move | zfs recv -F new, it failed with:
cannot receive new filesystem stream: incompatible embedded data stream feature with encrypted receive.
I don't use ZFS encryption on either the source or the destination pool. Most datasets copied correctly, but 'ssd/home' didn't. When running without the embedded-data and compression flags:
$ zfs send -L -R ssd/home@move | zfs recv new/home
cannot receive new filesystem stream: invalid backup stream
Interestingly, if I just run zfs send -L -R -e -c ssd/home@move | zstreamdump, it produces a valid stream with multiple child datasets (see attached). But if I run:
zfs send -L ssd/home@move | zstreamdump
BEGIN record
hdrtype = 1
features = 800004
magic = 2f5bacbac
creation_time = 68e6180f
type = 2
flags = 0xc
toguid = 2e95a3e59984549e
fromguid = 0
toname = ssd/home@move
payloadlen = 0
invalid checksum
Incorrect checksum in record header.
Expected checksum = 3d53e21dc5327c23/99576f8530ce791c/940bce62d8b523c2/239a036d344a2d8
SUMMARY:
Total DRR_BEGIN records = 1 (0 bytes)
Total DRR_END records = 0 (0 bytes)
Total DRR_OBJECT records = 134126 (23634344 bytes)
Total DRR_FREEOBJECTS records = 7076 (0 bytes)
Total DRR_WRITE records = 216867 (12381282816 bytes)
Total DRR_WRITE_BYREF records = 0 (0 bytes)
Total DRR_WRITE_EMBEDDED records = 0 (0 bytes)
Total DRR_FREE records = 136992 (0 bytes)
Total DRR_SPILL records = 0 (0 bytes)
Total records = 495062
Total payload size = 12404917160 (0x2e36403a8)
Total header overhead = 154459344 (0x934dcd0)
Total stream length = 12559376816 (0x2ec98e1b0)
The stream fails at random points: sometimes sooner, sometimes later, and sometimes not at all, in which case it produces a valid stream. The system is very stable, with no CPU issues ever seen and no unexpected crashes in the logs, so I can rule out general instability. The full replication stream with embedding and compression is produced correctly.
Scrub returns clean, while zfs diff on the dataset gives errors and reading the 'logs' directory gives an I/O error. It looks like something got corrupted during the previous send?
zfs diff ssd/home@2025-10-08 ssd/home@move
M /home/ivan
M /home/ivan/experiments/code/temperature-control/target/debug/temperature-server.d
M /home/ivan/.motd
Unable to determine path or stats for object 296192 in ssd/home@move: Input/output error
- /home/minecraft/server-irina/logs/2025-05-06-1.log.gz
Unable to determine path or stats for object 296193 in ssd/home@move: Input/output error
- /home/minecraft/server-irina/logs/2025-04-26-1.log.gz
M /home/ivan/experiments/code/temperature-control/server_state_v2.bin
### Include any warning/errors/backtraces from the system logs
Something went terribly wrong. I deleted the datasets and restored from backup, and everything seemed to be fine, but now I'm getting:
[80183.934665] PANIC: zfs: accessing past end of object 4ea/b04c9 (size=259072 access=259068+4096)
[80183.934677] Showing stack for process 737938
[80183.934681] CPU: 27 UID: 1000 PID: 737938 Comm: ThreadPoolForeg Tainted: G T 6.12.52-x86_64-dirty #51
[80183.934682] Tainted: [T]=RANDSTRUCT
[80183.934683] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS, BIOS 3287 09/20/2025
[80183.934684] Call Trace:
[80183.934686] <TASK>
[80183.934688] dump_stack_lvl+0x4b/0x70
[80183.934693] vcmn_err.cold+0x54/0x7f
[80183.934697] ? dbuf_find+0x23d/0x250
[80183.934698] ? arc_access+0x325/0x610
[80183.934699] zfs_panic_recover+0x61/0x70
[80183.934702] dmu_buf_hold_array_by_dnode+0x20c/0x680
[80183.934703] ? dnode_hold_impl+0xad6/0x1300
[80183.934704] dmu_write_uio_dnode+0x85/0x2c0
[80183.934705] ? dmu_object_set_blocksize+0x58/0x70
[80183.934707] zfs_write+0x4d1/0xcb0
[80183.934710] zpl_iter_write+0x10d/0x170
[80183.934711] vfs_write+0x230/0x3e0
[80183.934713] __x64_sys_pwrite64+0x76/0xb0
[80183.934714] do_syscall_64+0x52/0x120
[80183.934715] entry_SYSCALL_64_after_hwframe+0x50/0x58
[80183.934717] RIP: 0033:0x7fe5fde78572
[80183.934718] Code: 08 0f 85 01 49 ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08
0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 83 ec 08
[80183.934719] RSP: 002b:00007fe5ef58c3e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[80183.934721] RAX: ffffffffffffffda RBX: 00007fe5ef58f6c0 RCX: 00007fe5fde78572
[80183.934721] RDX: 0000000000001000 RSI: 0000278c04b48800 RDI: 0000000000000111
[80183.934722] RBP: 00007fe5ef58c490 R08: 0000000000000000 R09: 0000000000000000
[80183.934722] R10: 000000000003f3fc R11: 0000000000000246 R12: 0000000000000111
[80183.934723] R13: 0000278c000c8100 R14: 0000000000001000 R15: 000000000003f3fc
[80183.934724] </TASK>
....
[80364.904002] INFO: task ThreadPoolForeg:737938 blocked for more than 122 seconds.
[80364.904013] Tainted: G T 6.12.52-x86_64-dirty #51
[80364.904016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80364.904018] task:ThreadPoolForeg state:D stack:0 pid:737938 tgid:737888 ppid:737874 flags:0x00000002
[80364.904023] Call Trace:
[80364.904025] <TASK>
[80364.904029] __schedule+0x4ba/0xb50
[80364.904037] schedule+0x27/0xe0
[80364.904040] vcmn_err.cold+0x69/0x7f
[80364.904044] ? dbuf_find+0x23d/0x250
[80364.904047] ? arc_access+0x325/0x610
[80364.904050] zfs_panic_recover+0x61/0x70
[80364.904055] dmu_buf_hold_array_by_dnode+0x20c/0x680
[80364.904057] ? dnode_hold_impl+0xad6/0x1300
[80364.904060] dmu_write_uio_dnode+0x85/0x2c0
[80364.904062] ? dmu_object_set_blocksize+0x58/0x70
[80364.904066] zfs_write+0x4d1/0xcb0
[80364.904070] zpl_iter_write+0x10d/0x170
[80364.904073] vfs_write+0x230/0x3e0
[80364.904077] __x64_sys_pwrite64+0x76/0xb0
[80364.904079] do_syscall_64+0x52/0x120
[80364.904082] entry_SYSCALL_64_after_hwframe+0x50/0x58
[80364.904085] RIP: 0033:0x7fe5fde78572
[80364.904088] RSP: 002b:00007fe5ef58c3e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[80364.904091] RAX: ffffffffffffffda RBX: 00007fe5ef58f6c0 RCX: 00007fe5fde78572
[80364.904093] RDX: 0000000000001000 RSI: 0000278c04b48800 RDI: 0000000000000111
[80364.904094] RBP: 00007fe5ef58c490 R08: 0000000000000000 R09: 0000000000000000
[80364.904096] R10: 000000000003f3fc R11: 0000000000000246 R12: 0000000000000111
[80364.904098] R13: 0000278c000c8100 R14: 0000000000001000 R15: 000000000003f3fc
[80364.904100] </TASK>
[80487.784292] INFO: task txg_quiesce:5009 blocked for more than 122 seconds.
[80487.784301] Tainted: G T 6.12.52-x86_64-dirty #51
[80487.784303] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80487.784305] task:txg_quiesce state:D stack:0 pid:5009 tgid:5009 ppid:2 flags:0x00004000
[80487.784309] Call Trace:
[80487.784311] <TASK>
[80487.784316] __schedule+0x4ba/0xb50
[80487.784322] schedule+0x27/0xe0
[80487.784325] cv_wait_common+0xdb/0x110
[80487.784328] ? prepare_to_wait_event+0x180/0x180
[80487.784332] txg_quiesce_thread+0x257/0x400
[80487.784335] ? txg_delay+0x170/0x170
[80487.784336] ? current_is_reclaim_thread+0x20/0x20
[80487.784339] thread_generic_wrapper+0x5a/0x70
[80487.784342] kthread+0xcf/0x100
[80487.784344] ? kthread_park+0x80/0x80
[80487.784346] ret_from_fork+0x31/0x50
[80487.784349] ? kthread_park+0x80/0x80
[80487.784351] ret_from_fork_asm+0x11/0x20
[80487.784354] </TASK>
[80487.784548] INFO: task ThreadPoolForeg:737938 blocked for more than 245 seconds.
[80487.784550] Tainted: G T 6.12.52-x86_64-dirty #51
[80487.784552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80487.784553] task:ThreadPoolForeg state:D stack:0 pid:737938 tgid:737888 ppid:737874 flags:0x00000002
The system stays functional, though. I believe the same failure happens on both 2.2.8 and 2.3.4; this one is from 2.3.4. Something definitely went wrong while I was moving data from SSD to SSD those two times. I don't know whether the corruption spread to the HDD archive; ECC RAM didn't seem to help me. :(
Actually, I don't know; it might be a different, unrelated problem. I think I identified the file in question, and it seems pretty short-lived (already overwritten after reboot) and can be read without issues. Scrub returned clean:
$ find /home/ivan/.cache -inum 722121
/home/ivan/.cache/google-chrome-config/Default/History-journal
I don't know what happened with the data corruption, but I restored from backup and everything was working fine for a while, until I tried another data migration. Still the same errors. For some datasets I'm again getting zstreamdump corruption at random points, roughly ~100G into the stream:
$ zfs send ssd/home@2025-10-25 | zstreamdump
END checksum = 4f9578acb6b0bb42/fa1b021aea65d39f/4193763805b65e17/a954189c504e795f
$ zfs send ssd/home@2025-10-25 | zstreamdump
Incorrect checksum in record header.
Expected checksum = 4d5a3d2d9634ccdb/a1236048a739b40e/14b0608e84caf759/7035bd097ed11a5
For a much larger one (~1T) there is no corruption:
$ zfs send -L ssd/games/SteamLibrary@2025-10-25 | zstreamdump
The same happens on 2.2.8. I have the data replicated from the striped SSDs to a HDD raidz; I can read the same snapshot from there without an issue and it passes zstreamdump fine.
No data errors from zpool scrub. Any hints on how I can debug this?
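For reference, this is roughly how I've been checking whether repeated sends of the same snapshot are even byte-identical, on the assumption that repeated sends of an unchanged snapshot should produce the same stream (snapshot name taken from the runs above):

```sh
# Hash several sends of the same snapshot; differing hashes point at the
# send side rather than at the data stored on disk.
SNAP=ssd/home@2025-10-25
for i in 1 2 3; do
    zfs send "$SNAP" | sha256sum
done
```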
Module parameters in use:
zfs_arc_min=8G zfs_arc_max=100G zfs.l2arc_noprefetch=0 zfs_txg_timeout=120 spl.spl_taskq_thread_bind=0 spl.spl_taskq_thread_priority=0 zfs.l2arc_rebuild_blocks_min_l2size=1 zfs.zfs_dmu_offset_next_sync=0 zfs.zfs_bclone_enabled=0 zfs.zfs_dio_enabled=0 zfs.zfs_bclone_wait_dirty=0
$ zfs get all ssd/home
NAME PROPERTY VALUE SOURCE
ssd/home type filesystem -
ssd/home creation Thu Oct 16 21:20 2025 -
ssd/home used 108G -
ssd/home available 1.21T -
ssd/home referenced 72.7G -
ssd/home compressratio 1.40x -
ssd/home mounted yes -
ssd/home quota none default
ssd/home reservation none default
ssd/home recordsize 128K default
ssd/home mountpoint /home local
ssd/home sharenfs off default
ssd/home checksum on default
ssd/home compression on inherited from ssd
ssd/home atime on default
ssd/home devices on default
ssd/home exec on default
ssd/home setuid on default
ssd/home readonly off default
ssd/home zoned off default
ssd/home snapdir hidden default
ssd/home aclmode discard inherited from ssd
ssd/home aclinherit restricted default
ssd/home createtxg 125 -
ssd/home canmount on default
ssd/home xattr on inherited from ssd
ssd/home copies 1 default
ssd/home version 5 -
ssd/home utf8only off -
ssd/home normalization none -
ssd/home casesensitivity sensitive -
ssd/home vscan off default
ssd/home nbmand off default
ssd/home sharesmb off default
ssd/home refquota none default
ssd/home refreservation none default
ssd/home guid 17780824250341955445 -
ssd/home primarycache all default
ssd/home secondarycache all default
ssd/home usedbysnapshots 277M -
ssd/home usedbydataset 72.7G -
ssd/home usedbychildren 35.0G -
ssd/home usedbyrefreservation 0B -
ssd/home logbias latency default
ssd/home objsetid 1681 -
ssd/home dedup off default
ssd/home mlslabel none default
ssd/home sync disabled inherited from ssd
ssd/home dnodesize legacy default
ssd/home refcompressratio 1.50x -
ssd/home written 24.1M -
ssd/home logicalused 147G -
ssd/home logicalreferenced 106G -
ssd/home volmode default default
ssd/home filesystem_limit none default
ssd/home snapshot_limit none default
ssd/home filesystem_count none default
ssd/home snapshot_count none default
ssd/home snapdev hidden default
ssd/home acltype off inherited from ssd
ssd/home context none default
ssd/home fscontext none default
ssd/home defcontext none default
ssd/home rootcontext none default
ssd/home relatime on inherited from ssd
ssd/home redundant_metadata all default
ssd/home overlay on default
ssd/home encryption off default
ssd/home keylocation none default
ssd/home keyformat none default
ssd/home pbkdf2iters 0 default
ssd/home special_small_blocks 0 default
ssd/home snapshots_changed Sat Oct 25 18:47:54 2025 -
ssd/home prefetch all default
$ zpool get all ssd
NAME PROPERTY VALUE SOURCE
ssd size 3.19T -
ssd capacity 58% -
ssd altroot - default
ssd health ONLINE -
ssd guid 16136043115540145169 -
ssd version - default
ssd bootfs - default
ssd delegation on default
ssd autoreplace off default
ssd cachefile - default
ssd failmode wait default
ssd listsnapshots off default
ssd autoexpand off default
ssd dedupratio 1.00x -
ssd free 1.31T -
ssd allocated 1.88T -
ssd readonly off -
ssd ashift 12 local
ssd comment - default
ssd expandsize - -
ssd freeing 0 -
ssd fragmentation 4% -
ssd leaked 0 -
ssd multihost off default
ssd checkpoint - -
ssd load_guid 12418271514860811335 -
ssd autotrim on local
ssd compatibility zfs2.0-blake local
ssd bcloneused 0 -
ssd bclonesaved 0 -
ssd bcloneratio 1.00x -
ssd feature@async_destroy enabled local
ssd feature@empty_bpobj active local
ssd feature@lz4_compress active local
ssd feature@multi_vdev_crash_dump enabled local
ssd feature@spacemap_histogram active local
ssd feature@enabled_txg active local
ssd feature@hole_birth active local
ssd feature@extensible_dataset active local
ssd feature@embedded_data active local
ssd feature@bookmarks active local
ssd feature@filesystem_limits enabled local
ssd feature@large_blocks active local
ssd feature@large_dnode active local
ssd feature@sha512 enabled local
ssd feature@skein enabled local
ssd feature@edonr enabled local
ssd feature@userobj_accounting active local
ssd feature@encryption active local
ssd feature@project_quota active local
ssd feature@device_removal enabled local
ssd feature@obsolete_counts enabled local
ssd feature@zpool_checkpoint enabled local
ssd feature@spacemap_v2 active local
ssd feature@allocation_classes enabled local
ssd feature@resilver_defer enabled local
ssd feature@bookmark_v2 active local
ssd feature@redaction_bookmarks enabled local
ssd feature@redacted_datasets enabled local
ssd feature@bookmark_written active local
ssd feature@log_spacemap active local
ssd feature@livelist enabled local
ssd feature@device_rebuild enabled local
ssd feature@zstd_compress enabled local
ssd feature@draid disabled local
ssd feature@zilsaxattr disabled local
ssd feature@head_errlog disabled local
ssd feature@blake3 enabled local
ssd feature@block_cloning disabled local
ssd feature@vdev_zaps_v2 disabled local
Example of verbose output of zstreamdump -v 1.
FREE object = 295739 offset = 18446744073709551104 length = 511
OBJECT object = 295740 type = 19 bonustype = 44 blksz = 131072 bonuslen = 192 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 295740 offset = 0 length = -1
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 0 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 131072 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 262144 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 393216 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 524288 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 655360 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 786432 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = f007f00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
FREE object = 295740 offset = 917504 length = -1048576
WRITE object = 295740 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 18446744073709420544 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
invalid checksum
Incorrect checksum in record header.
Expected checksum = 3975d21b0ab6c42a/da786bd720cdf74d/ff73fb46e8bd66dd/9b1dc97e2fc41bf
- Successful, no checksum errors
WRITE object = 296003 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 5111808 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = f00c700ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
FREE object = 296003 offset = 5242880 length = -5373952
WRITE object = 296003 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 18446744073709420544 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
OBJECT object = 296004 type = 19 bonustype = 44 blksz = 131072 bonuslen = 192 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 296004 offset = 0 length = -1
WRITE object = 296004 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 0 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = f002f00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 296004 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 131072 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = f002f00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
FREE object = 296004 offset = 262144 length = -393216
WRITE object = 296004 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 18446744073709420544 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
Incorrect checksum in record header.
Expected checksum = 3c0ec487f9414230/9f1d025594c06c71/159619241d1d0704/297fe445157a589b
Looks like the offset randomly becomes 0xfffffffffffe0000 (18446744073709420544), i.e. -131072 (-128K):
WRITE object = 295740 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 18446744073709420544 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 296004 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 18446744073709420544 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
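Just to sanity-check that reading of the offset: the unsigned value is exactly the 64-bit two's-complement encoding of -131072 (shell arithmetic is signed 64-bit):

```sh
$ echo $((0xfffffffffffe0000))
-131072
$ printf '%#x\n' -131072
0xfffffffffffe0000
```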
Looks like data corruption indeed:
$ cat minecraft/server-irina/world/region/r.-1.-4.mca
Segmentation fault
dmesg:
[ 114.915734] usercopy: Kernel memory exposure attempt detected from SLUB object 'zio_data_buf_512' (offset 0, size 74080)!
[ 114.915752] ------------[ cut here ]------------
[ 114.915754] kernel BUG at mm/usercopy.c:102!
[ 114.915759] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
[ 114.915762] CPU: 11 UID: 0 PID: 10766 Comm: cat Tainted: G T 6.12.53-x86_64-dirty #81
[ 114.915765] Tainted: [T]=RANDSTRUCT
[ 114.915766] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS, BIOS 3287 09/20/2025
[ 114.915768] RIP: 0010:usercopy_abort+0x78/0x7a
[ 114.915773] Code: 16 b4 6d 9e eb 0e 48 c7 c2 63 58 70 9e 48 c7 c7 db 07 6d 9e 56 48 89 fe 48 c7 c7 10 8c 76 9e 51 48 89 c1 41 52 e8 c8 48 ff ff <0f> 0b 48 89 d9 49 89 e8 44 89 e2 31 f6 48 81 e9 00 00 00 9d 48 c7
[ 114.915776] RSP: 0018:ffffcc907e0a7c70 EFLAGS: 00010282
[ 114.915778] RAX: 000000000000006d RBX: ffff89226ceb1a00 RCX: 0000000000000000
[ 114.915780] RDX: ffff8940de0edfc0 RSI: ffff8940de0e0940 RDI: ffff8940de0e0940
[ 114.915782] RBP: 0000000000012160 R08: 0000000000000000 R09: ffffcc907e0a7b28
[ 114.915784] R10: ffffffff9eb17ea8 R11: 0000000000000003 R12: 0000000000000001
[ 114.915785] R13: ffff89226cec3b60 R14: ffffcc907e0a7e78 R15: ffffcc907e0a7de0
[ 114.915787] FS: 00007f9d45e8c740(0000) GS:ffff8940de0c0000(0000) knlGS:0000000000000000
[ 114.915788] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 114.915790] CR2: 00007f9d45b1c000 CR3: 0000001edbfec000 CR4: 0000000000750ef0
[ 114.915792] PKRU: 55555554
[ 114.915794] Call Trace:
[ 114.915797] <TASK>
[ 114.915799] __check_heap_object+0x7e/0xa0
[ 114.915803] __check_object_size+0x1ec/0x200
[ 114.915807] zfs_uiomove_iter+0x40/0xc0
[ 114.915810] dmu_read_uio_dnode+0xb1/0x160
[ 114.915813] zfs_read+0x3c7/0x550
[ 114.915816] zpl_iter_read+0x98/0xf0
[ 114.915819] vfs_read+0x1de/0x300
[ 114.915822] ksys_read+0x5d/0xd0
[ 114.915824] do_syscall_64+0x52/0x120
[ 114.915827] entry_SYSCALL_64_after_hwframe+0x50/0x58
[ 114.915831] RIP: 0033:0x7f9d45f1f977
[ 114.915833] Code: 00 49 89 d0 48 89 fa 4c 89 df e8 54 b8 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 16 5b c3 0f 1f 40 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 e2 e8 23 ff ff ff
[ 114.915836] RSP: 002b:00007ffc578556f0 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
[ 114.915838] RAX: ffffffffffffffda RBX: 00007f9d45e8c740 RCX: 00007f9d45f1f977
[ 114.915840] RDX: 0000000000080000 RSI: 00007f9d45b1d000 RDI: 0000000000000003
[ 114.915842] RBP: 0000000000080000 R08: 0000000000000000 R09: 0000000000000000
[ 114.915843] R10: 0000000000000000 R11: 0000000000000202 R12: 00007f9d45b1d000
[ 114.915845] R13: 0000000000000003 R14: 00007f9d460b0000 R15: 0000000000040000
[ 114.915848] </TASK>
[ 114.915849] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq zram lz4_compress lz4_decompress netconsole wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel cls_fw sch_pie sch_sfq sch_htb cls_u32 sch_ingress sch_cake xt_mark xt_DSCP xt_owner ip6table_mangle ip6table_filter ip6_tables ts_bm xt_string iptable_mangle xt_TCPMSS pppoe xt_MASQUERADE pppox ppp_generic slhc 8021q garp mrp xt_LOG nf_log_syslog xt_conntrack xt_length xt_nat xt_set iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter ip_tables ip_set_hash_ip ip_set_hash_net ip_set act_mirred ifb amdgpu amd_atl intel_rapl_msr intel_rapl_common snd_hda_codec_realtek amd64_edac snd_hda_codec_generic edac_mce_amd i2c_algo_bit snd_hda_scodec_component snd_hda_codec_hdmi drm_exec kvm_amd snd_hda_intel drm_suballoc_helper amdxcp snd_intel_dspcfg mfd_core kvm snd_intel_sdw_acpi snd_usb_audio drm_buddy bridge irqbypass snd_hda_codec snd_usbmidi_lib
[ 114.915882] gpu_sched stp crct10dif_pclmul snd_hda_core snd_rawmidi eeepc_wmi drm_ttm_helper llc ghash_clmulni_intel snd_hwdep snd_seq_device asus_wmi ttm sp5100_tco sha512_ssse3 snd_pcm sparse_keymap cfg80211 drm_display_helper sha256_ssse3 snd_timer platform_profile drm_kms_helper rfkill sha1_ssse3 i2c_piix4 wmi_bmof asus_ec_sensors snd video rapl jc42 mc efi_pstore ccp pcspkr k10temp i2c_smbus gpio_amdpt soundcore i2c_designware_platform regmap_i2c wmi gpio_generic joydev i2c_designware_core mac_hid nfsd auth_rpcgss nct6775 nfs_acl nct6775_core lockd drm hwmon_vid grace backlight msr fuse loop sunrpc i2c_core dm_mod configfs nfnetlink efivarfs sd_mod r8169 realtek crc32_pclmul xhci_pci mdio_devres ahci nvme crc32c_intel xhci_hcd libphy libahci nvme_core aesni_intel gf128mul crypto_simd cryptd
[ 114.915984] ---[ end trace 0000000000000000 ]---
[ 115.041833] RIP: 0010:usercopy_abort+0x78/0x7a
[ 115.041847] Code: 16 b4 6d 9e eb 0e 48 c7 c2 63 58 70 9e 48 c7 c7 db 07 6d 9e 56 48 89 fe 48 c7 c7 10 8c 76 9e 51 48 89 c1 41 52 e8 c8 48 ff ff <0f> 0b 48 89 d9 49 89 e8 44 89 e2 31 f6 48 81 e9 00 00 00 9d 48 c7
[ 115.041850] RSP: 0018:ffffcc907e0a7c70 EFLAGS: 00010282
[ 115.041853] RAX: 000000000000006d RBX: ffff89226ceb1a00 RCX: 0000000000000000
[ 115.041855] RDX: ffff8940de0edfc0 RSI: ffff8940de0e0940 RDI: ffff8940de0e0940
[ 115.041860] RBP: 0000000000012160 R08: 0000000000000000 R09: ffffcc907e0a7b28
[ 115.041865] R10: ffffffff9eb17ea8 R11: 0000000000000003 R12: 0000000000000001
[ 115.041867] R13: ffff89226cec3b60 R14: ffffcc907e0a7e78 R15: ffffcc907e0a7de0
[ 115.041869] FS: 00007f9d45e8c740(0000) GS:ffff8940de540000(0000) knlGS:0000000000000000
[ 115.041871] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 115.041872] CR2: 00007f7763fd31a8 CR3: 0000001edbfec000 CR4: 0000000000750ef0
[ 115.041874] PKRU: 55555554
No idea how I'm getting these with ECC RAM.
On the backup, the entire directory is toast in recent snapshots, even though the files are old and haven't been touched for half a year:
$ ls minecraft/server-irina/world
ls: reading directory 'minecraft/server-irina/world': Input/output error
No dmesg errors this time. I compiled ZFS built into the kernel with debug, debuginfo, ASAN, and UBSAN. It looks like it doesn't trap on any additional checks, or maybe they are not enabled when ZFS is built in rather than loaded as a module. Maybe I should build the entire kernel with ASAN and UBSAN? It looks like only one dataset is affected, so I'll try to migrate all the data off it. Two years of smooth sailing with ZFS and suddenly this; I don't know how to contain the corruption. It sounds like I have to copy everything off ZFS to be sure nothing dodgy is happening in RAM.
I rsync'd the data out and destroyed the bad dataset, which tripped an assert that left the pool importable only read-only:
VERIFY0(ba.ba_phys->bt_uncomp) failed (130048)
PANIC at bptree.c:298:bptree_iterate()
Showing stack for process 5081
CPU: 28 UID: 0 PID: 5081 Comm: txg_sync Tainted: G T 6.12.53-x86_64-dirty #85
Tainted: [T]=RANDSTRUCT
Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS, BIOS 3287 09/20/2025
Call Trace:
<TASK>
dump_stack_lvl+0x4b/0x70
spl_panic+0xd3/0xea
? dmu_buf_will_dirty_impl+0xd7/0x310
? dnode_partial_zero+0x109/0x180
? dnode_free_range+0x3fa/0x8a0
? dnode_rele_and_unlock+0x79/0x1e0
bptree_iterate+0x346/0x4c0
? dsl_scan_obsolete_block_cb+0xc0/0xc0
dsl_process_async_destroys+0xa5/0x720
dsl_scan_sync+0x175/0xc20
? __list_del_entry+0x9/0x20
? list_del+0x9/0x20
spa_sync_iterate_to_convergence+0x141/0x320
spa_sync+0x2f8/0x8d0
txg_sync_thread+0x274/0x3c0
? txg_is_quiescing+0x50/0x50
? spl_assert.constprop.0+0x20/0x20
thread_generic_wrapper+0x63/0x90
kthread+0xcf/0x100
? kthread_park+0x80/0x80
ret_from_fork+0x31/0x50
? kthread_park+0x80/0x80
ret_from_fork_asm+0x11/0x20
</TASK>
Interestingly, for another dataset:
- I can successfully rsync the data out
- zstreamdump doesn't find any errors
- but it trips the assertion on receive
I'll try disabling the RANDSTRUCT kernel hardening to see if it is related.
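For anyone following along, RANDSTRUCT is the structure-layout randomization the [T] taint flag above refers to; checking and disabling it is just a kernel config change (the zgrep only works if the kernel exposes /proc/config.gz):

```sh
# Check whether struct layout randomization is compiled in.
zgrep RANDSTRUCT /proc/config.gz
# To rule it out, rebuild the kernel with:
#   CONFIG_RANDSTRUCT_NONE=y
```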
The assertion looks pretty strange though.
VERIFY3U(drrw->drr_logical_size, >, dn->dn_datablksz) failed (22528 > 131072)
PANIC at dmu_recv.c:2264:flush_write_batch_impl()
Showing stack for process 10396
CPU: 15 UID: 0 PID: 10396 Comm: receive_writer Tainted: G T 6.12.52-x86_64-dirty #89
Tainted: [T]=RANDSTRUCT
Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS, BIOS 3287 09/20/2025
Call Trace:
<TASK>
dump_stack_lvl+0x4b/0x70
spl_panic+0xd3/0xea
? txg_list_add+0xa1/0x100
? dsl_dir_dirty+0x38/0x80
? dsl_pool_dirty_space+0xbd/0x170
? kfree+0x20d/0x280
flush_write_batch_impl+0x3c1/0x520
flush_write_batch+0x36/0xa0
receive_process_record+0xc1/0x350
receive_writer_thread+0xbb/0x220
? spl_thread_exit+0x10/0x10
? spl_assert.constprop.0+0x20/0x20
thread_generic_wrapper+0x63/0x90
kthread+0xcf/0x100
? kthread_park+0x80/0x80
ret_from_fork+0x31/0x50
? kthread_park+0x80/0x80
ret_from_fork_asm+0x11/0x20
</TASK>
I cannot receive from any dataset where large blocks were used without tripping the assertion. I'm rsync-ing my way out and starting all over from clean snapshots without large blocks. How I ended up here:
- I had the original pool with large blocks.
- I had an 'archive' pool to which I was sending with -L.
- I moved the original pool to another one (v2) with -R -L -e -c and it seemed to have worked fine.
- I could not send -L -R -e -c from v2 to v3; that's when the corruption started.
- I restored most of the data from 'archive' to v3 with -L.
- I kept syncing daily from v3 to archive.
- I tried to move from v3 to v4, which is when I noticed corruption on v3 and on the archive.
- zpool scrub never found any problems on v2, v3, or v4.
- I compiled ZFS with debug, ASAN, and UBSAN and noticed the assert tripping on receive, which I believe now helps me contain the corruption.
- I'm rsyncing data out of all datasets with large blocks and starting clean, without any large blocks in the snapshots. I can send those without tripping the asserts.
- I'm going to disable the large_blocks feature on the new pool (v5), I guess, to make sure I don't have any of this taint in my data.
- I think if I try to delete the bad dataset from the archive it will break that pool too, so I have to migrate all the data off there somehow before cleaning up.
Just to be sure: you didn't over/underclock your CPU/RAM? I played with under/overclocking some time ago (but with non-ECC RAM) and I believe the CPU may corrupt data in its caches in these extreme cases (I had explicit corruption with Intel when undervolting, and silent corruption on AMD Ryzen with an fclk (Infinity Fabric) overclock). But I don't think that's the case here.
You have quite a similar idea to what I had: how can everything be that broken? It must be something with my system, but then how come:
- scrubs are completely clean?
- I can recompile Gentoo for hours without issues
- the first send/recv was clean and took less than an hour (the second one, an hour later, was corrupted as hell and the problems started)
- and still scrub was clean?
I run stock clocks on the 7950X3D, and I have ECC RAM (KSM56E46BD8KM-32HA) that is capable of running at 5200 but runs at 3600, since the AMD memory controller struggles with 4 DIMMs. I don't mind that, as I value stability. I had some issues with a PSU that caused SSD disconnects due to voltage drops, but now I have a 1600W PSU that holds voltages really well. I can build the Gentoo world without any issues (this catches instability really well) and I never had any corruption issues with ZFS before. As I was going crazy over these corruptions I even ran memtest successfully for 12 hours, but people say that means nothing.
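For what it's worth, one more thing worth checking is whether the kernel's EDAC layer has ever counted corrected or uncorrected memory errors; this assumes the amd64_edac driver from the module list above is loaded, otherwise these sysfs files won't exist:

```sh
# EDAC counters: ce_count = corrected errors, ue_count = uncorrected errors.
grep . /sys/devices/system/edac/mc/mc*/ce_count \
       /sys/devices/system/edac/mc/mc*/ue_count
```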
The debug build helped me regain my sanity and my trust in ZFS and my system. Now, with full ZFS debugging, ASAN, and UBSAN, I'm being really careful not to let the corruption spread. The failed assert clearly gives away that the code path taken is incorrect.
/*
* The WRITE record is larger than the object's block
* size. We must be receiving an incremental
* large-block stream into a dataset that previously did
* a non-large-block receive. Lightweight writes must
* be exactly one block, so we need to decompress the
* data (if compressed) and do a normal dmu_write().
*/
ASSERT3U(drrw->drr_logical_size, >, dn->dn_datablksz);
My theory is that ZFS forgets about the large blocks feature after a couple of send/receives, especially if the recordsize is back to 128K but some large blocks remain from when it was set differently at some point. All my sends were automated and used the -L flag. And this assert trips for a non-incremental replication stream.
It is not random: I get the assertion tripped in the receiver by any full replication send from any dataset with large blocks that I have (now with or without -L). I believe it is causing the corruption that manifests in lots of spectacular ways later. I can rsync from those datasets without any issue, and I can send/recv from datasets without large blocks. At this point I think I have contained the corruption and am cleaning up the remaining datasets without large blocks. I might want to keep running a debug ZFS build from now on.
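A rough way to tell which datasets still contain blocks larger than 128K is to scan a -L send stream through zstreamdump; this is only a sketch, with the dataset name as an example and the field names taken from the dumps above:

```sh
# Count WRITE records with logical_size > 128K in a -L full send stream.
zfs send -L ssd/home@move | zstreamdump -v |
    awk '/^WRITE/ {
            for (i = 1; i < NF; i++)
                if ($i == "logical_size" && $(i + 2) > 131072)
                    n++
         }
         END { print n + 0, "WRITE records larger than 128K" }'
```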
Maybe related to https://github.com/openzfs/zfs/pull/10383? @ahrens maybe you can spot something specific here ^
I haven't found an easy reproducer for the corruption yet, but I can crash my debug build of ZFS 2.3.4 (Linux 6.12.56) with the following minimal reproducer:
#!/bin/bash
TEST_DS=exp/test
# Start from a clean test dataset.
zfs destroy -r -f $TEST_DS
zfs create -p $TEST_DS
zfs create $TEST_DS/source1
# Enable large blocks on the source; with this line commented out the receive succeeds.
zfs set recordsize=1M $TEST_DS/source1
# A single file larger than 128K but smaller than the 1M recordsize.
dd if=/dev/urandom of=/$TEST_DS/source1/file bs=300000 count=1
zfs snapshot -r $TEST_DS/source1@snap1
zfs create $TEST_DS/replica
# A full replication send without -L panics the receiver on a debug build.
zfs send -R $TEST_DS/source1@snap1 | zfs recv -F $TEST_DS/replica/small
This causes a panic on a debug check:
[ 109.375242] VERIFY3U(drrw->drr_logical_size, >, dn->dn_datablksz) failed (37888 > 131072)
[ 109.376540] PANIC at dmu_recv.c:2264:flush_write_batch_impl()
[ 109.377542] Showing stack for process 1779
[ 109.378068] CPU: 21 UID: 0 PID: 1779 Comm: receive_writer Not tainted 6.12.53-x86_64-dirty #93
[ 109.378070] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20251019_163104-localhost 04/01/2014
[ 109.378071] Call Trace:
[ 109.378073] <TASK>
[ 109.378076] dump_stack_lvl+0x4b/0x70
[ 109.378080] spl_panic+0xd3/0xea
[ 109.378082] ? _raw_spin_unlock+0xe/0x20
[ 109.378083] ? txg_list_add+0xa1/0x100
[ 109.378085] ? dsl_dir_dirty+0x38/0x80
[ 109.378087] ? _raw_spin_unlock+0xe/0x20
[ 109.378088] ? dsl_pool_dirty_space+0xbd/0x170
[ 109.378089] ? kfree+0x20d/0x280
[ 109.378091] flush_write_batch_impl+0x3c1/0x520
[ 109.378092] flush_write_batch+0x36/0xa0
[ 109.378093] receive_process_record+0xc1/0x350
[ 109.378094] receive_writer_thread+0xbb/0x220
[ 109.378095] ? spl_thread_exit+0x10/0x10
[ 109.378095] ? spl_assert.constprop.0+0x20/0x20
[ 109.378096] thread_generic_wrapper+0x63/0x90
[ 109.378098] kthread+0xcf/0x100
[ 109.378099] ? kthread_park+0x80/0x80
[ 109.378100] ret_from_fork+0x31/0x50
[ 109.378102] ? kthread_park+0x80/0x80
[ 109.378102] ret_from_fork_asm+0x11/0x20
[ 109.378104] </TASK>
I tested this in a VM. It's kind of surprising that it crashes; does that mean this code path is not covered by the test suite? To be fair, it doesn't look like the cause of the corruption: this path is related to splitting blocks, and in my case above it was producing randomly broken streams when the corruption was already there. The source of the corruption was probably something else.
I tried to figure out why the stream with small blocks doesn't hit the assertion check, so I ran zstreamdump -v for the reproducer above. The only difference between the two runs is whether the line that enables large blocks is commented out:
zfs set recordsize=1M $TEST_DS/source1
--- small_blocks-stream-dump.txt 2025-11-09 21:42:51.395935243 +1100
+++ large_blocks-stream-dump.txt 2025-11-09 21:42:43.167928599 +1100
@@ -8,23 +8,24 @@
toguid = 0
fromguid = 0
toname = exp/test/source1@snap1
- payloadlen = 460
+ payloadlen = 500
nvlist version: 0
tosnap = snap1
fss = (embedded nvlist)
nvlist version: 0
- 0x1b02f3c3d4fa2e3 = (embedded nvlist)
+ 0x5deb8cd7cc70082a = (embedded nvlist)
nvlist version: 0
name = exp/test/source1
parentfromsnap = 0x0
props = (embedded nvlist)
nvlist version: 0
+ recordsize = 0x100000
(end props)
snaps = (embedded nvlist)
nvlist version: 0
- snap1 = 0xc975be984fbd8e9
+ snap1 = 0x6928a91b8ad1ed46
(end snaps)
snapprops = (embedded nvlist)
@@ -35,19 +36,19 @@
(end snapprops)
- (end 0x1b02f3c3d4fa2e3)
+ (end 0x5deb8cd7cc70082a)
(end fss)
-END checksum = 118f67256f/57bd8457002/1350a3d109def/3895fcccd2b270
+END checksum = 120774305e/648f2311c7b/178dd4a901fad/476e5615874d65
BEGIN record
hdrtype = 1
features = 4
magic = 2f5bacbac
- creation_time = 69106fb2
+ creation_time = 69107008
type = 2
flags = 0xc
- toguid = c975be984fbd8e9
+ toguid = 6928a91b8ad1ed46
fromguid = 0
toname = exp/test/source1@snap1
payloadlen = 0
@@ -57,10 +58,10 @@
WRITE object = 1 type = 21 checksum type = 7 compression type = 0 flags = 0 offset = 0 logical_size = 512 compressed_size = 0 payload_size = 512 props = 200000000 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
FREE object = 1 offset = 512 length = 1024
OBJECT object = 2 type = 19 bonustype = 44 blksz = 131072 bonuslen = 176 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
-FREE object = 2 offset = 393216 length = -1
-WRITE object = 2 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 0 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
-WRITE object = 2 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 131072 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
-WRITE object = 2 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 262144 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = f004b00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
+FREE object = 2 offset = 300032 length = -1
+WRITE object = 2 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 0 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
+WRITE object = 2 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 131072 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
+WRITE object = 2 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 262144 logical_size = 37888 compressed_size = 0 payload_size = 37888 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
FREEOBJECTS firstobj = 3 numobjs = 29
OBJECT object = 32 type = 45 bonustype = 0 blksz = 512 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 32 offset = 512 length = -1
@@ -83,19 +84,19 @@
WRITE object = 36 type = 47 checksum type = 7 compression type = 0 flags = 0 offset = 16384 logical_size = 16384 compressed_size = 0 payload_size = 16384 props = f0006001f salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
FREE object = 36 offset = 32768 length = 16384
FREEOBJECTS firstobj = 37 numobjs = 27
-END checksum = 9466d6b01ca1/a0a225565980b0e8/5a67523caa934a55/7726d21184e587f8
+END checksum = 941141e56b74/6c7a4073420e6855/979e38b8779996bd/afa822331325f63a
END checksum = 0/0/0/0
SUMMARY:
- Total DRR_BEGIN records = 2 (460 bytes)
+ Total DRR_BEGIN records = 2 (500 bytes)
Total DRR_END records = 3 (0 bytes)
Total DRR_OBJECT records = 7 (352 bytes)
Total DRR_FREEOBJECTS records = 2 (0 bytes)
- Total DRR_WRITE records = 10 (429568 bytes)
+ Total DRR_WRITE records = 10 (336384 bytes)
Total DRR_WRITE_BYREF records = 0 (0 bytes)
Total DRR_WRITE_EMBEDDED records = 0 (0 bytes)
Total DRR_FREE records = 12 (0 bytes)
Total DRR_SPILL records = 0 (0 bytes)
Total records = 36
- Total payload size = 430380 (0x6912c)
+ Total payload size = 337236 (0x52554)
Total header overhead = 11232 (0x2be0)
- Total stream length = 441612 (0x6bd0c)
+ Total stream length = 348468 (0x55134)
It looks like when large blocks are enabled on the source, the send stream's final WRITE record uses logical_size = 37888 and payload_size = 37888, while without large blocks it uses logical_size = 131072 and payload_size = 131072. On the receiver, this breaks the assumption that a WRITE record's logical size can only differ from the data block size by being larger (and only because of the large_blocks feature), so it goes into the branch using dmu_write_by_dnode() instead of dmu_lightweight_write_by_dnode():
if (drrw->drr_logical_size != dn->dn_datablksz) {
/*
* The WRITE record is larger than the object's block
* size. We must be receiving an incremental
* large-block stream into a dataset that previously did
* a non-large-block receive. Lightweight writes must
* be exactly one block, so we need to decompress the
* data (if compressed) and do a normal dmu_write().
*/
ASSERT3U(drrw->drr_logical_size, >, dn->dn_datablksz); <- crashing here
Interestingly, dmu_write_by_dnode() is not used anywhere else:
/*
* This interface is not used internally by ZFS but is provided for
* use by Lustre which is built on the DMU interfaces.
*/
int
dmu_write_by_dnode_flags(dnode_t *dn, uint64_t offset, uint64_t size,
const void *buf, dmu_tx_t *tx, uint32_t flags)
{
dmu_buf_t **dbp;
int numbufs;
int error;
if (size == 0)
return (0);
/* Allow Direct I/O when requested and properly aligned */
if ((flags & DMU_DIRECTIO) && zfs_dio_page_aligned((void *)buf) &&
zfs_dio_aligned(offset, size, dn->dn_datablksz)) {
abd_t *data = abd_get_from_buf((void *)buf, size);
error = dmu_write_abd(dn, offset, size, data, DMU_DIRECTIO, tx);
abd_free(data);
return (error);
}
VERIFY0(dmu_buf_hold_array_by_dnode(dn, offset, size,
FALSE, FTAG, &numbufs, &dbp, DMU_READ_PREFETCH));
dmu_write_impl(dbp, numbufs, offset, size, buf, tx);
dmu_buf_rele_array(dbp, numbufs, FTAG);
return (0);
}
int
dmu_write_by_dnode(dnode_t *dn, uint64_t offset, uint64_t size,
const void *buf, dmu_tx_t *tx)
{
return (dmu_write_by_dnode_flags(dn, offset, size, buf, tx, 0));
}
@ahrens could you have a look? You made the changes for large-block send streams, so maybe you know what the correct behavior is.
Well, for the two weeks since I stopped using large blocks the system has been rock solid on 2.3.4. I'm happy that the story is over. I did a couple of migrations to the desired layout of my pools without any issues and I'm using the system normally. My confidence in ZFS and in my machine is recovering, but I no longer trust scrub to detect every data problem, so I have written a small program that reads all files and reports any I/O errors. I run it periodically in addition to scrub for additional safety.
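Something along these lines; a minimal sketch rather than my actual program, it just forces a full read of every regular file and prints the ones that fail:

```sh
#!/bin/bash
# Read every regular file under the given directories and report read errors.
for root in "$@"; do
    find "$root" -xdev -type f -print0 |
    while IFS= read -r -d '' f; do
        dd if="$f" of=/dev/null bs=1M status=none 2>/dev/null ||
            echo "READ ERROR: $f" >&2
    done
done
```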
Retrospectively, I would say the issues started after the 'zfs send' migration. The later problems seem to be caused by corruption that happened during that process, and it manifested in surprisingly flaky sends, accesses past the end of objects, and a pool that became unimportable (with a debug ZFS build) after deleting the affected datasets. The debug assertion failure I was hitting during data recovery has no direct relation to the corruption and seems to be just a bug in split-block handling for a large-block send without -L (see the reproducer above). I tried to find a fix by hacking the sender code, but it looks like the receiver code is wrong: it doesn't expect a split block smaller than datablksz. Hitting this bug in a debug build makes me wonder whether the code path is untested and a potential cause of the hard-to-reproduce problems I've encountered.
Were you ever able to reproduce the corruption when not using the -e option for the send stream?
That stood out to me in your earlier transcripts, and it might be something to isolate if you manage to figure out a way to reproduce it.