
zfs send with large blocks is unreliable / data corruption

Open IvanVolosyuk opened this issue 3 months ago • 15 comments

System information

Type                  Version/Name
Distribution Name     Gentoo
Distribution Version  Live
Kernel Version        6.12.51-gentoo
Architecture          x86_64
OpenZFS Version       2.3.4
ECC RAM               Yes

Describe the problem you're observing

Confusing messages from ZFS and broken zstreamdump.

Describe how to reproduce the problem

I don't know how reproducible this is. I copied all the data from one SSD pool to another using zfs send -L -R -e -c ssd@move | zfs recv -F new, destroyed the original pool, renamed new to ssd, and created 'new' again using zpool create new -o ashift=12 -o compatibility=openzfs-2.0-linux raidz .... . Note that the compatibility is lower; I had intended to use it only for the intermediate pool but forgot, so the intermediate pool used all the features from ZFS 2.3.4.

When I ran the same migration command again, zfs send -L -R -e -c ssd@move | zfs recv -F new, it failed with:

cannot receive new filesystem stream: incompatible embedded data stream feature with encrypted receive.

I don't use ZFS encryption on either the source or the destination pool. Most of the datasets copied correctly, but 'ssd/home' did not. When running without the embed and compress flags:

$ zfs send  -L -R  ssd/home@move | zfs recv new/home
cannot receive new filesystem stream: invalid backup stream
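A quick property check should confirm that encryption really isn't enabled anywhere on either side (a sketch, assuming the pool names from the commands above):

$ zfs get -r -t filesystem,volume encryption,keystatus ssd new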

Interestingly, if I just run zfs send -L -R -e -c ssd/home@move | zstreamdump, it produces a valid stream with multiple child datasets (see attached). But if I run:

zfs send -L ssd/home@move | zstreamdump
BEGIN record
        hdrtype = 1
        features = 800004
        magic = 2f5bacbac
        creation_time = 68e6180f
        type = 2
        flags = 0xc
        toguid = 2e95a3e59984549e
        fromguid = 0
        toname = ssd/home@move
        payloadlen = 0
invalid checksum
Incorrect checksum in record header.
Expected checksum = 3d53e21dc5327c23/99576f8530ce791c/940bce62d8b523c2/239a036d344a2d8
SUMMARY:
        Total DRR_BEGIN records = 1 (0 bytes)
        Total DRR_END records = 0 (0 bytes)
        Total DRR_OBJECT records = 134126 (23634344 bytes)
        Total DRR_FREEOBJECTS records = 7076 (0 bytes)
        Total DRR_WRITE records = 216867 (12381282816 bytes)
        Total DRR_WRITE_BYREF records = 0 (0 bytes)
        Total DRR_WRITE_EMBEDDED records = 0 (0 bytes)
        Total DRR_FREE records = 136992 (0 bytes)
        Total DRR_SPILL records = 0 (0 bytes)
        Total records = 495062
        Total payload size = 12404917160 (0x2e36403a8)
        Total header overhead = 154459344 (0x934dcd0)
        Total stream length = 12559376816 (0x2ec98e1b0)

The stream fails at random points: sometimes sooner, sometimes later, and sometimes not at all, in which case it produces a valid stream. The system is very stable, with no CPU issues ever seen and no unexpected crashes in the logs, so I can rule out general instability. The full replication stream with embedding and compression is produced correctly.
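To illustrate how nondeterministic this is, a small loop along these lines (a sketch, assuming the snapshot name used above) reruns the same send a few times and counts the checksum failures:

#!/bin/bash
# Rerun the same non-replication send several times and report
# whether zstreamdump sees a checksum failure on each run.
SNAP=ssd/home@move
for i in $(seq 1 5); do
    if zfs send -L "$SNAP" | zstreamdump 2>&1 | grep -q 'Incorrect checksum'; then
        echo "run $i: checksum failure"
    else
        echo "run $i: clean"
    fi
done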

Scrub returns clean, yet an isolated zfs diff reports errors, and reading the 'logs' directory gives an I/O error. It looks like something got corrupted during a previous send?

 zfs diff ssd/home@2025-10-08 ssd/home@move
M       /home/ivan
M       /home/ivan/experiments/code/temperature-control/target/debug/temperature-server.d
M       /home/ivan/.motd
Unable to determine path or stats for object 296192 in ssd/home@move: Input/output error
-       /home/minecraft/server-irina/logs/2025-05-06-1.log.gz
Unable to determine path or stats for object 296193 in ssd/home@move: Input/output error
-       /home/minecraft/server-irina/logs/2025-04-26-1.log.gz
M       /home/ivan/experiments/code/temperature-control/server_state_v2.bin
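One way to look more closely at the objects that zfs diff cannot resolve (a sketch; the object numbers come from the errors above):

# Dump dnode and block-pointer details for the affected objects in the snapshot.
zdb -dddd ssd/home@move 296192 296193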

Include any warning/errors/backtraces from the system logs

zstreamdump-good.txt

IvanVolosyuk avatar Oct 08 '25 10:10 IvanVolosyuk

Something went terribly wrong. I deleted the datasets and restored from backup and everything seemed to be fine, but now I'm getting:

[80183.934665] PANIC: zfs: accessing past end of object 4ea/b04c9 (size=259072 access=259068+4096)                                          
[80183.934677] Showing stack for process 737938                                                                                             
[80183.934681] CPU: 27 UID: 1000 PID: 737938 Comm: ThreadPoolForeg Tainted: G                T  6.12.52-x86_64-dirty #51                    
[80183.934682] Tainted: [T]=RANDSTRUCT                                                                                                      
[80183.934683] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS, BIOS 3287 09/20/2025                                          
[80183.934684] Call Trace:                                                                                                                  
[80183.934686]  <TASK>                                                                                                                      
[80183.934688]  dump_stack_lvl+0x4b/0x70                                                                                                    
[80183.934693]  vcmn_err.cold+0x54/0x7f                                                                                                     
[80183.934697]  ? dbuf_find+0x23d/0x250                                                                                                     
[80183.934698]  ? arc_access+0x325/0x610                                                                                                    
[80183.934699]  zfs_panic_recover+0x61/0x70                                                                                                 
[80183.934702]  dmu_buf_hold_array_by_dnode+0x20c/0x680                                                                                     
[80183.934703]  ? dnode_hold_impl+0xad6/0x1300                                                                                              
[80183.934704]  dmu_write_uio_dnode+0x85/0x2c0                                                                                              
[80183.934705]  ? dmu_object_set_blocksize+0x58/0x70                                                                                        
[80183.934707]  zfs_write+0x4d1/0xcb0                                                                                                       
[80183.934710]  zpl_iter_write+0x10d/0x170                                                                                                  
[80183.934711]  vfs_write+0x230/0x3e0                                                                                                       
[80183.934713]  __x64_sys_pwrite64+0x76/0xb0                                                                                                
[80183.934714]  do_syscall_64+0x52/0x120                                                                                                    
[80183.934715]  entry_SYSCALL_64_after_hwframe+0x50/0x58                                                                                    
[80183.934717] RIP: 0033:0x7fe5fde78572                                                                                                     
[80183.934718] Code: 08 0f 85 01 49 ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08
 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 83 ec 08                                                                  
[80183.934719] RSP: 002b:00007fe5ef58c3e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000012                                                       
[80183.934721] RAX: ffffffffffffffda RBX: 00007fe5ef58f6c0 RCX: 00007fe5fde78572                                                            
[80183.934721] RDX: 0000000000001000 RSI: 0000278c04b48800 RDI: 0000000000000111                                                            
[80183.934722] RBP: 00007fe5ef58c490 R08: 0000000000000000 R09: 0000000000000000                                                            
[80183.934722] R10: 000000000003f3fc R11: 0000000000000246 R12: 0000000000000111                                                            
[80183.934723] R13: 0000278c000c8100 R14: 0000000000001000 R15: 000000000003f3fc                                                            
[80183.934724]  </TASK>
....
[80364.904002] INFO: task ThreadPoolForeg:737938 blocked for more than 122 seconds.                                                                                                                                                                                                         
[80364.904013]       Tainted: G                T  6.12.52-x86_64-dirty #51                                                                                                                                                                                                                  
[80364.904016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.                                                                                                                                                                                                    
[80364.904018] task:ThreadPoolForeg state:D stack:0     pid:737938 tgid:737888 ppid:737874 flags:0x00000002                                                                                                                                                                                 
[80364.904023] Call Trace:                                                                                                                                                                                                                                                                  
[80364.904025]  <TASK>                                                                                                                                                                                                                                                                      
[80364.904029]  __schedule+0x4ba/0xb50                                                                                                                                                                                                                                                      
[80364.904037]  schedule+0x27/0xe0                                                                                                                                                                                                                                                          
[80364.904040]  vcmn_err.cold+0x69/0x7f                                                                                                                                                                                                                                                     
[80364.904044]  ? dbuf_find+0x23d/0x250                                                                                                                                                                                                                                                     
[80364.904047]  ? arc_access+0x325/0x610                                                                                                                                                                                                                                                    
[80364.904050]  zfs_panic_recover+0x61/0x70                                                                                                                                                                                                                                                 
[80364.904055]  dmu_buf_hold_array_by_dnode+0x20c/0x680                                                                                                                                                                                                                                     
[80364.904057]  ? dnode_hold_impl+0xad6/0x1300                                                                                                                                                                                                                                              
[80364.904060]  dmu_write_uio_dnode+0x85/0x2c0                                                                                                                                                                                                                                              
[80364.904062]  ? dmu_object_set_blocksize+0x58/0x70                                                                                                                                                                                                                                        
[80364.904066]  zfs_write+0x4d1/0xcb0                                                                                                                                                                                                                                                       
[80364.904070]  zpl_iter_write+0x10d/0x170                                                                                                                                                                                                                                                  
[80364.904073]  vfs_write+0x230/0x3e0                                                                                                                                                                                                                                                       
[80364.904077]  __x64_sys_pwrite64+0x76/0xb0                                                                                                                                                                                                                                                
[80364.904079]  do_syscall_64+0x52/0x120                                                                                                                                                                                                                                                    
[80364.904082]  entry_SYSCALL_64_after_hwframe+0x50/0x58                                                                                                                                                                                                                                    
[80364.904085] RIP: 0033:0x7fe5fde78572                                                                                                                                                                                                                                                     
[80364.904088] RSP: 002b:00007fe5ef58c3e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000012                                                                                                                                                                                                       
[80364.904091] RAX: ffffffffffffffda RBX: 00007fe5ef58f6c0 RCX: 00007fe5fde78572                                                                                                                                                                                                            
[80364.904093] RDX: 0000000000001000 RSI: 0000278c04b48800 RDI: 0000000000000111                                                                                                                                                                                                            
[80364.904094] RBP: 00007fe5ef58c490 R08: 0000000000000000 R09: 0000000000000000                                                                                                                                                                                                            
[80364.904096] R10: 000000000003f3fc R11: 0000000000000246 R12: 0000000000000111                                                                                                                                                                                                            
[80364.904098] R13: 0000278c000c8100 R14: 0000000000001000 R15: 000000000003f3fc                                                                                                                                                                                                            
[80364.904100]  </TASK>                                                                                                                                                                                                                                                                     
[80487.784292] INFO: task txg_quiesce:5009 blocked for more than 122 seconds.                                                                                                                                                                                                               
[80487.784301]       Tainted: G                T  6.12.52-x86_64-dirty #51                                                                                                                                                                                                                  
[80487.784303] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.                                                                                                                                                                                                    
[80487.784305] task:txg_quiesce     state:D stack:0     pid:5009  tgid:5009  ppid:2      flags:0x00004000                                                                                                                                                                                   
[80487.784309] Call Trace:                                                                                                                                                                                                                                                                  
[80487.784311]  <TASK>                                                                                                                                                                                                                                                                      
[80487.784316]  __schedule+0x4ba/0xb50                                                                                                                                                                                                                                                      
[80487.784322]  schedule+0x27/0xe0                                                                                                                                                                                                                                                          
[80487.784325]  cv_wait_common+0xdb/0x110                                                                                                                                                                                                                                                   
[80487.784328]  ? prepare_to_wait_event+0x180/0x180                                                                                                                                                                                                                                         
[80487.784332]  txg_quiesce_thread+0x257/0x400                                                                                                                                                                                                                                              
[80487.784335]  ? txg_delay+0x170/0x170                                                                                                                                                                                                                                                     
[80487.784336]  ? current_is_reclaim_thread+0x20/0x20                                                                                                                                                                                                                                       
[80487.784339]  thread_generic_wrapper+0x5a/0x70                                                                                                                                                                                                                                            
[80487.784342]  kthread+0xcf/0x100                                                                                                                                                                                                                                                          
[80487.784344]  ? kthread_park+0x80/0x80                                                                                                                                                                                                                                                    
[80487.784346]  ret_from_fork+0x31/0x50                                                                                                                                                                                                                                                     
[80487.784349]  ? kthread_park+0x80/0x80                                                                                                                                                                                                                                                    
[80487.784351]  ret_from_fork_asm+0x11/0x20                                                                                                                                                                                                                                                 
[80487.784354]  </TASK>                                                                                                                                                                                                                                                                     
[80487.784548] INFO: task ThreadPoolForeg:737938 blocked for more than 245 seconds.                                                                                                                                                                                                         
[80487.784550]       Tainted: G                T  6.12.52-x86_64-dirty #51                                                                                                                                                                                                                  
[80487.784552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.                                                                                                                                                                                                    
[80487.784553] task:ThreadPoolForeg state:D stack:0     pid:737938 tgid:737888 ppid:737874 flags:0x00000002                                                                                                                                                                                 

The system stays functional though. I believe the same failure happens on both 2.2.8 and 2.3.4; this one is from 2.3.4. Something definitely went wrong when I moved the data from SSD to SSD twice. I don't know whether the corruption spread to the HDD archive; ECC RAM didn't seem to help me. :(

IvanVolosyuk avatar Oct 14 '25 13:10 IvanVolosyuk

Actually, I don't know; it might be a different, unrelated problem. I think I identified the file in question and it seems pretty short-lived (already overwritten after a reboot) and can be read without issues. Scrub returned clean:

$ find /home/ivan/.cache -inum 722121
/home/ivan/.cache/google-chrome-config/Default/History-journal

IvanVolosyuk avatar Oct 14 '25 13:10 IvanVolosyuk

I don't know what happened with the data corruption, but I restored from backup and everything worked fine for a while, until I tried to do another data migration. Still the same errors. For some datasets I'm again getting zstreamdump corruption at random points around ~100G:

$ zfs send ssd/home@2025-10-25 | zstreamdump
END checksum = 4f9578acb6b0bb42/fa1b021aea65d39f/4193763805b65e17/a954189c504e795f
$ zfs send ssd/home@2025-10-25 | zstreamdump
Incorrect checksum in record header.
Expected checksum = 4d5a3d2d9634ccdb/a1236048a739b40e/14b0608e84caf759/7035bd097ed11a5

For a much larger one (~1T) there is no corruption:

$ zfs send -L ssd/games/SteamLibrary@2025-10-25 | zstreamdump

The same happens on 2.2.8. I have the data replicated from the striped SSDs to an HDD raidz; I can read the same snapshot from there without an issue, and it passes zstreamdump fine.

No data errors from zpool scrub. Any hints on how I can debug this?

IvanVolosyuk avatar Oct 25 '25 08:10 IvanVolosyuk

zfs_arc_min=8G zfs_arc_max=100G zfs.l2arc_noprefetch=0 zfs_txg_timeout=120 spl.spl_taskq_thread_bind=0 spl.spl_taskq_thread_priority=0 zfs.l2arc_rebuild_blocks_min_l2size=1 zfs.zfs_dmu_offset_next_sync=0 zfs.zfs_bclone_enabled=0 zfs.zfs_dio_enabled=0 zfs.zfs_bclone_wait_dirty=0
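(These appear to be kernel command-line settings for a built-in ZFS; for a module build the equivalent would be roughly the following, a sketch only, with the sizes expanded to bytes:)

# /etc/modprobe.d/zfs.conf -- module-build equivalent of the tunables above (sketch).
options zfs zfs_arc_min=8589934592 zfs_arc_max=107374182400 l2arc_noprefetch=0 zfs_txg_timeout=120 l2arc_rebuild_blocks_min_l2size=1 zfs_dmu_offset_next_sync=0 zfs_bclone_enabled=0 zfs_dio_enabled=0 zfs_bclone_wait_dirty=0
options spl spl_taskq_thread_bind=0 spl_taskq_thread_priority=0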

$ zfs get all ssd/home

NAME      PROPERTY              VALUE                     SOURCE
ssd/home  type                  filesystem                -
ssd/home  creation              Thu Oct 16 21:20 2025     -
ssd/home  used                  108G                      -
ssd/home  available             1.21T                     -
ssd/home  referenced            72.7G                     -
ssd/home  compressratio         1.40x                     -
ssd/home  mounted               yes                       -
ssd/home  quota                 none                      default
ssd/home  reservation           none                      default
ssd/home  recordsize            128K                      default
ssd/home  mountpoint            /home                     local
ssd/home  sharenfs              off                       default
ssd/home  checksum              on                        default
ssd/home  compression           on                        inherited from ssd
ssd/home  atime                 on                        default
ssd/home  devices               on                        default
ssd/home  exec                  on                        default
ssd/home  setuid                on                        default
ssd/home  readonly              off                       default
ssd/home  zoned                 off                       default
ssd/home  snapdir               hidden                    default
ssd/home  aclmode               discard                   inherited from ssd
ssd/home  aclinherit            restricted                default
ssd/home  createtxg             125                       -
ssd/home  canmount              on                        default
ssd/home  xattr                 on                        inherited from ssd
ssd/home  copies                1                         default
ssd/home  version               5                         -
ssd/home  utf8only              off                       -
ssd/home  normalization         none                      -
ssd/home  casesensitivity       sensitive                 -
ssd/home  vscan                 off                       default
ssd/home  nbmand                off                       default
ssd/home  sharesmb              off                       default
ssd/home  refquota              none                      default
ssd/home  refreservation        none                      default
ssd/home  guid                  17780824250341955445      -
ssd/home  primarycache          all                       default
ssd/home  secondarycache        all                       default
ssd/home  usedbysnapshots       277M                      -
ssd/home  usedbydataset         72.7G                     -
ssd/home  usedbychildren        35.0G                     -
ssd/home  usedbyrefreservation  0B                        -
ssd/home  logbias               latency                   default
ssd/home  objsetid              1681                      -
ssd/home  dedup                 off                       default
ssd/home  mlslabel              none                      default
ssd/home  sync                  disabled                  inherited from ssd
ssd/home  dnodesize             legacy                    default
ssd/home  refcompressratio      1.50x                     -
ssd/home  written               24.1M                     -
ssd/home  logicalused           147G                      -
ssd/home  logicalreferenced     106G                      -
ssd/home  volmode               default                   default
ssd/home  filesystem_limit      none                      default
ssd/home  snapshot_limit        none                      default
ssd/home  filesystem_count      none                      default
ssd/home  snapshot_count        none                      default
ssd/home  snapdev               hidden                    default
ssd/home  acltype               off                       inherited from ssd
ssd/home  context               none                      default
ssd/home  fscontext             none                      default
ssd/home  defcontext            none                      default
ssd/home  rootcontext           none                      default
ssd/home  relatime              on                        inherited from ssd
ssd/home  redundant_metadata    all                       default
ssd/home  overlay               on                        default
ssd/home  encryption            off                       default
ssd/home  keylocation           none                      default
ssd/home  keyformat             none                      default
ssd/home  pbkdf2iters           0                         default
ssd/home  special_small_blocks  0                         default
ssd/home  snapshots_changed     Sat Oct 25 18:47:54 2025  -
ssd/home  prefetch              all                       default

$ zpool get all ssd

NAME  PROPERTY                       VALUE                          SOURCE
ssd   size                           3.19T                          -
ssd   capacity                       58%                            -
ssd   altroot                        -                              default
ssd   health                         ONLINE                         -
ssd   guid                           16136043115540145169           -
ssd   version                        -                              default
ssd   bootfs                         -                              default
ssd   delegation                     on                             default
ssd   autoreplace                    off                            default
ssd   cachefile                      -                              default
ssd   failmode                       wait                           default
ssd   listsnapshots                  off                            default
ssd   autoexpand                     off                            default
ssd   dedupratio                     1.00x                          -
ssd   free                           1.31T                          -
ssd   allocated                      1.88T                          -
ssd   readonly                       off                            -
ssd   ashift                         12                             local
ssd   comment                        -                              default
ssd   expandsize                     -                              -
ssd   freeing                        0                              -
ssd   fragmentation                  4%                             -
ssd   leaked                         0                              -
ssd   multihost                      off                            default
ssd   checkpoint                     -                              -
ssd   load_guid                      12418271514860811335           -
ssd   autotrim                       on                             local
ssd   compatibility                  zfs2.0-blake                   local
ssd   bcloneused                     0                              -
ssd   bclonesaved                    0                              -
ssd   bcloneratio                    1.00x                          -
ssd   feature@async_destroy          enabled                        local
ssd   feature@empty_bpobj            active                         local
ssd   feature@lz4_compress           active                         local
ssd   feature@multi_vdev_crash_dump  enabled                        local
ssd   feature@spacemap_histogram     active                         local
ssd   feature@enabled_txg            active                         local
ssd   feature@hole_birth             active                         local
ssd   feature@extensible_dataset     active                         local
ssd   feature@embedded_data          active                         local
ssd   feature@bookmarks              active                         local
ssd   feature@filesystem_limits      enabled                        local
ssd   feature@large_blocks           active                         local
ssd   feature@large_dnode            active                         local
ssd   feature@sha512                 enabled                        local
ssd   feature@skein                  enabled                        local
ssd   feature@edonr                  enabled                        local
ssd   feature@userobj_accounting     active                         local
ssd   feature@encryption             active                         local
ssd   feature@project_quota          active                         local
ssd   feature@device_removal         enabled                        local
ssd   feature@obsolete_counts        enabled                        local
ssd   feature@zpool_checkpoint       enabled                        local
ssd   feature@spacemap_v2            active                         local
ssd   feature@allocation_classes     enabled                        local
ssd   feature@resilver_defer         enabled                        local
ssd   feature@bookmark_v2            active                         local
ssd   feature@redaction_bookmarks    enabled                        local
ssd   feature@redacted_datasets      enabled                        local
ssd   feature@bookmark_written       active                         local
ssd   feature@log_spacemap           active                         local
ssd   feature@livelist               enabled                        local
ssd   feature@device_rebuild         enabled                        local
ssd   feature@zstd_compress          enabled                        local
ssd   feature@draid                  disabled                       local
ssd   feature@zilsaxattr             disabled                       local
ssd   feature@head_errlog            disabled                       local
ssd   feature@blake3                 enabled                        local
ssd   feature@block_cloning          disabled                       local
ssd   feature@vdev_zaps_v2           disabled                       local

IvanVolosyuk avatar Oct 25 '25 09:10 IvanVolosyuk

Example of verbose output of zstreamdump -v 1.

FREE object = 295739 offset = 18446744073709551104 length = 511
OBJECT object = 295740 type = 19 bonustype = 44 blksz = 131072 bonuslen = 192 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 295740 offset = 0 length = -1
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 0 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 131072 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 262144 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 393216 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 524288 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 655360 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 295740 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 786432 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = f007f00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
FREE object = 295740 offset = 917504 length = -1048576
WRITE object = 295740 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 18446744073709420544 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
invalid checksum
Incorrect checksum in record header.
Expected checksum = 3975d21b0ab6c42a/da786bd720cdf74d/ff73fb46e8bd66dd/9b1dc97e2fc41bf
  1. Successful, no checksum errors
WRITE object = 296003 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 5111808 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = f00c700ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
FREE object = 296003 offset = 5242880 length = -5373952
WRITE object = 296003 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 18446744073709420544 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
OBJECT object = 296004 type = 19 bonustype = 44 blksz = 131072 bonuslen = 192 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 296004 offset = 0 length = -1
WRITE object = 296004 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 0 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = f002f00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 296004 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 131072 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = f002f00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
FREE object = 296004 offset = 262144 length = -393216
WRITE object = 296004 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 18446744073709420544 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
Incorrect checksum in record header.
Expected checksum = 3c0ec487f9414230/9f1d025594c06c71/159619241d1d0704/297fe445157a589b

IvanVolosyuk avatar Oct 25 '25 10:10 IvanVolosyuk

Looks like the offset randomly becomes 0xfffffffffffe0000 (18446744073709420544), i.e. -131072 (-128K):

WRITE object = 295740 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 18446744073709420544 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
WRITE object = 296004 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 18446744073709420544 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
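(Those two spellings are the same 64-bit value; a quick check with bash arithmetic, just to illustrate:)

$ printf '%u 0x%x\n' -131072 -131072
18446744073709420544 0xfffffffffffe0000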

IvanVolosyuk avatar Oct 25 '25 10:10 IvanVolosyuk

Looks like data corruption indeed:

$ cat minecraft/server-irina/world/region/r.-1.-4.mca
Segmentation fault

dmesg:
[  114.915734] usercopy: Kernel memory exposure attempt detected from SLUB object 'zio_data_buf_512' (offset 0, size 74080)!
[  114.915752] ------------[ cut here ]------------
[  114.915754] kernel BUG at mm/usercopy.c:102!
[  114.915759] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
[  114.915762] CPU: 11 UID: 0 PID: 10766 Comm: cat Tainted: G                T  6.12.53-x86_64-dirty #81
[  114.915765] Tainted: [T]=RANDSTRUCT
[  114.915766] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS, BIOS 3287 09/20/2025
[  114.915768] RIP: 0010:usercopy_abort+0x78/0x7a
[  114.915773] Code: 16 b4 6d 9e eb 0e 48 c7 c2 63 58 70 9e 48 c7 c7 db 07 6d 9e 56 48 89 fe 48 c7 c7 10 8c 76 9e 51 48 89 c1 41 52 e8 c8 48 ff ff <0f> 0b 48 89 d9 49 89 e8 44 89 e2 31 f6 48 81 e9 00 00 00 9d 48 c7
[  114.915776] RSP: 0018:ffffcc907e0a7c70 EFLAGS: 00010282
[  114.915778] RAX: 000000000000006d RBX: ffff89226ceb1a00 RCX: 0000000000000000
[  114.915780] RDX: ffff8940de0edfc0 RSI: ffff8940de0e0940 RDI: ffff8940de0e0940
[  114.915782] RBP: 0000000000012160 R08: 0000000000000000 R09: ffffcc907e0a7b28
[  114.915784] R10: ffffffff9eb17ea8 R11: 0000000000000003 R12: 0000000000000001
[  114.915785] R13: ffff89226cec3b60 R14: ffffcc907e0a7e78 R15: ffffcc907e0a7de0
[  114.915787] FS:  00007f9d45e8c740(0000) GS:ffff8940de0c0000(0000) knlGS:0000000000000000
[  114.915788] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  114.915790] CR2: 00007f9d45b1c000 CR3: 0000001edbfec000 CR4: 0000000000750ef0
[  114.915792] PKRU: 55555554
[  114.915794] Call Trace:
[  114.915797]  <TASK>
[  114.915799]  __check_heap_object+0x7e/0xa0
[  114.915803]  __check_object_size+0x1ec/0x200
[  114.915807]  zfs_uiomove_iter+0x40/0xc0
[  114.915810]  dmu_read_uio_dnode+0xb1/0x160
[  114.915813]  zfs_read+0x3c7/0x550
[  114.915816]  zpl_iter_read+0x98/0xf0
[  114.915819]  vfs_read+0x1de/0x300
[  114.915822]  ksys_read+0x5d/0xd0
[  114.915824]  do_syscall_64+0x52/0x120
[  114.915827]  entry_SYSCALL_64_after_hwframe+0x50/0x58
[  114.915831] RIP: 0033:0x7f9d45f1f977
[  114.915833] Code: 00 49 89 d0 48 89 fa 4c 89 df e8 54 b8 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 16 5b c3 0f 1f 40 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 e2 e8 23 ff ff ff
[  114.915836] RSP: 002b:00007ffc578556f0 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
[  114.915838] RAX: ffffffffffffffda RBX: 00007f9d45e8c740 RCX: 00007f9d45f1f977
[  114.915840] RDX: 0000000000080000 RSI: 00007f9d45b1d000 RDI: 0000000000000003
[  114.915842] RBP: 0000000000080000 R08: 0000000000000000 R09: 0000000000000000
[  114.915843] R10: 0000000000000000 R11: 0000000000000202 R12: 00007f9d45b1d000
[  114.915845] R13: 0000000000000003 R14: 00007f9d460b0000 R15: 0000000000040000
[  114.915848]  </TASK>
[  114.915849] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq zram lz4_compress lz4_decompress netconsole wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel cls_fw sch_pie sch_sfq sch_htb cls_u32 sch_ingress sch_cake xt_mark xt_DSCP xt_owner ip6table_mangle ip6table_filter ip6_tables ts_bm xt_string iptable_mangle xt_TCPMSS pppoe xt_MASQUERADE pppox ppp_generic slhc 8021q garp mrp xt_LOG nf_log_syslog xt_conntrack xt_length xt_nat xt_set iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter ip_tables ip_set_hash_ip ip_set_hash_net ip_set act_mirred ifb amdgpu amd_atl intel_rapl_msr intel_rapl_common snd_hda_codec_realtek amd64_edac snd_hda_codec_generic edac_mce_amd i2c_algo_bit snd_hda_scodec_component snd_hda_codec_hdmi drm_exec kvm_amd snd_hda_intel drm_suballoc_helper amdxcp snd_intel_dspcfg mfd_core kvm snd_intel_sdw_acpi snd_usb_audio drm_buddy bridge irqbypass snd_hda_codec snd_usbmidi_lib
[  114.915882]  gpu_sched stp crct10dif_pclmul snd_hda_core snd_rawmidi eeepc_wmi drm_ttm_helper llc ghash_clmulni_intel snd_hwdep snd_seq_device asus_wmi ttm sp5100_tco sha512_ssse3 snd_pcm sparse_keymap cfg80211 drm_display_helper sha256_ssse3 snd_timer platform_profile drm_kms_helper rfkill sha1_ssse3 i2c_piix4 wmi_bmof asus_ec_sensors snd video rapl jc42 mc efi_pstore ccp pcspkr k10temp i2c_smbus gpio_amdpt soundcore i2c_designware_platform regmap_i2c wmi gpio_generic joydev i2c_designware_core mac_hid nfsd auth_rpcgss nct6775 nfs_acl nct6775_core lockd drm hwmon_vid grace backlight msr fuse loop sunrpc i2c_core dm_mod configfs nfnetlink efivarfs sd_mod r8169 realtek crc32_pclmul xhci_pci mdio_devres ahci nvme crc32c_intel xhci_hcd libphy libahci nvme_core aesni_intel gf128mul crypto_simd cryptd
[  114.915984] ---[ end trace 0000000000000000 ]---
[  115.041833] RIP: 0010:usercopy_abort+0x78/0x7a
[  115.041847] Code: 16 b4 6d 9e eb 0e 48 c7 c2 63 58 70 9e 48 c7 c7 db 07 6d 9e 56 48 89 fe 48 c7 c7 10 8c 76 9e 51 48 89 c1 41 52 e8 c8 48 ff ff <0f> 0b 48 89 d9 49 89 e8 44 89 e2 31 f6 48 81 e9 00 00 00 9d 48 c7
[  115.041850] RSP: 0018:ffffcc907e0a7c70 EFLAGS: 00010282
[  115.041853] RAX: 000000000000006d RBX: ffff89226ceb1a00 RCX: 0000000000000000
[  115.041855] RDX: ffff8940de0edfc0 RSI: ffff8940de0e0940 RDI: ffff8940de0e0940
[  115.041860] RBP: 0000000000012160 R08: 0000000000000000 R09: ffffcc907e0a7b28
[  115.041865] R10: ffffffff9eb17ea8 R11: 0000000000000003 R12: 0000000000000001
[  115.041867] R13: ffff89226cec3b60 R14: ffffcc907e0a7e78 R15: ffffcc907e0a7de0
[  115.041869] FS:  00007f9d45e8c740(0000) GS:ffff8940de540000(0000) knlGS:0000000000000000
[  115.041871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  115.041872] CR2: 00007f7763fd31a8 CR3: 0000001edbfec000 CR4: 0000000000750ef0
[  115.041874] PKRU: 55555554

No idea how I'm getting these with ECC RAM.

On the backup, the entire directory is toast in recent snapshots, even though the files are old and haven't been touched for half a year:

$ ls minecraft/server-irina/world
ls: reading directory 'minecraft/server-irina/world': Input/output error

No dmesg errors this time. I compiled ZFS built-in with debug, debuginfo, ASAN, and UBSAN. It looks like it doesn't trap on any additional checks, or maybe they are not enabled when ZFS is not actually built as a module. Maybe I should build the entire kernel with ASAN and UBSAN? It looks like only one dataset is affected, so I'll try to migrate all the data off it. Smooth sailing with ZFS for 2 years, and suddenly I don't know how I can contain the corruption. Sounds like I have to copy everything off ZFS to be sure nothing dodgy is happening in RAM.
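(For an out-of-tree module build, the extra checks are switched on at configure time; a sketch, assuming the flag names in recent OpenZFS releases:)

# Build OpenZFS as a module with debugging, debuginfo and sanitizers enabled.
./configure --enable-debug --enable-debuginfo --enable-asan --enable-ubsan
make -j"$(nproc)" && make install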

IvanVolosyuk avatar Oct 26 '25 03:10 IvanVolosyuk

I rsync'd the data out and destroyed the bad dataset, tripping an assert which made the pool importable only read-only:

VERIFY0(ba.ba_phys->bt_uncomp) failed (130048)
PANIC at bptree.c:298:bptree_iterate()
Showing stack for process 5081
CPU: 28 UID: 0 PID: 5081 Comm: txg_sync Tainted: G                T  6.12.53-x86_64-dirty #85
Tainted: [T]=RANDSTRUCT
Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS, BIOS 3287 09/20/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x4b/0x70
 spl_panic+0xd3/0xea
 ? dmu_buf_will_dirty_impl+0xd7/0x310
 ? dnode_partial_zero+0x109/0x180
 ? dnode_free_range+0x3fa/0x8a0
 ? dnode_rele_and_unlock+0x79/0x1e0
 bptree_iterate+0x346/0x4c0
 ? dsl_scan_obsolete_block_cb+0xc0/0xc0
 dsl_process_async_destroys+0xa5/0x720
 dsl_scan_sync+0x175/0xc20
 ? __list_del_entry+0x9/0x20
 ? list_del+0x9/0x20
 spa_sync_iterate_to_convergence+0x141/0x320
 spa_sync+0x2f8/0x8d0
 txg_sync_thread+0x274/0x3c0
 ? txg_is_quiescing+0x50/0x50
 ? spl_assert.constprop.0+0x20/0x20
 thread_generic_wrapper+0x63/0x90
 kthread+0xcf/0x100
 ? kthread_park+0x80/0x80
 ret_from_fork+0x31/0x50
 ? kthread_park+0x80/0x80
 ret_from_fork_asm+0x11/0x20
 </TASK>

IvanVolosyuk avatar Oct 26 '25 07:10 IvanVolosyuk

Interestingly, with another dataset:

  • I can successfully rsync the data out
  • zstreamdump doesn't find any errors
  • But it trips on receive

I'll try to disable the RANDSTRUCT kernel hardening to see if it's related.

The assertion looks pretty strange though.

VERIFY3U(drrw->drr_logical_size, >, dn->dn_datablksz) failed (22528 > 131072)
PANIC at dmu_recv.c:2264:flush_write_batch_impl()
Showing stack for process 10396
CPU: 15 UID: 0 PID: 10396 Comm: receive_writer Tainted: G                T  6.12.52-x86_64-dirty #89
Tainted: [T]=RANDSTRUCT
Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS, BIOS 3287 09/20/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x4b/0x70
 spl_panic+0xd3/0xea
 ? txg_list_add+0xa1/0x100
 ? dsl_dir_dirty+0x38/0x80
 ? dsl_pool_dirty_space+0xbd/0x170
 ? kfree+0x20d/0x280
 flush_write_batch_impl+0x3c1/0x520
 flush_write_batch+0x36/0xa0
 receive_process_record+0xc1/0x350
 receive_writer_thread+0xbb/0x220
 ? spl_thread_exit+0x10/0x10
 ? spl_assert.constprop.0+0x20/0x20
 thread_generic_wrapper+0x63/0x90
 kthread+0xcf/0x100
 ? kthread_park+0x80/0x80
 ret_from_fork+0x31/0x50
 ? kthread_park+0x80/0x80
 ret_from_fork_asm+0x11/0x20
 </TASK>

IvanVolosyuk avatar Oct 26 '25 07:10 IvanVolosyuk

I cannot receive from any dataset where large blocks were used without tripping the assertion. I'm rsync-ing my way out and starting all over from clean snapshots without large blocks. How I ended up here:

  • I had the original pool with large blocks.
  • I had an 'archive' pool where I was sending with -L.
  • I moved the original pool to another one (v2) with -R -L -e -c, and it seemed to have worked fine.
  • I could not send -L -R -e -c from v2 to v3; the corruption started.
  • I restored most of the data from 'archive' to v3 with -L.
  • I kept syncing daily from v3 to archive.
  • I tried to move from v3 to v4, which is when I noticed corruption on v3 and archive.
  • zpool scrub never found any problems on v2, v3, or v4.
  • I compiled ZFS with debug, ASAN, and UBSAN and noticed the assert tripping on receive, which I believe now helps me contain the corruption.
  • I'm rsyncing data from all datasets with large blocks and starting clean, without any large blocks in snapshots. I can send those without tripping the asserts.
  • I'm going to disable the large_blocks feature on the new pool (v5), I guess, to make sure none of this taint stays in my data (see the sketch after this list).
  • I think that if I try to delete the bad dataset from archive it will break it, so I have to migrate all the data off it somehow before cleanup.
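A pool can be created with just that one feature held back while everything else stays enabled, roughly like this (a sketch; pool name from the list above, vdev layout elided):

# Create the new pool with the large_blocks feature disabled from the start.
zpool create -o ashift=12 -o feature@large_blocks=disabled v5 <vdevs...>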

IvanVolosyuk avatar Oct 26 '25 08:10 IvanVolosyuk

Just to be sure: you didn't over/underclock your CPU/RAM? I've played with under/overclocking some time ago (but with non-ECC RAM) and I believe the CPU may have corrupted data in its caches in those extreme cases (I had explicit corruption with Intel on undervolting, and silent corruption with an AMD Ryzen fclk (Infinity Fabric) overclock). But I don't think that's the case here.

gmelikov avatar Oct 26 '25 10:10 gmelikov

You have quite a similar idea to what I had: how can everything be that broken, it must be something with my system. But then how come:

  • scrubs are completely clean?
  • I can recompile Gentoo for hours without issues
  • the first send/recv was clean and took less than an hour (the second one, an hour later, was corrupted as hell and the problems started)
  • and still the scrub was clean?

I have stock clocks on the 7950X3D, and I have ECC RAM (KSM56E46BD8KM-32HA) which is capable of running at 5200 but runs at 3600, since AMD's memory controller struggles with 4 DIMMs. I don't mind that, as I value stability. I had some issues with a PSU which caused SSD disconnects due to voltage drops, but now I have a 1600W PSU which holds voltages really well. I can build the Gentoo world without any issues (this catches instability really well) and have never had any corruption issues with ZFS before. As I was going crazy with this kind of corruption I even ran memtest successfully for 12 hours, but that means nothing, as people say.

I regained my sanity and my trust in ZFS and my system with the debug build. Now, with full ZFS debugging, ASAN, and UBSAN, I'm being really careful not to let the corruption spread. The failed assert clearly gives away that the code path taken is incorrect.

			/*
			 * The WRITE record is larger than the object's block
			 * size.  We must be receiving an incremental
			 * large-block stream into a dataset that previously did
			 * a non-large-block receive.  Lightweight writes must
			 * be exactly one block, so we need to decompress the
			 * data (if compressed) and do a normal dmu_write().
			 */
			ASSERT3U(drrw->drr_logical_size, >, dn->dn_datablksz);

My theory is that ZFS forgets about the large blocks feature after a couple of send/receives, especially if the recordsize is back to 128K (but some large blocks remain, since it was different at some point). All my sends were automated and used the -L flag. And this assert trips for a non-incremental replication stream.

It is not random: the assertion trips in the receiver on any full replication send from any dataset with large blocks that I have (now with or without -L). I believe it is causing the corruption, which manifests in lots of spectacular ways later. I can rsync from those datasets without any issue, and I can send/recv from datasets without large blocks. At this point I think I have contained the corruption and am cleaning up the remaining datasets without large blocks. I might want to keep running debug ZFS from now on.
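One quick way to check whether a dataset still carries blocks larger than 128K is to scan a -L stream for oversized WRITE records (a sketch, using the same zstreamdump fields shown above):

# Count WRITE records whose logical_size exceeds 128K.
zfs send -L ssd/home@move | zstreamdump -v | awk '
    /^WRITE/ { for (i = 1; i < NF; i++) if ($i == "logical_size" && $(i+2) > 131072) n++ }
    END { print n+0, "large-block WRITE records" }'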

IvanVolosyuk avatar Oct 26 '25 14:10 IvanVolosyuk

Maybe related to https://github.com/openzfs/zfs/pull/10383 ? @ahrens maybe you can see something specific ^

gmelikov avatar Oct 27 '25 06:10 gmelikov

I haven't found an easy reproducer for the corruption yet, but I can crash my debug build of ZFS 2.3.4 (Linux 6.12.56) with the following minimal reproducer:

#!/bin/bash

TEST_DS=exp/test

zfs destroy -r -f $TEST_DS
zfs create -p $TEST_DS
zfs create $TEST_DS/source1
zfs set recordsize=1M $TEST_DS/source1
dd if=/dev/urandom of=/$TEST_DS/source1/file bs=300000 count=1
zfs snapshot -r $TEST_DS/source1@snap1
zfs create $TEST_DS/replica
zfs send -R $TEST_DS/source1@snap1 | zfs recv -F $TEST_DS/replica/small

This causes a panic on a debug check:

[  109.375242] VERIFY3U(drrw->drr_logical_size, >, dn->dn_datablksz) failed (37888 > 131072)
[  109.376540] PANIC at dmu_recv.c:2264:flush_write_batch_impl()
[  109.377542] Showing stack for process 1779
[  109.378068] CPU: 21 UID: 0 PID: 1779 Comm: receive_writer Not tainted 6.12.53-x86_64-dirty #93
[  109.378070] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20251019_163104-localhost 04/01/2014
[  109.378071] Call Trace:
[  109.378073]  <TASK>
[  109.378076]  dump_stack_lvl+0x4b/0x70
[  109.378080]  spl_panic+0xd3/0xea
[  109.378082]  ? _raw_spin_unlock+0xe/0x20
[  109.378083]  ? txg_list_add+0xa1/0x100
[  109.378085]  ? dsl_dir_dirty+0x38/0x80
[  109.378087]  ? _raw_spin_unlock+0xe/0x20
[  109.378088]  ? dsl_pool_dirty_space+0xbd/0x170
[  109.378089]  ? kfree+0x20d/0x280
[  109.378091]  flush_write_batch_impl+0x3c1/0x520
[  109.378092]  flush_write_batch+0x36/0xa0
[  109.378093]  receive_process_record+0xc1/0x350
[  109.378094]  receive_writer_thread+0xbb/0x220
[  109.378095]  ? spl_thread_exit+0x10/0x10
[  109.378095]  ? spl_assert.constprop.0+0x20/0x20
[  109.378096]  thread_generic_wrapper+0x63/0x90
[  109.378098]  kthread+0xcf/0x100
[  109.378099]  ? kthread_park+0x80/0x80
[  109.378100]  ret_from_fork+0x31/0x50
[  109.378102]  ? kthread_park+0x80/0x80
[  109.378102]  ret_from_fork_asm+0x11/0x20
[  109.378104]  </TASK>

I tested this in a VM. It's kind of surprising that it crashes; does that mean this code path is not covered by the test suite? To be fair, it doesn't look like the cause of the corruption: this path is related to splitting blocks, and it was producing randomly broken streams only once the corruption was already there in my case above. The source of the corruption was probably something else.
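For anyone wanting to try the reproducer without risking a real pool, a throwaway file-backed pool is enough (a sketch; the exp pool name matches the script above):

# Disposable file-backed pool for the reproducer.
truncate -s 4G /var/tmp/zfs-repro.img
zpool create exp /var/tmp/zfs-repro.img
# ... run the reproducer ...
zpool destroy exp && rm /var/tmp/zfs-repro.img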

I tried to figure out why a stream with small blocks doesn't hit the assertion check, so I ran zstreamdump -v on the stream from the reproducer above, once as written and once with the line that enables large blocks commented out:

zfs set recordsize=1M $TEST_DS/source1
--- small_blocks-stream-dump.txt        2025-11-09 21:42:51.395935243 +1100
+++ large_blocks-stream-dump.txt        2025-11-09 21:42:43.167928599 +1100
@@ -8,23 +8,24 @@
         toguid = 0
         fromguid = 0
         toname = exp/test/source1@snap1
-        payloadlen = 460
+        payloadlen = 500

 nvlist version: 0
         tosnap = snap1
         fss = (embedded nvlist)
         nvlist version: 0
-                0x1b02f3c3d4fa2e3 = (embedded nvlist)
+                0x5deb8cd7cc70082a = (embedded nvlist)
                 nvlist version: 0
                         name = exp/test/source1
                         parentfromsnap = 0x0
                         props = (embedded nvlist)
                         nvlist version: 0
+                                recordsize = 0x100000
                         (end props)

                         snaps = (embedded nvlist)
                         nvlist version: 0
-                                snap1 = 0xc975be984fbd8e9
+                                snap1 = 0x6928a91b8ad1ed46
                         (end snaps)

                         snapprops = (embedded nvlist)
@@ -35,19 +36,19 @@

                         (end snapprops)

-                (end 0x1b02f3c3d4fa2e3)
+                (end 0x5deb8cd7cc70082a)

         (end fss)

-END checksum = 118f67256f/57bd8457002/1350a3d109def/3895fcccd2b270
+END checksum = 120774305e/648f2311c7b/178dd4a901fad/476e5615874d65
 BEGIN record
         hdrtype = 1
         features = 4
         magic = 2f5bacbac
-        creation_time = 69106fb2
+        creation_time = 69107008
         type = 2
         flags = 0xc
-        toguid = c975be984fbd8e9
+        toguid = 6928a91b8ad1ed46
         fromguid = 0
         toname = exp/test/source1@snap1
         payloadlen = 0
@@ -57,10 +58,10 @@
 WRITE object = 1 type = 21 checksum type = 7 compression type = 0 flags = 0 offset = 0 logical_size = 512 compressed_size = 0 payload_size = 512 props = 200000000 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
 FREE object = 1 offset = 512 length = 1024
 OBJECT object = 2 type = 19 bonustype = 44 blksz = 131072 bonuslen = 176 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
-FREE object = 2 offset = 393216 length = -1
-WRITE object = 2 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 0 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
-WRITE object = 2 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 131072 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 200ff00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
-WRITE object = 2 type = 19 checksum type = 7 compression type = 0 flags = 0 offset = 262144 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = f004b00ff salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
+FREE object = 2 offset = 300032 length = -1
+WRITE object = 2 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 0 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
+WRITE object = 2 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 131072 logical_size = 131072 compressed_size = 0 payload_size = 131072 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
+WRITE object = 2 type = 19 checksum type = 2 compression type = 0 flags = 0 offset = 262144 logical_size = 37888 compressed_size = 0 payload_size = 37888 props = 0 salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
 FREEOBJECTS firstobj = 3 numobjs = 29
 OBJECT object = 32 type = 45 bonustype = 0 blksz = 512 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
 FREE object = 32 offset = 512 length = -1
@@ -83,19 +84,19 @@
 WRITE object = 36 type = 47 checksum type = 7 compression type = 0 flags = 0 offset = 16384 logical_size = 16384 compressed_size = 0 payload_size = 16384 props = f0006001f salt = 0000000000000000 iv = 000000000000000000000000 mac = 00000000000000000000000000000000
 FREE object = 36 offset = 32768 length = 16384
 FREEOBJECTS firstobj = 37 numobjs = 27
-END checksum = 9466d6b01ca1/a0a225565980b0e8/5a67523caa934a55/7726d21184e587f8
+END checksum = 941141e56b74/6c7a4073420e6855/979e38b8779996bd/afa822331325f63a
 END checksum = 0/0/0/0
 SUMMARY:
-        Total DRR_BEGIN records = 2 (460 bytes)
+        Total DRR_BEGIN records = 2 (500 bytes)
         Total DRR_END records = 3 (0 bytes)
         Total DRR_OBJECT records = 7 (352 bytes)
         Total DRR_FREEOBJECTS records = 2 (0 bytes)
-        Total DRR_WRITE records = 10 (429568 bytes)
+        Total DRR_WRITE records = 10 (336384 bytes)
         Total DRR_WRITE_BYREF records = 0 (0 bytes)
         Total DRR_WRITE_EMBEDDED records = 0 (0 bytes)
         Total DRR_FREE records = 12 (0 bytes)
         Total DRR_SPILL records = 0 (0 bytes)
         Total records = 36
-        Total payload size = 430380 (0x6912c)
+        Total payload size = 337236 (0x52554)
         Total header overhead = 11232 (0x2be0)
-        Total stream length = 441612 (0x6bd0c)
+        Total stream length = 348468 (0x55134)
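
The two dumps compared above can be produced roughly like this (file names match the diff header; the exact invocation is a sketch):

# large-block case: reproducer exactly as written above
zfs send -R $TEST_DS/source1@snap1 | zstreamdump -v > large_blocks-stream-dump.txt

# small-block case: re-run the reproducer with the 'zfs set recordsize=1M' line
# commented out, then dump the stream again
zfs send -R $TEST_DS/source1@snap1 | zstreamdump -v > small_blocks-stream-dump.txt

diff -u small_blocks-stream-dump.txt large_blocks-stream-dump.txt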

It looks like when large blocks are enabled on the source (but the stream is sent without -L), the single ~293k block of object 2 is split into 131072 + 131072 + 37888 bytes, so the last WRITE record has logical_size = 37888 and payload_size = 37888; without large blocks every WRITE record for that object is a full 131072-byte block. On the receiver that last record breaks the assumption that a WRITE record's logical size can only be larger than the object's data block size (and only when the large_blocks feature is involved), so it goes into the branch that uses dmu_write_by_dnode() instead of dmu_lightweight_write_by_dnode(), and the assertion there fires.

		if (drrw->drr_logical_size != dn->dn_datablksz) {
			/*
			 * The WRITE record is larger than the object's block
			 * size.  We must be receiving an incremental
			 * large-block stream into a dataset that previously did
			 * a non-large-block receive.  Lightweight writes must
			 * be exactly one block, so we need to decompress the
			 * data (if compressed) and do a normal dmu_write().
			 */
			ASSERT3U(drrw->drr_logical_size, >, dn->dn_datablksz); <- crashing here

Interestingly, dmu_write_by_dnode() is not used anywhere else:

/*
 * This interface is not used internally by ZFS but is provided for
 * use by Lustre which is built on the DMU interfaces.
 */
int
dmu_write_by_dnode_flags(dnode_t *dn, uint64_t offset, uint64_t size,
    const void *buf, dmu_tx_t *tx, uint32_t flags)
{
	dmu_buf_t **dbp;
	int numbufs;
	int error;

	if (size == 0)
		return (0);

	/* Allow Direct I/O when requested and properly aligned */
	if ((flags & DMU_DIRECTIO) && zfs_dio_page_aligned((void *)buf) &&
	    zfs_dio_aligned(offset, size, dn->dn_datablksz)) {
		abd_t *data = abd_get_from_buf((void *)buf, size);
		error = dmu_write_abd(dn, offset, size, data, DMU_DIRECTIO, tx);
		abd_free(data);
		return (error);
	}

	VERIFY0(dmu_buf_hold_array_by_dnode(dn, offset, size,
	    FALSE, FTAG, &numbufs, &dbp, DMU_READ_PREFETCH));
	dmu_write_impl(dbp, numbufs, offset, size, buf, tx);
	dmu_buf_rele_array(dbp, numbufs, FTAG);
	return (0);
}

int
dmu_write_by_dnode(dnode_t *dn, uint64_t offset, uint64_t size,
    const void *buf, dmu_tx_t *tx)
{
	return (dmu_write_by_dnode_flags(dn, offset, size, buf, tx, 0));
}

@ahrens can you have a look? You did the changes for large block send streams, so maybe you know what the correct behavior is.

IvanVolosyuk avatar Oct 31 '25 12:10 IvanVolosyuk

Well, in the two weeks since I stopped using large blocks the system has been rock solid on 2.3.4. I'm happy the story is over. I did a couple of migrations to the desired layout of my pools without any issues and am using the system normally. My confidence in ZFS and in my machine is recovering, but I no longer trust scrub to detect every data problem, so I have written a small program which reads all files and reports any IO errors. I run it periodically in addition to scrub for extra safety.
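
Not the actual program, just a minimal shell sketch of the same idea, assuming the pool is mounted under /ssd:

# Read every file on the pool and report anything that fails with an I/O error.
find /ssd -type f -print0 |
while IFS= read -r -d '' f; do
    cat "$f" > /dev/null || echo "READ ERROR: $f"
done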

Retrospectively, I would say the issues started after the 'zfs send' migration. The later problems seem to have been caused by corruption that happened during that process, which manifested as surprisingly flaky sends, reads past the end of an object, and a pool that became unimportable (with a debug ZFS build) after the affected datasets were deleted. The debug assertion failure I kept hitting during data recovery has no direct relation to the corruption and seems to be just a bug in split-block handling for large-block sends without -L (see reproducer above). I tried to find a fix by hacking the sender code, but it looks like the receiver code is what's wrong: it doesn't expect a split block smaller than datablksz. Having this bug in a debug build makes me wonder whether the code path is untested and a potential cause of the hard-to-reproduce problems I've encountered.

IvanVolosyuk avatar Nov 13 '25 12:11 IvanVolosyuk

Were you ever able to reproduce the corruption when not using the -e option for the send stream?

That option stood out to me in your earlier transcripts, and it might be something worth isolating if you manage to figure out a way to reproduce the corruption.

dodexahedron avatar Dec 13 '25 04:12 dodexahedron