zfs send -nv overflow
zrepl uses `zfs send -nv` to estimate the send size and visualize it in the UI.
Users reported a legitimate zrepl bug where we incorrectly parse the size estimate as an `int64_t`, whereas it is actually a `uint64_t`.
However, upon inspection, the reported size estimate is clearly not correct, e.g., 18446744073614344448 bytes.
This suggests some kind of integer overflow within the size estimation code.
The dry send in question is incremental (snap -> snap), most likely resumed.
The issue seems to affect OpenZFS 2.0 and newer. I believe all affected users are on Linux, but that is likely just coincidental.
We currently have a user ( @3nprob ) who is able to let the situation sit for a few days so that a ZFS dev familiar with the `zfs send` code can tell them what `zdb` commands to run: https://github.com/zrepl/zrepl/issues/463#issuecomment-865514363
My current theory is that there is a bug in `zfs_send_resume_impl` where the `raw` and `saved` flags aren't copied into the flags that are used for size estimation. That bug definitely exists; the only question is whether it's the root cause of this particular problem.
@3nprob, are you able to try the commit above ( https://github.com/openzfs/zfs/pull/12266 ) on your system?
I'm seeing if I can build it now, will report back.
Built 2.0.4 with #12266 cherry-picked and running on the receiver. No difference.
I assume the issue is on the sender side here, but unfortunately the senders where this happens are production systems, so I'm a bit hesitant to install a custom build (following the Debian build instructions doesn't result in the same set of packages as the deb repos), so I guess I have to wait for a packaged release.
Thanks for the support either way (:
It has to run on the sender, the patch affects only the send path.
You could look into mechanisms to make DKMS apply custom patches (assuming you are running the Debian DKMS package).
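One such mechanism is DKMS's own patch support: `dkms.conf` accepts a `PATCH[]` array, and patches listed there are applied from a `patches/` subdirectory of the module source tree before each build. A sketch (the version number and patch filename are assumptions for illustration; adjust to your installed zfs-dkms version, and note that package updates will overwrite these edits):

```sh
# /usr/src/zfs-2.0.3/dkms.conf (excerpt; hypothetical version number)
# Place the fix in the patches/ subdirectory first, e.g.:
#   /usr/src/zfs-2.0.3/patches/12266-resume-flags.patch
PATCH[0]="12266-resume-flags.patch"
```

Then `dkms build -m zfs -v 2.0.3 && dkms install -m zfs -v 2.0.3` rebuilds the module with the patch applied.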
I ran into this issue when replicating ZFS snapshots from my workstation (running Fedora 34) to a local NAS system (running Debian with ZFS 2.0.3). The send/receive of the snapshots is managed by zrepl.
The issue of the incorrect size estimate initially occurred with ZFS 2.0.5 on the client (i.e. sender), but it persists even after an update to ZFS 2.1.0 (both installed via the ZFS yum/dnf repo).
@pcd1193182 any ideas what else could be going wrong here? @problame suggested (https://github.com/zrepl/zrepl/issues/463#issuecomment-882397636) to restart the transfer by removing the partially received state, but I thought to first check here whether there is something else that I could try to debug this issue.
@awehrfritz just to be clear: did the sending-side ZFS kernel module actually include the patch https://github.com/openzfs/zfs/pull/12266/commits/5d1f961e9268d79f5ddc70521848ce5986b136be ?
Yes, I stated this already in my previous comment (https://github.com/zrepl/zrepl/issues/463#issuecomment-882026080): the patch (https://github.com/openzfs/zfs/commit/620389d32f6bd5676fd0797f2e3314793db6f1e2) was backported to the 2.1-release branch prior to the 2.1.0 release.
I recently updated my Fedora workstation (i.e. the sending side in my setup) to version 2.1.0 from the `zfs-testing` repo provided by OpenZFS. (I also checked the source code in the zfs-dkms package and can confirm that the above patch is indeed included!)
@awehrfritz It would be useful if we could get the output of the same commands I asked for in the other thread (https://github.com/zrepl/zrepl/issues/463#issuecomment-866192806); obviously replace the dataset names with the ones that are having issues in your setup.
Thanks @pcd1193182 for your advice. Below is some of the output you requested. For the first two commands, the output would be huge (a few hundred MB), so I reduced the verbosity and truncated it. I can make the full output available to you via a private link if you really need it.
I am not sure what you are looking for, but most of this seems reasonable to me. Specifically, the size of the dataset to be sent is 118 MB, which shouldn't be a problem. The only thing that stuck out were 3 data errors reported in the `zdb` output; I'm not sure what to make of them. The last scrub (about 2 weeks ago) didn't find any errors.
```
root@sender:~# zdb -v hpool/home/arminw@zrepl_20210705_144637_000
Dataset hpool/home/arminw@zrepl_20210705_144637_000 [ZPL], ID 1156, cr_txg 13963186, 77.2G, 507999 objects

     Object  lvl  iblk   dblk   dsize  dnsize  lsize    %full  type
          0    6  128K    16K    196M     512   688M    36.05  DMU dnode
         -1    1  128K     1K       0     512     1K   100.00  ZFS user/group/project used
         -2    1  128K     2K       0     512     2K   100.00  ZFS user/group/project used
         -3    1  128K    512       0     512    512   100.00  ZFS user/group/project used
          1    1  128K    512      8K     512    512   100.00  ZFS master node
          2    2  128K    16K     32K     512    32K   100.00  ZFS directory
    ...
    1393288    1  128K    512       0     512    512   100.00  ZFS directory
    1393289    1  128K    512      4K     512    512   100.00  ZFS plain file

    Dnode slots:
	Total used:        507999
	Max used:         1393289
	Percent empty:  63.539582

ZFS_DBGMSG(zdb) START:
spa.c:5155:spa_open_common(): spa_open_common: opening hpool/home/arminw@zrepl_20210705_144637_000
spa_misc.c:418:spa_load_note(): spa_load(hpool, config trusted): LOADING
vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/nvme-PC300_NVMe_SK_hynix_512GB_XXXXXXXXXXXXXXXXX-part5': best uberblock found for spa hpool. txg 14427548
spa_misc.c:418:spa_load_note(): spa_load(hpool, config untrusted): using uberblock with txg=14427548
spa_misc.c:418:spa_load_note(): spa_load(hpool, config trusted): spa_load_verify found 0 metadata errors and 3 data errors
spa.c:8336:spa_async_request(): spa=hpool async request task=2048
spa_misc.c:418:spa_load_note(): spa_load(hpool, config trusted): LOADED
ZFS_DBGMSG(zdb) END
```
```
root@sender:~# zfs send -Lce -i hpool/home/arminw@zrepl_20210705_144637_000 hpool/home/arminw@zrepl_20210705_154637_000 | zstreamdump
BEGIN record
	hdrtype = 1
	features = 430004
	magic = 2f5bacbac
	creation_time = 60e3295d
	type = 2
	flags = 0xc
	toguid = 3609ae3707cc9c2a
	fromguid = a16cb0796bae06b
	toname = hpool/home/arminw@zrepl_20210705_154637_000
	payloadlen = 0
END checksum = bd6c1df4a3163d/c19fcb7472cef3a3/a0928d305f8b84a6/4f51d122ea6926b3
SUMMARY:
	Total DRR_BEGIN records = 1 (0 bytes)
	Total DRR_END records = 1 (0 bytes)
	Total DRR_OBJECT records = 3904 (1092424 bytes)
	Total DRR_FREEOBJECTS records = 986 (0 bytes)
	Total DRR_WRITE records = 2393 (125101568 bytes)
	Total DRR_WRITE_BYREF records = 0 (0 bytes)
	Total DRR_WRITE_EMBEDDED records = 21 (1312 bytes)
	Total DRR_FREE records = 4043 (0 bytes)
	Total DRR_SPILL records = 0 (0 bytes)
	Total records = 11349
	Total payload size = 126195304 (0x7859668)
	Total header overhead = 3540888 (0x360798)
	Total stream length = 129736192 (0x7bb9e00)
```
```
root@sender:~# zfs send -nv -Lce -i hpool/home/arminw@zrepl_20210705_144637_000 hpool/home/arminw@zrepl_20210705_154637_000
send from hpool/home/arminw@zrepl_20210705_144637_000 to hpool/home/arminw@zrepl_20210705_154637_000 estimated size is 118M
total estimated size is 118M
```
```
root@receiver:~# zfs get -o value receive_resume_token storage/backups/linux-aw0/hpool/home/arminw
VALUE
1-14b724fe22-100-789c636064000310a501c49c50360710a715e5e7a69766a6304041f6835dd3d84f8b712900d9ec48eaf293b252934b1818ae45f080d561c8a7a515a7968064aec1e5d990e4932a4b528b8174c23e312e6cfa4bf221aed09a7386dd7c1da79907923c27583e2f3137958121a3203f3f473f233f37553fb1283733afdca1aa28b52027dec8c0c8d0c0dcc034ded0d4c4ccd83cdec0c000e6250600fad9281b
```
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Not sure whether Paul's change fixed this.
Reopening since it's unclear if this was entirely resolved.