ZFS panic in ddt_object_remove(): VERIFY(...) failed during pool operation and boot import on FreeBSD 14.1 with dedup=on
### System information
| Type | Version/Name |
|---|---|
| Distribution Name | FreeBSD |
| Distribution Version | 14.1-RELEASE-p7 |
| Kernel Version | 14.1-RELEASE-p7 |
| Architecture | amd64 |
| OpenZFS Version | zfs-2.2.4-FreeBSD_g256659204, zfs-kmod-2.2.4-FreeBSD_g256659204 |
### Describe the problem you're observing
We are experiencing critical panics on two separate FreeBSD 14.1 systems using OpenZFS 2.2.4. The crashes occur both:
- During normal runtime
- And again during boot, likely when the system attempts to auto-import the pool
The panic message on both systems is identical:
panic: VERIFY(ddt_object_remove(ddt, otype, oclass, dde, tx) == 0) failed
All affected systems are using dedup=on. Once the crash occurs, the systems become unbootable without intervention (boot loop due to panic during pool import).
Additionally, similar symptoms are now emerging on other servers configured similarly.
### Describe how to reproduce the problem
The bug is currently only reproducible on the affected systems. General steps:
- System crashes with a kernel panic in `ddt_object_remove()`
- After reboot, the system crashes again during early boot when importing the pool
### How was the pool originally created?
zpool create zroot raidz /dev/ada0p3 /dev/ada1p3 /dev/ada2p3 /dev/ada3p3
### Include any warning/errors/backtraces from the system logs
Boot log excerpt:
Starting file system checks:
panic: VERIFY(ddt_object_remove(ddt, otype, oclass, dde, tx) == 0) failed
Backtrace (identical across both servers):
panic: VERIFY(ddt_object_remove(...) == 0) failed
ddt_sync()
dsl_scan_sync()
spa_sync()
txg_sync_thread()
...
Stopped at kdb_enter+0x33: movq $0,0xa20612(%rip)
`zpool import -o readonly=on -o failmode=continue -N -f zroot` gives read-only access without a panic.
### Additional notes
- Systems affected: 2 (identical issue), early signs on additional hosts
- All use dedup=on
- `zpool status` showed no degraded or failed devices
- Full logs are unavailable due to the immediate panic, but full screenshots are attached
### Attachments
4 screenshots showing:
- Crash during runtime on both servers
- Crash during boot on both servers
Is there anything common between those two systems/pools? I suppose they are not clones of each other?
Both systems have the same configuration. They are not clones of each other.
I wonder what error is returned there by `ddt_object_remove()`. I wish that instead of `VERIFY(... == 0)` there was `VERIFY0(...)`, so that it would report the value. If it is some EIO, then it could be some corruption, but that is unlikely for two unrelated systems. If it is ENOENT, then I wonder whether it is possible to have something deduped and then removed in the same TXG. I need to look deeper into the code, but I wonder: do you use block cloning on those systems, in case it may be a factor? Any idea what you might be deleting or overwriting when it happens?
PS: Looking closer, my guess about creation and deletion in the same TXG might be wrong. If the entry was not read from disk, then it won't have a type different from `DDT_TYPES`, and so `ddt_object_remove()` won't be called. But if it was read from disk, then it is there and we should be able to delete it. Odd. I don't see how this could happen outside of a metadata I/O error.
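To illustrate the `VERIFY0()` point, a minimal sketch (not actual OpenZFS code; the call and its arguments are taken from the panic message, and the panic text shown in the comments is approximate):

```c
/*
 * Sketch only: why VERIFY0() would be more informative here. The exact
 * wording of the assertion output is approximate.
 */
int err = ddt_object_remove(ddt, otype, oclass, dde, tx);

VERIFY(err == 0);   /* panics with just "VERIFY(err == 0) failed"              */
VERIFY0(err);       /* panics with the value included, e.g. "failed (97 == 0)" */
```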
@robn It seems the DDT code could be more careful about handling ZAP read errors. In both the old and the new versions, `ddt_lookup()` ignores ZAP read errors, which may result in something less recoverable later during table sync. We could probably just disable dedup for new writes and leak space for frees if we can't read the ZAP, though that does not explain how it can happen on two separate systems.
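As a rough sketch of that direction (not a proposed patch; it assumes a `ddt_sync_entry()`-style context where `ddt`, `otype`, `oclass`, `dde` and `tx` are already in scope, and reuses the call from the panic message):

```c
/*
 * Hypothetical handling instead of VERIFY(... == 0): if the DDT ZAP
 * entry cannot be removed (e.g. a checksum error while reading the
 * ZAP), log the error and leak the on-disk entry so spa_sync() can
 * continue instead of panicking. In the same spirit, dedup for new
 * writes could be refused once such an error has been seen.
 */
int error = ddt_object_remove(ddt, otype, oclass, dde, tx);
if (error != 0) {
	zfs_dbgmsg("ddt_object_remove() failed with error %d; "
	    "leaking DDT entry", error);
}
```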
@amotin I'm attaching a screenshot of an import on FreeBSD current + OpenZFS 2.3 (FreeBSD version). We don't use explicit dataset clones (`zfs clone`), but the system actively uses snapshots. In response to the question "Any idea what you might be deleting or overwriting when it happens?": unfortunately, I don't know; many users have access to the system.
97 means ECKSUM, so it really can't read some record from the DDT ZAP. We definitely should improve the error handling there somehow, but that does not explain how we got into this situation. If you are still able to import the pool somehow, have you run a scrub on it? Does it report any errors too?
@amotin On FreeBSD 15 the system hits a kernel panic, while on 14.2 it just hangs; it has been stuck for 10 hours now, even though a scrub usually finishes on this server in around 3 hours.
A scrub does not work on a read-only imported pool. See https://github.com/openzfs/zfs/issues/14481 and https://github.com/openzfs/zfs/issues/17527.
It seems the same panic happened to me a few months ago while I was using FreeBSD 14.1. Recreating the affected dataset helped, and the system is still usable without issues. I'm thinking about what to do if it appears again. Is there a way to mark files as "broken" (just to show them in `zpool status -v`) instead of panicking?
I've started to experience the panic more frequently now (on multiple installations...), so I've been using the following workarounds:
- For `ddt_object_update == 97`: there are a few seconds after the import before the panic, enough to disable dedup (`zfs set dedup=off dataset`). After dedup is disabled, the import becomes possible.
- For `ddt_object_remove`: removing the `VERIFY0()` wrapper from `VERIFY0(ddt_object_remove(ddt, otype, oclass, ddk, tx));` allows the import to succeed again (see the sketch below).
In both cases, a subsequent scrub doesn't find any errors and the pool works without issues.
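For clarity, a minimal sketch of what the second workaround amounts to at the call site (a local hack only, not a fix: the failure is silently ignored and the stale on-disk DDT entry is leaked):

```c
/* Original assertion, which panics on any non-zero return:       */
/*     VERIFY0(ddt_object_remove(ddt, otype, oclass, ddk, tx));   */

/* Locally patched version: the removal is still attempted, but a */
/* failure no longer asserts, so import/sync can proceed.         */
(void) ddt_object_remove(ddt, otype, oclass, ddk, tx);
```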
I'm thinking about the best way to handle these cases:
- Disable DDT updates on checksum errors?
- Force remove the DDT object even if the checksum is incorrect?
Any ideas would be much appreciated.