
ZFS panic in ddt_object_remove(): VERIFY(...) failed during pool operation and boot import on FreeBSD 14.1 with dedup=on

Open · davidlinden02 opened this issue 6 months ago · 10 comments

System information

Type Version/Name
Distribution Name FreeBSD
Distribution Version 14.1-RELEASE-p7
Kernel Version 14.1-RELEASE-p7
Architecture amd64
OpenZFS Version zfs-2.2.4-FreeBSD_g256659204, zfs-kmod-2.2.4-FreeBSD_g256659204

Describe the problem you're observing

We are experiencing critical panics on two separate FreeBSD 14.1 systems using OpenZFS 2.2.4. The crashes occur both:

  • During normal runtime
  • And again during boot, likely when the system attempts to auto-import the pool

The panic message on both systems is identical:

panic: VERIFY(ddt_object_remove(ddt, otype, oclass, dde, tx) == 0) failed

All affected systems are using dedup=on. Once the crash occurs, the systems become unbootable without intervention (boot loop due to panic during pool import).

Additionally, similar symptoms are now emerging on other servers with the same configuration.

Describe how to reproduce the problem

The bug is currently only reproducible on the affected systems. General steps:

  1. System crashes with kernel panic in ddt_object_remove()
  2. After reboot, the system crashes again during early boot when importing the pool

How was the pool originally created?
zpool create zroot raidz /dev/ada0p3 /dev/ada1p3 /dev/ada2p3 /dev/ada3p3

Include any warning/errors/backtraces from the system logs

Boot log excerpt:

Starting file system checks:
panic: VERIFY(ddt_object_remove(ddt, otype, oclass, dde, tx) == 0) failed

Backtrace (identical across both servers):

panic: VERIFY(ddt_object_remove(...) == 0) failed
ddt_sync()
dsl_scan_sync()
spa_sync()
txg_sync_thread()
...
Stopped at kdb_enter+0x33: movq $0,0xa20612(%rip)

zpool import -o readonly=on -o failmode=continue -N -f zroot gives read-only access without a panic.

Additional notes

  • Systems affected: 2 (identical issue), early signs on additional hosts
  • All use dedup=on
  • zpool status showed no degraded or failed devices
  • Full logs are unavailable due to the immediate panic, but screenshots are attached

Attachments

4 screenshots showing:

  • Crash during runtime on both servers
  • Crash during boot on both servers

davidlinden02 · Jul 10 '25 16:07

Is there anything common between those two systems/pools? I suppose they are not clones of each other?

amotin · Jul 11 '25 02:07

Is there anything common between those two systems/pools? I suppose they are not clones of each other?

Both systems have the same configuration. They are not clones of each other.

davidlinden02 · Jul 11 '25 11:07

I wonder what error is returned there by ddt_object_remove(). I wish that instead of VERIFY(... == 0) there were VERIFY0(...), so that it would report the error value. If it is some EIO, then it could be some corruption, but that is unlikely on two unrelated systems. If it is ENOENT, then I wonder whether it is possible to have something deduped and then removed in the same TXG. I need to look deeper into the code, but I wonder: do you use block cloning on those systems, in case it may be a factor? Any idea what you might be deleting/overwriting when it happens?
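
For context, a hedged illustration of the difference (the exact panic strings vary between releases; this is a sketch, not verbatim macro output):

/* VERIFY(x == 0) discards the return value in the panic message: */
VERIFY(ddt_object_remove(ddt, otype, oclass, dde, tx) == 0);
/* panics as: VERIFY(ddt_object_remove(...) == 0) failed */

/* VERIFY0 is built on the VERIFY3 machinery and reports the value: */
VERIFY0(ddt_object_remove(ddt, otype, oclass, dde, tx));
/* panics as something like: VERIFY0(...) failed (0 == 97) */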

PS: Looking closer, my guess about creation and deletion in the same TXG might be wrong. If the entry was not read from disk, then it won't have a type different from DDT_TYPES, and so ddt_object_remove() won't be called. But if it was read from disk, then it is there and we should be able to delete it. Odd. I don't see how it could happen outside of a metadata I/O error.
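
For readers following along, here is a rough paraphrase of the failing call site in ddt_sync_entry() (module/zfs/ddt.c in the 2.2.x tree; simplified from memory, not a verbatim quote):

/*
 * otype/oclass describe where the entry currently lives on disk;
 * otype == DDT_TYPES means the entry was never read from disk.
 */
if (otype != DDT_TYPES &&
    (otype != ntype || oclass != nclass || total_refcnt == 0)) {
	/*
	 * The entry moved to a new type/class (or died), so the old
	 * ZAP record must be deleted. Any error here, including a
	 * checksum error while reading the ZAP, trips the VERIFY
	 * and panics the machine.
	 */
	VERIFY(ddt_object_remove(ddt, otype, oclass, dde, tx) == 0);
}

This matches the reasoning above: the remove path is only reached for entries that were loaded from disk, so a plain ENOENT is unlikely.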

amotin · Jul 11 '25 15:07

@robn It seems the DDT code could be more careful about handling ZAP read errors. In both the old and new versions, ddt_lookup() ignores ZAP read errors, which may result in something less recoverable later during table sync. We could probably just disable dedup for new writes and leak space on frees if we can't read the ZAP. Though that does not explain how it can happen on two separate systems.
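
A purely hypothetical sketch of that policy (the names below are invented for illustration and are not OpenZFS API):

#include <errno.h>

typedef enum { DDT_OP_WRITE, DDT_OP_FREE } ddt_op_t;

/*
 * Degrade gracefully when the DDT ZAP cannot be read: new writes
 * fall back to non-deduped allocation, and frees drop the reference
 * without touching the table (leaking the entry's space) instead of
 * tripping a VERIFY later in ddt_sync().
 */
static int
ddt_handle_zap_error(int zap_error, ddt_op_t op)
{
	if (zap_error == 0)
		return (0);		/* table readable: proceed normally */
	if (op == DDT_OP_WRITE)
		return (EAGAIN);	/* caller writes the block undeduped */
	return (0);			/* free path: leak the entry, stay up */
}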

amotin · Jul 11 '25 17:07

@amotin I'm attaching a screenshot of an import on FreeBSD CURRENT + OpenZFS 2.3 (FreeBSD version). We don't use explicit dataset clones (zfs clone), but the system actively uses snapshots. As for "Any idea what you might be deleting or overwriting when it happens?": unfortunately, I don't know; many users have access to the system.

[screenshot attached]

davidlinden02 · Jul 12 '25 18:07

97 means ECKSUM, so it really cannot read some record from the DDT ZAP. We definitely should improve the error handling there somehow, but that does not explain how we got into this situation. If you are still able to import the pool somehow, have you run a scrub on it? Does it report any errors too?
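
For reference, errno 97 on FreeBSD is EINTEGRITY, which OpenZFS uses for its internal ECKSUM on that platform. A sketch of the per-platform mapping (paraphrased from the compatibility headers; exact locations vary between releases):

/* ECKSUM is not a standard errno; OpenZFS maps it per platform. */
#if defined(__FreeBSD__)
#define	ECKSUM	EINTEGRITY	/* 97 on FreeBSD 12 and later */
#else
#define	ECKSUM	EBADE		/* Linux SPL mapping */
#endif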

amotin · Jul 12 '25 20:07

@amotin On FreeBSD 15 the system hits a kernel panic, while on 14.2 it just hangs; it has been stuck for 10 hours now, even though a scrub on this server usually finishes in around 3 hours.

[two screenshots attached]

davidlinden02 · Jul 13 '25 12:07

Scrub does not work on a read-only imported pool. See https://github.com/openzfs/zfs/issues/14481 and https://github.com/openzfs/zfs/issues/17527.

amotin · Jul 14 '25 22:07

The same panic seems to have happened to me a few months ago while I was using FreeBSD 14.1. Recreating the affected dataset helped, and the system is still usable without issues. I'm thinking about what to do if it appears again. Is there a way to mark files as broken (just to show them in zpool status -v) instead of panicking?

[screenshot attached]

avkarenow · Aug 01 '25 17:08

I've started to experience this panic more frequently now (on multiple installations...), so I've been using the following workarounds:

For ddt_object_update == 97: the import gives me a few seconds before the panic, enough to disable dedup (zfs set dedup=off dataset). After dedup is disabled, the import becomes possible.

For ddt_object_remove: removing the VERIFY0() wrapper from VERIFY0(ddt_object_remove(ddt, otype, oclass, ddk, tx)); (i.e., ignoring the return value) allows the import to succeed again.
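
In code terms, the workaround amounts to something like the following at that call site (a sketch, not a recommended fix; logging via zfs_dbgmsg() is my addition; if the ZAP record really is unreadable, the stale DDT entry and its space are silently leaked):

/* Instead of: VERIFY0(ddt_object_remove(ddt, otype, oclass, ddk, tx)); */
int err = ddt_object_remove(ddt, otype, oclass, ddk, tx);
if (err != 0)
	zfs_dbgmsg("ddt_object_remove failed: %d (entry leaked)", err);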

In both cases, a scrub doesn't find any errors and the pool works without issues.

I'm thinking about the best way to handle these cases:

  1. Disable DDT updates on checksum errors?
  2. Force-remove the DDT object even if the checksum is incorrect?

Any ideas would be much appreciated.

avkarenow · Dec 05 '25 16:12