illumos-joyent

Panic: NULL pointer dereference in ZFS dva_get_dsize_sync

Open smokris opened this issue 5 years ago • 6 comments

One of my systems recently experienced some minor disk corruption, and panicked:

> ::status
debugging crash dump vmcore.0 (64-bit) from […]
operating system: 5.11 joyent_20181011T004530Z (i86pc)
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe000e652780 addr=2a8 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $C
fffffe000e652890 dva_get_dsize_sync+0x5d(fffffe0910374000, fffffe0a64cd35c0)
fffffe000e6528e0 bp_get_dsize_sync+0xab(fffffe0910374000, fffffe0a64cd35c0)
fffffe000e652950 dbuf_write_ready+0x71(fffffe0a64cd3450, fffffe0916b548f0, fffffe09353e8ec8)
fffffe000e6529c0 arc_write_ready+0x120(fffffe0a64cd3450)
fffffe000e652a20 zio_ready+0x5b(fffffe0a64cd3450)
fffffe000e652a50 zio_execute+0x7f(fffffe0a64cd3450)
fffffe000e652b10 taskq_thread+0x2d0(fffffe090fde64c8)
fffffe000e652b20 thread_start+8()

I'm not able to reproduce the panic (I've since rolled back a few transactions in order to get the pool running again), but I do have the vmdump.

The stack trace is similar to https://github.com/zfsonlinux/zfs/issues/1891, which references https://www.illumos.org/issues/5349. I confirmed that the fix to 5349 is present in the illumos-joyent fork (f63ab3d)… so it's surprising that this panic happened due to a NULL dereference rather than one of the more specific panic messages in zfs_blkptr_verify.
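One reading of the panic above, offered as a sketch and not a confirmed diagnosis: if the vdev id stored in the DVA does not match any top-level vdev, the lookup returns NULL, and reading a field through that NULL pointer faults at the field's offset, which would explain addr=2a8. The model below is hypothetical throughout. The `model_*` names are invented for illustration, and the padding is chosen only so that the field lands at 0x2a8 to match the panic message; it is not the real `vdev_t` layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for vdev_t: padding is picked to match addr=2a8. */
typedef struct model_vdev {
	char		mv_pad[0x2a8];
	uint64_t	mv_deflate_ratio;
} model_vdev_t;

/* Hypothetical stand-in for the pool's array of top-level vdevs. */
typedef struct model_spa {
	model_vdev_t	**ms_children;
	uint64_t	ms_nchildren;
} model_spa_t;

/* Returns NULL when the id is out of range (e.g. taken from a corrupt DVA). */
model_vdev_t *
model_lookup_top(const model_spa_t *spa, uint64_t id)
{
	assert(spa != NULL);
	return (id < spa->ms_nchildren ? spa->ms_children[id] : NULL);
}

/* Address that a read of mv_deflate_ratio would touch; no dereference here. */
uintptr_t
model_fault_addr(const model_vdev_t *vd)
{
	return ((uintptr_t)vd + offsetof(model_vdev_t, mv_deflate_ratio));
}

/*
 * Checked variant: reports failure (-1) instead of dereferencing NULL.
 * The shift by 9 converts bytes to 512-byte units (SPA_MINBLOCKSHIFT).
 */
int
model_dsize_checked(const model_spa_t *spa, uint64_t id, uint64_t asize,
    uint64_t *dsize)
{
	model_vdev_t *vd = model_lookup_top(spa, id);

	if (vd == NULL)
		return (-1);
	*dsize = (asize >> 9) * vd->mv_deflate_ratio;
	return (0);
}
```

With a NULL vdev pointer, `model_fault_addr(NULL)` evaluates to 0x2a8 on common platforms, which is the shape of fault reported in the `::status` output. Whether the real code path reaches the lookup with a bad vdev id can only be confirmed from the dump.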

smokris, Mar 05 '19 20:03

Has anyone reached out to get the dump that occurred so we can try to better investigate what's going on? In terms of minor disk corruption, do you have any idea what the source of that was?

rmustacc, Mar 11 '19 23:03

> Has anyone reached out to get the dump that occurred so we can try to better investigate what's going on?

No, not yet.

> In terms of minor disk corruption, do you have any idea what the source of that was?

Prior to the panic, zpool scrub had reported a few checksum errors. This pool hasn't had panics or other unclean shutdowns prior to that, so I assume it's just typical bit rot.

smokris, Mar 12 '19 17:03

Ack, it just happened again. Is there any particular info that would be helpful for me to collect, before I try rolling back transactions again?

smokris, Mar 15 '19 18:03

After the March 15 panic, I'm unable to import the pool, even if I roll back all available TXGs:

zpool import -o readonly=on           zones   # panic: "blkptr at … has invalid CHECKSUM 0"
zpool import -o readonly=on -T 515235 zones   # panic: "blkptr at … has invalid CHECKSUM 0"
zpool import -o readonly=on -T 515234 zones   # "one or more devices is currently unavailable"
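The "invalid CHECKSUM 0" text looks like the kind of range check that the block-pointer verification mentioned earlier performs: a block pointer on disk should name a concrete checksum function, so the property-level values 0 (inherit) and 1 (on) are rejected along with anything past the end of the table. A minimal sketch of that check follows, with assumptions stated plainly: the `model_*` names are hypothetical, and the upper bound of 10 is a placeholder, since the real table size depends on the ZFS version.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical constants. 0 and 1 mirror the "inherit" and "on" checksum
 * settings; the upper bound is an assumed placeholder, not the real count.
 */
enum {
	MODEL_CHECKSUM_INHERIT = 0,
	MODEL_CHECKSUM_ON = 1,
	MODEL_CHECKSUM_FUNCTIONS = 10
};

/* True only for values that could name a concrete checksum function. */
bool
model_checksum_valid(uint64_t checksum)
{
	return (checksum > MODEL_CHECKSUM_ON &&
	    checksum < MODEL_CHECKSUM_FUNCTIONS);
}
```

Under this model a checksum field of 0 fails the check, which is what a zeroed or otherwise damaged block pointer would be expected to produce.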

I also tried booting into Ubuntu Server 18.04 to see whether ZoL could import the pool (it couldn't; same panic), and whether https://gist.github.com/jshoward/5685757 could destroy the bad blocks so that the pool would import (it destroyed them, but the import still panicked).

Lacking other recovery options, I destroyed and rebuilt the pool. That took the core dump with it, so I no longer have information beyond what I've already posted on this issue.

smokris, Mar 25 '19 15:03

This afternoon it panicked again, with the same panic message and stack trace as in the original post, so I have another core dump available (which I've copied to another system, in case the pool gets corrupted again).

smokris, Mar 29 '19 18:03

Hi @smokris. Thanks for filing this issue, and sorry for the delay in tracking this down. It'd be great if we could get access to one of the crash dumps from this system. I'll follow up with you over email with instructions on how to get that dump to us.

KodyKantor, Apr 05 '19 14:04