Panic: `hment_remove()` missing in hash table
Early this morning we had a panic on a production system. Here's the basic info from the coredump:
> ::status
debugging crash dump vmcore.1 (64-bit) from redacted
operating system: 5.11 joyent_20240701T205528Z (i86pc)
git branch: release-20240627
git rev: 2f58d204ecaf3ab0c8f65399a0c80861c64dfa48
image uuid: (not set)
panic message: hment_remove() missing in hash table pp=fffffe0048e4a930, ht=fffffeb21cc90928,entry=0x1d6 hash index=0x3da5
dump content: kernel pages only
> ::stack
vpanic()
0xfffffffffb83578c()
hat_pte_unmap+0x16d(fffffeb21cc90928, 1d6, 10, 8000000995d77067, 0, 0)
hat_unload_callback+0xdf(fffffeb3afee2490, 7fffe6000000, 200000, 10, 0)
segvn_unmap+0x56e(fffffeb330a9b9c8, 7fffe6000000, 200000)
as_free+0xe9(fffffeb3d35bba18)
relvm+0x1fc()
proc_exit+0x4a7(3, b)
exit+0x15(3, b)
psig+0x33c()
trap+0x11d5(fffffe00f937cf10, 6f193758, 1)
cmntrap_pushed+0x3c()
> ::stacks
…
fffffeb1f4456880 PANIC <NONE> 1
apix_setspl+0x22
do_splx+0x84
_resume_from_idle+0x12b
preempt+0xfa
kpreempt+0x3c
sys_rtt_common+0x208
_sys_rtt_ints_disabled+8
zone_rm_page+0x1f
0xfffffffffb83578c
hat_pte_unmap+0x16d
hat_unload_callback+0xdf
segvn_unmap+0x56e
as_free+0xe9
relvm+0x1fc
proc_exit+0x4a7
exit+0x15
psig+0x33c
trap+0x11d5
> ::msgbuf
…
panic[cpu2]/thread=fffffeb1f4456880:
hment_remove() missing in hash table pp=fffffe0048e4a930, ht=fffffeb21cc90928,entry=0x1d6 hash index=0x3da5
fffffe00f937c980 unix:kpti_tramp_end+2978c ()
fffffe00f937ca10 unix:hat_pte_unmap+16d ()
fffffe00f937cb80 unix:hat_unload_callback+df ()
fffffe00f937cc70 genunix:segvn_unmap+56e ()
fffffe00f937ccd0 genunix:as_free+e9 ()
fffffe00f937cd00 genunix:relvm+1fc ()
fffffe00f937cd80 genunix:proc_exit+4a7 ()
fffffe00f937cda0 genunix:exit+15 ()
fffffe00f937ce20 genunix:psig+33c ()
fffffe00f937cf00 unix:trap+11d5 ()
fffffe00f937cf10 unix:cmntrap+e9 ()
NOTICE: ahci0: ahci_tran_reset_dport port 3 reset port
NOTICE: ahci0: ahci_tran_reset_dport port 0 reset port
NOTICE: ahci0: ahci_tran_reset_dport port 1 reset port
NOTICE: ahci0: ahci_tran_reset_dport port 2 reset port
I found an ancient, unsolved bug report (Illumos bug #1034) with a similar stack trace. As in that bug report, we ran memtest86+ on this system before putting it into production several months ago, and it didn't indicate any problems.
I can't provide the full coredump since it's a production system that contains customer data, but I'd be happy to run additional mdb commands to provide more info if that would help. (I'm not quite sure where to look next.)
Is that system using ECC RAM? Memory problems (if it is a memory problem) can appear at any time, so a memtest run from several months ago only tells you that the memory was OK at the time of the test.
Hi, Toomas; thanks for your reply.
Is that system using ECC RAM?
Yes, the CPU (Intel Xeon E-2378) and logic board (Asus P12R-M) are ECC-capable, and its RAM sticks (32GB PC4-25600 3200MHz DDR4 ECC UDIMM) are ECC.
In SmartOS, is there a way to verify that ECC is actually enabled? (I checked sysinfo and prtdiag -v but didn't see any obvious indication.)
Memory problems (if it is a memory problem) can appear at any time, so a memtest run from several months ago only tells you that the memory was OK at the time of the test.
Ah, good point. I'll try to schedule some downtime to run memtest again.
Also check fmadm faulty; since it's ECC, you might see something there.
Thanks; I checked that (and fmdump -v and /var/adm/messages) and didn't find any messages about ECC errors.
ok, no ECC errors and two similar stack traces... tbh, I find it hard to believe it is memory corruption. Exactly the same function pointer getting zeroed in those two cases? Anyhow, it means we need to do some digging.
I checked fmadm faulty (and fmdump -v and /var/adm/messages) and didn't find any messages about ECC errors.
Oh! I found a series of Uncorrectable ECC errors in the system's IPMI Event Log, starting about 7 minutes prior to the panic:
# ipmitool sel list
…
10d1 | 09/22/2024 | 11:40:16 | Memory #0xd1 | Uncorrectable ECC | Asserted
So it turns out there is (at least) a hardware problem.
Considering the timing correlation, this panic now looks likely to have been caused by hardware failure. And since it's only the second known occurrence of this panic in 13 years (the prior being Illumos bug #1034 mentioned above), I'll tentatively close this issue — but of course feel free to reopen it if you'd still like to investigate further to see whether there's also a software problem.
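For anyone retracing this check on similar hardware, here is a minimal sketch of pulling only the ECC events out of the IPMI Event Log. The helper name `sel_ecc_events` is hypothetical, and the sketch assumes the pipe-separated field layout shown in the `ipmitool sel list` output above (ID, date, time, sensor, event, state); other BMCs may format entries differently.

```shell
# Hypothetical helper: reads `ipmitool sel list` output on stdin and prints
# the date, time, sensor, and event of every entry whose event field
# mentions ECC. Assumes the field order seen in this thread:
#   id | date | time | sensor | event | state
sel_ecc_events() {
    awk -F'|' 'tolower($5) ~ /ecc/ {
        # Trim the padding spaces around each field before printing.
        for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)
        print $2, $3, $4, $5
    }'
}
```

It would be used as `ipmitool sel list | sel_ecc_events`, and the timestamps it prints can then be compared by hand against the panic time recorded in the crash dump.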