illumos-joyent Panic: `hment_remove()` missing in hash table

Early this morning we had a panic on a production system. Here's the basic info from the coredump:

> ::status
debugging crash dump vmcore.1 (64-bit) from redacted
operating system: 5.11 joyent_20240701T205528Z (i86pc)
git branch: release-20240627
git rev: 2f58d204ecaf3ab0c8f65399a0c80861c64dfa48
image uuid: (not set)
panic message: hment_remove() missing in hash table pp=fffffe0048e4a930, ht=fffffeb21cc90928,entry=0x1d6 hash index=0x3da5
dump content: kernel pages only

> ::stack
vpanic()
0xfffffffffb83578c()
hat_pte_unmap+0x16d(fffffeb21cc90928, 1d6, 10, 8000000995d77067, 0, 0)
hat_unload_callback+0xdf(fffffeb3afee2490, 7fffe6000000, 200000, 10, 0)
segvn_unmap+0x56e(fffffeb330a9b9c8, 7fffe6000000, 200000)
as_free+0xe9(fffffeb3d35bba18)
relvm+0x1fc()
proc_exit+0x4a7(3, b)
exit+0x15(3, b)
psig+0x33c()
trap+0x11d5(fffffe00f937cf10, 6f193758, 1)
cmntrap_pushed+0x3c()

> ::stacks
…
fffffeb1f4456880 PANIC    <NONE>                  1
                 apix_setspl+0x22
                 do_splx+0x84
                 _resume_from_idle+0x12b
                 preempt+0xfa
                 kpreempt+0x3c
                 sys_rtt_common+0x208
                 _sys_rtt_ints_disabled+8
                 zone_rm_page+0x1f
                 0xfffffffffb83578c
                 hat_pte_unmap+0x16d
                 hat_unload_callback+0xdf
                 segvn_unmap+0x56e
                 as_free+0xe9
                 relvm+0x1fc
                 proc_exit+0x4a7
                 exit+0x15
                 psig+0x33c
                 trap+0x11d5

> ::msgbuf
…
panic[cpu2]/thread=fffffeb1f4456880:
hment_remove() missing in hash table pp=fffffe0048e4a930, ht=fffffeb21cc90928,entry=0x1d6 hash index=0x3da5
fffffe00f937c980 unix:kpti_tramp_end+2978c ()
fffffe00f937ca10 unix:hat_pte_unmap+16d ()
fffffe00f937cb80 unix:hat_unload_callback+df ()
fffffe00f937cc70 genunix:segvn_unmap+56e ()
fffffe00f937ccd0 genunix:as_free+e9 ()
fffffe00f937cd00 genunix:relvm+1fc ()
fffffe00f937cd80 genunix:proc_exit+4a7 ()
fffffe00f937cda0 genunix:exit+15 ()
fffffe00f937ce20 genunix:psig+33c ()
fffffe00f937cf00 unix:trap+11d5 ()
fffffe00f937cf10 unix:cmntrap+e9 ()
NOTICE: ahci0: ahci_tran_reset_dport port 3 reset port
NOTICE: ahci0: ahci_tran_reset_dport port 0 reset port
NOTICE: ahci0: ahci_tran_reset_dport port 1 reset port
NOTICE: ahci0: ahci_tran_reset_dport port 2 reset port

I found an old, unresolved bug report (illumos bug #1034) with a similar stack trace. As in that report, we ran memtest86+ on this system before putting it into production several months ago, and it didn't indicate any problems.

I can't provide the full coredump since it's a production system that contains customer data, but I'd be happy to run additional mdb commands to provide more info if that would help. (I'm not quite sure where to look next.)
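As a quick consistency check on the output above, here is a minimal Python sketch that pulls the hex fields out of the panic string and compares them with the arguments mdb printed for the hat_pte_unmap frame. The helper names (parse_panic, parse_frame_args) are made up for this illustration, and reading the first two frame arguments as ht and entry is an assumption that the matching values merely suggest.

```python
import re

# Panic string and stack frame copied from the dump output above.
PANIC = ("hment_remove() missing in hash table pp=fffffe0048e4a930, "
         "ht=fffffeb21cc90928,entry=0x1d6 hash index=0x3da5")
FRAME = "hat_pte_unmap+0x16d(fffffeb21cc90928, 1d6, 10, 8000000995d77067, 0, 0)"


def parse_panic(msg):
    """Return the hex fields of the panic message as integers."""
    pattern = (r"pp=([0-9a-f]+),\s*ht=([0-9a-f]+),\s*"
               r"entry=0x([0-9a-f]+)\s+hash index=0x([0-9a-f]+)")
    m = re.search(pattern, msg)
    if m is None:
        raise ValueError("unrecognized panic message")
    names = ("pp", "ht", "entry", "hash_index")
    return {name: int(value, 16) for name, value in zip(names, m.groups())}


def parse_frame_args(frame):
    """Return the arguments of an mdb ::stack frame as integers (mdb prints hex)."""
    inside = frame[frame.index("(") + 1:frame.rindex(")")]
    return [int(arg.strip(), 16) for arg in inside.split(",") if arg.strip()]


fields = parse_panic(PANIC)
args = parse_frame_args(FRAME)

# Assumption: the first two frame arguments are the htable pointer and entry.
print("ht matches frame arg 0:   ", fields["ht"] == args[0])
print("entry matches frame arg 1:", fields["entry"] == args[1])
```

Both comparisons hold for the values in this dump, so the panic message and the stack frame at least agree on which htable and entry were being unmapped.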

Sep 23 '24 03:09 smokris

Is that system using ECC RAM? Memory problems (if this is a memory problem) can appear at any time, so a memtest run from several months ago only tells us that the memory was OK at the time of the test.

Sep 23 '24 08:09 tsoome

Hi, Toomas; thanks for your reply.

Is that system using ECC RAM?

Yes, the CPU (Intel Xeon E-2378) and logic board (Asus P12R-M) are ECC-capable, and its RAM sticks (32GB PC4-25600 3200MHz DDR4 ECC UDIMM) are ECC.

In SmartOS, is there a way to verify that ECC is actually enabled? (I checked sysinfo and prtdiag -v but didn't see any obvious indication.)

Memory problems (if this is a memory problem) can appear at any time, so a memtest run from several months ago only tells us that the memory was OK at the time of the test.

Ah, good point. I'll try to schedule some downtime to run memtest again.

Sep 23 '24 16:09 smokris

Also check fmadm faulty; since it's ECC, you might see something there.

Sep 23 '24 17:09 danmcd

Also check fmadm faulty; since it's ECC, you might see something there.

Thanks; I checked that (and fmdump -v and /var/adm/messages) and didn't find any messages about ECC errors.

Sep 23 '24 17:09 smokris

ok, no ECC errors and two similar stack traces... tbh, I find it hard to believe it is memory corruption. Exactly the same function pointer getting zeroed in those two cases? Anyhow, it means we need to do some digging.

Sep 23 '24 18:09 tsoome

I checked fmadm faulty (and fmdump -v and /var/adm/messages) and didn't find any messages about ECC errors.

Oh! I found a series of Uncorrectable ECC errors in the system's IPMI Event Log, starting about 7 minutes prior to the panic:

# ipmitool sel list
…
10d1 | 09/22/2024 | 11:40:16 | Memory #0xd1 | Uncorrectable ECC | Asserted

So it turns out there is (at least) a hardware problem.
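For anyone wanting to scan a longer event log, here is a rough Python sketch that filters pipe-separated `ipmitool sel list` lines for ECC events. It assumes the six-column layout and MM/DD/YYYY date format seen in the line above; the function names are hypothetical, and other BMCs may print different columns.

```python
from datetime import datetime

# The one SEL line shown above; a real log would have many more.
SAMPLE = "10d1 | 09/22/2024 | 11:40:16 | Memory #0xd1 | Uncorrectable ECC | Asserted"


def parse_sel_line(line):
    """Parse one SEL line into a dict, or return None if it doesn't fit."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 6:
        return None
    rec_id, date, time, sensor, event, state = parts[:6]
    try:
        when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None  # e.g. entries without a usable timestamp
    return {"id": rec_id, "when": when, "sensor": sensor,
            "event": event, "state": state}


def ecc_events(lines):
    """Return the parsed records whose event description mentions ECC."""
    records = (parse_sel_line(line) for line in lines)
    return [rec for rec in records if rec and "ECC" in rec["event"]]


for rec in ecc_events([SAMPLE]):
    print(rec["when"].isoformat(), rec["sensor"], rec["event"], rec["state"])
```

One caveat when comparing these timestamps with the panic time: the SEL uses the BMC's clock, which may not share the OS's time zone or be synchronized with it.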

Considering the timing correlation, this panic now looks likely to have been caused by hardware failure. And since it's only the second known occurrence of this panic in 13 years (the prior being illumos bug #1034 mentioned above), I'll tentatively close this issue, but of course feel free to reopen it if you'd still like to investigate further to see whether there's also a software problem.

Sep 25 '24 18:09 smokris