hostboot
hostboot copied to clipboard
Sudden machine hang / GUARD event when using amdgpu under Linux 5.1.16
When using QEMU vfio (Kernel 5.1, QEMU 4 GIT master) with VFIO PCIe passthrough and an AMD GPU (amdgpu driver), the test machine hard locks and GUARD entries are created for various slices (e.g. EQ0, EQ1, etc.) on subsequent reboot, gradually removing all functional slices from the system:
10.82404|ISTEP 6. 9 - host_gard
12.42766|================================================
12.44729|Error reported by prdf (0xE500) PLID 0x9000009E
12.44730| PRD Signature : 0x60000 0xC6D10010
12.47015| Signature Description : pu.ex:k0:n0:s0:p00:c0 (L2FIR[16]) Cache line inhibited hit cacheable space
12.49825| UserData1 : 0x0006000000000101
12.49826| UserData2 : 0xc6d1001000000000
12.49827|------------------------------------------------
12.51947| Callout type : Hardware Callout
12.51948| CPU id : 4
12.51950| Target : Physical:/Sys0/Node0/Proc0/EQ0/EX0
12.51951| Deconfig State : NO_DECONFIG
12.51952| GARD Error Type : GARD_Fatal
12.51952| Priority : SRCI_PRIORITY_MED
12.51953|------------------------------------------------
12.51954|
12.51954|------------------------------------------------
12.51955| System checkstop occurred during runtime on previous boot
12.51956|------------------------------------------------
12.51956|
12.51957|------------------------------------------------
12.51958| Hostboot Build ID: hostboot-3beba24/hbicore.bin
12.51959|================================================
The PCIe device being passed through is on Proc1, not Proc0, and the QEMU guest kernel panics if QEMU is not pinned to Proc1.
Is this likely to indicate a defective CPU or a hostboot / kernel bug?
EDIT: Apparently it's related to just the amdgpu driver on 5.1.16 -- still, hostboot shouldn't GUARD out slices based on a kernel bug like this. It makes recovery a lot harder and gives a false impression of a failing CPU.
I've encountered this issue using an amd gpu connected to a P9 host (no QEMU/VFIO) running kernel 5.1.15. Occasionally playing videos will hard reset the system and GUARD out cores.
The issue does not seem present with kernels 4.19.52 and 5.0.9 (the two other versions I happen to have on my system).
FYI - This is being discussed by our RAS team this week. Basically the reason for the guard action is because many decisions were made assuming a solid hypervisor rather than a development kernel. Not all of the error paths have been scrubbed to assume a buggy kernel vs hardware failures.
I was able to get the machine to hang (randomly after about 2 days of uptime, didn't need to do anything like play video or use qemu) because of apparently the same amdgpu bug on 5.1.17 (with Radeon Pro WX 5100), but it didn't seem to GUARD out anything, I have a Talos 2 Lite with a single 18-core CPU and the latest stable firmware package 1.06. This is on serial console after I attempt a restart https://gist.github.com/q66/e26e10e3cd12ffc991d627351ce9d6a1
@q66 @shawnanastasio is there an upstream kernel bug report tracking this issue yet?
yes: https://bugzilla.kernel.org/show_bug.cgi?id=204145