hostboot Sudden machine hang / GUARD event when using amdgpu under Linux 5.1.16

When using QEMU vfio (Kernel 5.1, QEMU 4 GIT master) with VFIO PCIe passthrough and an AMD GPU (amdgpu driver), the test machine hard locks and GUARD entries are created for various slices (e.g. EQ0, EQ1, etc.) on subsequent reboot, gradually removing all functional slices from the system:

 10.82404|ISTEP  6. 9 - host_gard
 12.42766|================================================
 12.44729|Error reported by prdf (0xE500) PLID 0x9000009E
 12.44730|  PRD Signature            : 0x60000 0xC6D10010
 12.47015|  Signature Description    : pu.ex:k0:n0:s0:p00:c0 (L2FIR[16]) Cache line inhibited hit cacheable space
 12.49825|  UserData1   : 0x0006000000000101
 12.49826|  UserData2   : 0xc6d1001000000000
 12.49827|------------------------------------------------
 12.51947|  Callout type             : Hardware Callout
 12.51948|  CPU id                   : 4
 12.51950|  Target                   : Physical:/Sys0/Node0/Proc0/EQ0/EX0
 12.51951|  Deconfig State           : NO_DECONFIG
 12.51952|  GARD Error Type          : GARD_Fatal
 12.51952|  Priority                 : SRCI_PRIORITY_MED
 12.51953|------------------------------------------------
 12.51954|
 12.51954|------------------------------------------------
 12.51955|  System checkstop occurred during runtime on previous boot
 12.51956|------------------------------------------------
 12.51956|
 12.51957|------------------------------------------------
 12.51958|  Hostboot Build ID: hostboot-3beba24/hbicore.bin
 12.51959|================================================

The PCIe device being passed through is on Proc1, not Proc0, and the QEMU guest kernel panics if QEMU is not pinned to Proc1.

Is this likely to indicate a defective CPU or a hostboot / kernel bug?

EDIT: Apparently it's related to just the amdgpu driver on 5.1.16 -- still, hostboot shouldn't GUARD out slices based on a kernel bug like this. It makes recovery a lot harder and gives a false impression of a failing CPU.

Jul 07 '19 00:07 madscientist159

I've encountered this issue using an amd gpu connected to a P9 host (no QEMU/VFIO) running kernel 5.1.15. Occasionally playing videos will hard reset the system and GUARD out cores.

The issue does not seem present with kernels 4.19.52 and 5.0.9 (the two other versions I happen to have on my system).

Jul 07 '19 01:07 shawnanastasio

FYI - This is being discussed by our RAS team this week. Basically the reason for the guard action is because many decisions were made assuming a solid hypervisor rather than a development kernel. Not all of the error paths have been scrubbed to assume a buggy kernel vs hardware failures.

Jul 09 '19 17:07 dcrowell77

I was able to get the machine to hang (randomly after about 2 days of uptime, didn't need to do anything like play video or use qemu) because of apparently the same amdgpu bug on 5.1.17 (with Radeon Pro WX 5100), but it didn't seem to GUARD out anything, I have a Talos 2 Lite with a single 18-core CPU and the latest stable firmware package 1.06. This is on serial console after I attempt a restart https://gist.github.com/q66/e26e10e3cd12ffc991d627351ce9d6a1

Jul 14 '19 16:07 q66

@q66 @shawnanastasio is there an upstream kernel bug report tracking this issue yet?

Jul 15 '19 21:07 madscientist159

yes: https://bugzilla.kernel.org/show_bug.cgi?id=204145

Jul 15 '19 21:07 q66

hostboot hostboot copied to clipboard

Sudden machine hang / GUARD event when using amdgpu under Linux 5.1.16

hostboot
hostboot copied to clipboard