Host crashing after passthrough in proxmox

Open joshtwc opened this issue 1 month ago • 4 comments

Long story short, I am setting up a Home Assistant/Frigate VM and need to pass the dual Edge TPU through to Frigate. I have come very close: the TPU appears in Home Assistant and in Frigate, but after some time it crashes the host (an HP ProLiant DL380 G10 running Proxmox 8.2) with the following error messages in iLO:

Uncorrectable Machine Check Exception (Processor 1, APIC ID 0x00000000, Bank 0x00000006, Status 0xBB800000'00000E0B, Address 0x00000000'00000000, Misc 0x00000000'36000000).
Uncorrectable PCI Express Error Detected. Slot 2 (Segment 0x0, Bus 0x36, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x4000
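
For reference, the status value in the second message can be decoded bit by bit. This is only a rough sketch: it assumes that iLO reports the raw PCIe AER Uncorrectable Error Status register (iLO does not say so explicitly), and the function name `decode_uncorrectable_status` is made up for illustration. The bit positions are the ones defined for that register in the PCIe specification.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
# Assumption: the "Uncorrectable Error Status" shown by iLO is this raw register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_uncorrectable_status(status: int) -> list[str]:
    """Return the names of all error bits set in the status word."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            names.append(AER_UNCORRECTABLE_BITS.get(bit, f"unknown bit {bit}"))
    return names


# Value taken from the iLO message above.
print(decode_uncorrectable_status(0x4000))  # ['Completion Timeout']
```

If that assumption holds, 0x4000 is bit 14, which would mean a request sent to a device behind Bus 0x36 did not get a completion in time.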

Here is the lspci information:

37:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch [1b21:1182]
        Subsystem: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch [1b21:118f]
        Kernel driver in use: pcieport
38:03.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch [1b21:1182]
        Subsystem: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch [1b21:118f]
        Kernel driver in use: pcieport
38:07.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch [1b21:1182]
        Subsystem: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch [1b21:118f]
        Kernel driver in use: pcieport
39:00.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a]
        Subsystem: Global Unichip Corp. Coral Edge TPU [1ac1:089a]
        Kernel driver in use: vfio-pci
3a:00.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a]
        Subsystem: Global Unichip Corp. Coral Edge TPU [1ac1:089a]
        Kernel driver in use: vfio-pci

My VM config:

agent: 1
bios: ovmf
boot: order=scsi0
cores: 12
cpu: host
efidisk0: local-lvm:vm-100-disk-0,efitype=4m,size=4M
hostpci0: 0000:3a:00
hostpci1: 0000:39:00
localtime: 1
memory: 65536
meta: creation-qemu=8.1.5,ctime=1712586677
name: #########
numa: 0
ostype: l26
protection: 1
scsi0: local-lvm:vm-100-disk-1,cache=writethrough,discard=on,size=32G,ssd=1
scsihw: virtio-scsi-pci
sockets: 2
tablet: 0
tags:  

It is a dual Intel Xeon motherboard, and the adapter is plugged into a riser card at the back of the unit. I have tried the following:

  • Disabling SR-IOV in the BIOS
  • Changing the PCIe configuration to Gen 1 (in the BIOS)
  • Updating the GRUB cmdline for IOMMU (intel_iommu=on, iommu=pt, etc.)
  • Changing which PCIe slot it is plugged into
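
To rule out a typo in the GRUB change, the running kernel's command line can be checked for the two flags named above. A minimal sketch: the helper names `parse_cmdline` and `iommu_flags_present` and the sample string are hypothetical, and on the host the real string would come from `/proc/cmdline`.

```python
def parse_cmdline(cmdline: str) -> dict[str, list[str]]:
    """Split a kernel command line into {key: [comma-separated values]}."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value.split(",") if value else []
    return params


def iommu_flags_present(cmdline: str) -> dict[str, bool]:
    """Report whether intel_iommu=on and iommu=pt appear on the command line."""
    params = parse_cmdline(cmdline)
    return {
        "intel_iommu=on": "on" in params.get("intel_iommu", []),
        "iommu=pt": "pt" in params.get("iommu", []),
    }


# Hypothetical sample; on the host, read the real value with
# open("/proc/cmdline").read() instead.
sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt"
print(iommu_flags_present(sample))
```

This only confirms that the flags reached the kernel. It does not show whether the IOMMU was actually enabled, or how the devices are grouped.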

It's strange that it only crashes after Frigate starts: it runs stably for a while, then crashes suddenly, with no useful logs other than those from HP Integrated Lights-Out (iLO).

joshtwc, May 22 '24 14:05