vendor-reset

Radeon PRO WX7100 (Polaris 10) reset bug not fixed

Open collector-ynh opened this issue 2 years ago • 2 comments

Hi, I'm using a Radeon Pro WX7100 (Polaris 10) under Debian 12/Proxmox with kernel 6.2.16-5-pve, and I'm also hitting the reset bug: after shutting down a VM, I can't pass the graphics card through to a VM again. Proxmox returns the following error:

[Screenshot: Sélection_226]

I tried to compile your module manually, but that didn't work. I then installed it with dkms, but the GPU still fails to reset, even after a reboot.

Here are the dmesg logs from boot until the reset bug: https://termbin.com/q2f3

Here are the excerpts showing the reset bug:

[ 5634.188698] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[ 5634.188708] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[ 5634.188714] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x1e@0x370
[ 5649.026420] kvm [52392]: ignored rdmsr: 0xc0011029 data 0x0
[ 5682.626559] pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
[ 5682.626567] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ 5682.626574] pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
[ 5682.626579] pcieport 0000:00:01.0:    [14] CmpltTO                (First)
[ 5684.895216] vfio-pci 0000:01:00.0: not ready 1023ms after bus reset; waiting
[ 5685.954898] vfio-pci 0000:01:00.0: not ready 2047ms after bus reset; waiting
[ 5688.223176] vfio-pci 0000:01:00.0: not ready 4095ms after bus reset; waiting
[ 5692.574871] vfio-pci 0000:01:00.0: not ready 8191ms after bus reset; waiting
[ 5692.908805]  zd48: p1 p2
[ 5693.073525] fwbr151i0: port 2(tap151i0) entered disabled state
[ 5693.102037] fwbr151i0: port 1(fwln151i0) entered disabled state
[ 5693.102073] vmbr0: port 2(fwpr151p0) entered disabled state
[ 5693.102287] device fwln151i0 left promiscuous mode
[ 5693.102290] fwbr151i0: port 1(fwln151i0) entered disabled state
[ 5693.135330] device fwpr151p0 left promiscuous mode
[ 5693.135333] vmbr0: port 2(fwpr151p0) entered disabled state
[ 5693.426664] vfio-pci 0000:01:00.1: Unable to change power state from D0 to D3hot, device inaccessible
[ 5693.426723] vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
[ 5693.427088] vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
[ 5693.427379] vfio-pci 0000:01:00.0: Unable to change power state from D0 to D3hot, device inaccessible
[ 5701.023015] vfio-pci 0000:01:00.0: not ready 16383ms after bus reset; waiting
[ 5702.518971] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 5719.198775] vfio-pci 0000:01:00.0: not ready 32767ms after bus reset; waiting
[ 5754.014338] vfio-pci 0000:01:00.0: not ready 65535ms after bus reset; giving up
[ 5754.014347] pcieport 0000:00:01.0: AER: Root Port link has been reset (-25)
[ 5754.014349] pcieport 0000:00:01.0: AER: subordinate device reset failed
[ 5754.014368] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 5754.014392] pcieport 0000:00:01.0: AER: device recovery failed
[ 5754.014910] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 5754.591663] device tap151i0 entered promiscuous mode
[ 5754.632069] vmbr0: port 2(fwpr151p0) entered blocking state
[ 5754.632073] vmbr0: port 2(fwpr151p0) entered disabled state
[ 5754.632115] device fwpr151p0 entered promiscuous mode
[ 5754.632139] vmbr0: port 2(fwpr151p0) entered blocking state
[ 5754.632141] vmbr0: port 2(fwpr151p0) entered forwarding state
[ 5754.639897] fwbr151i0: port 1(fwln151i0) entered blocking state
[ 5754.639900] fwbr151i0: port 1(fwln151i0) entered disabled state
[ 5754.639939] device fwln151i0 entered promiscuous mode
[ 5754.639965] fwbr151i0: port 1(fwln151i0) entered blocking state
[ 5754.639966] fwbr151i0: port 1(fwln151i0) entered forwarding state
[ 5754.647778] fwbr151i0: port 2(tap151i0) entered blocking state
[ 5754.647781] fwbr151i0: port 2(tap151i0) entered disabled state
[ 5754.647846] fwbr151i0: port 2(tap151i0) entered blocking state
[ 5754.647847] fwbr151i0: port 2(tap151i0) entered forwarding state
[ 5755.719980] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 5755.719994] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 5755.720075] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 5755.720739] vfio-pci 0000:01:00.0: vfio_cap_init: hiding cap 0xff@0xff
[ 5755.721892] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
[ 5755.721893] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
[ 5755.762130] fwbr151i0: port 2(tap151i0) entered disabled state
[ 5755.762304] fwbr151i0: port 2(tap151i0) entered disabled state
[ 5755.942689] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 5755.943089] vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
[ 5755.943095] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 5755.943299] vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
[ 5758.238273] vfio-pci 0000:01:00.0: not ready 1023ms after bus reset; waiting
[ 5759.294260] vfio-pci 0000:01:00.0: not ready 2047ms after bus reset; waiting
[ 5761.438157] vfio-pci 0000:01:00.0: not ready 4095ms after bus reset; waiting
[ 5765.790192] vfio-pci 0000:01:00.0: not ready 8191ms after bus reset; waiting
[ 5774.238071] vfio-pci 0000:01:00.0: not ready 16383ms after bus reset; waiting
[ 5790.877860] vfio-pci 0000:01:00.0: not ready 32767ms after bus reset; waiting
[ 5825.693414] vfio-pci 0000:01:00.0: not ready 65535ms after bus reset; giving up
[ 5825.715750] vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
[ 5825.715760] vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
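
For readers who want to spot the same symptom in their own logs, here is a minimal Python sketch that scans dmesg text for the `not ready ...ms after bus reset; giving up` lines shown above. The function name `failed_resets` and the regular expression are illustrative assumptions based only on the line format in this excerpt; they are not part of vendor-reset or the kernel.

```python
import re

# Pattern derived from the excerpt above, e.g.:
# [ 5754.014338] vfio-pci 0000:01:00.0: not ready 65535ms after bus reset; giving up
# (Assumption: other kernels or drivers may word this line differently.)
RESET_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+vfio-pci\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"not ready (?P<ms>\d+)ms after bus reset; (?P<state>waiting|giving up)$"
)


def failed_resets(dmesg_text):
    """Return (timestamp, pci_address, waited_ms) for each reset the kernel gave up on."""
    failures = []
    for line in dmesg_text.splitlines():
        match = RESET_RE.match(line.strip())
        # "waiting" lines are only retries; "giving up" marks the actual failure.
        if match and match.group("state") == "giving up":
            failures.append(
                (float(match.group("ts")), match.group("bdf"), int(match.group("ms")))
            )
    return failures
```

Run against the excerpt above, this would report two failed resets for 0000:01:00.0, one per attempt where the wait reached 65535ms.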

I also found this patch: https://gist.github.com/shatsky/2c8959eb3b9d2528ee8a7b9f58467aa0, which was mentioned in this Reddit thread: https://www.reddit.com/r/VFIO/comments/enmnnj/trying_to_understand_amd_polaris_reset_bug/

But I don't know how to apply it to kernel 6.2.16-5-pve, or whether it would work for me!

Thank you in advance for helping me find a solution, as it's very important for me to get this graphics card working in my server VMs 🙏

collector-ynh • Jul 30 '23 03:07

Running dkms install on its own does not set the module up to load automatically at boot.

There's a udev rules file (udev/99-vendor-reset.rules) that you need to copy to wherever udev rules live on your distro (usually /etc/udev/rules.d), and then reboot.
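
After the reboot, a quick check along these lines can confirm that both steps took effect. This is a rough Python sketch, not part of vendor-reset: the helper names `module_loaded` and `rules_installed` are made up for illustration, and it assumes the module appears in /proc/modules as `vendor_reset` (the kernel lists hyphens in module names as underscores) and that the rules file keeps its name when copied.

```python
from pathlib import Path


def module_loaded(name, proc_modules="/proc/modules"):
    """Return True if `name` is listed as a loaded module in /proc/modules."""
    # Assumption: hyphens in module names show up as underscores in this file.
    wanted = name.replace("-", "_")
    for line in Path(proc_modules).read_text().splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] == wanted:
            return True
    return False


def rules_installed(rules_dir="/etc/udev/rules.d", filename="99-vendor-reset.rules"):
    """Return True if the udev rules file exists in the given rules directory."""
    return (Path(rules_dir) / filename).is_file()
```

Calling `rules_installed()` and `module_loaded("vendor-reset")` on the host should both return `True` if the rules file was copied and the module was loaded at boot. If your distro keeps udev rules elsewhere, pass that directory as `rules_dir`.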

linnaea • Aug 11 '23 09:08

WX9100 works

labor4 • Sep 27 '23 19:09