vendor-reset

How to add new GPUs/devices?

Open ghost opened this issue 2 years ago • 7 comments

Do any of you understand the code? For me it's a little too close to the hardware, so I don't get the implementation.

  1. Is it enough to add my IDs to "device-db.h"? or..
  2. Do I need to create a new implementation for a new device?

In my case I'd like to reset a Cezanne iGPU.

ghost avatar Aug 19 '23 09:08 ghost

The file amdgpu.h in the Linux kernel source code has a useful explanation of the different reset strategies. You can find out which device ID belongs to which chipset name in the driver by looking at amdgpu_drv.c.

/**
 * enum amd_reset_method - Methods for resetting AMD GPU devices
 *
 * @AMD_RESET_METHOD_NONE: The device will not be reset.
 * @AMD_RESET_LEGACY: Method reserved for SI, CIK and VI ASICs.
 * @AMD_RESET_MODE0: Reset the entire ASIC. Not currently available for the
 *                   any device.
 * @AMD_RESET_MODE1: Resets all IP blocks on the ASIC (SDMA, GFX, VCN, etc.)
 *                   individually. Suitable only for some discrete GPU, not
 *                   available for all ASICs.
 * @AMD_RESET_MODE2: Resets a lesser level of IPs compared to MODE1. Which IPs
 *                   are reset depends on the ASIC. Notably doesn't reset IPs
 *                   shared with the CPU on APUs or the memory controllers (so
 *                   VRAM is not lost). Not available on all ASICs.
 * @AMD_RESET_BACO: BACO (Bus Alive, Chip Off) method powers off and on the card
 *                  but without powering off the PCI bus. Suitable only for
 *                  discrete GPUs.
 * @AMD_RESET_PCI: Does a full bus reset using core Linux subsystem PCI reset
 *                 and does a secondary bus reset or FLR, depending on what the
 *                 underlying hardware supports.
 *
 * Methods available for AMD GPU driver for resetting the device. Not all
 * methods are suitable for every device. User can override the method using
 * module parameter `reset_method`.
 */

This led me to believe that for most GPUs the chances are not bad that you can simply add your device ID from lspci -nnk | grep VGA to device-db.h. In my case I added 1638 for the Ryzen 5 5600G under _AMD_NAVI14(op) and it actually worked. The rationale for picking Navi10 was that the source code of navi10.c seemed to fit the explanation above best. Though there are basically only vega10, vega20, navi10 and polaris10 that you could try at random.
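A minimal sketch of pulling just the device ID out of the lspci output, assuming the usual [vendor:device] bracket format and the AMD/ATI vendor ID 1002 (the function name amd_device_ids is made up for illustration):

```shell
# Read `lspci -nn` lines on stdin and print the four-hex-digit device ID
# of every AMD/ATI device (vendor 1002) found in them.
# amd_device_ids is a hypothetical helper name, not part of vendor-reset.
amd_device_ids() {
    grep -oE '\[1002:[0-9a-f]{4}\]' | sed -E 's/^\[1002:([0-9a-f]{4})\]$/\1/'
}
```

Usage would be something like lspci -nn | grep VGA | amd_device_ids; the printed value is what goes into device-db.h with a 0x prefix.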

Don't forget that you need to modprobe vendor_reset and change the reset_method of your PCIe device to "device_specific", like it is done in udev/99-vendor-reset.rules.
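Done by hand instead of through the udev rule, that could look roughly like the sketch below. It assumes the device exposes a reset_method attribute under /sys/bus/pci/devices; the function name and the SYSFS_PCI override (only there so the function can be dry-run against a scratch directory) are made up for illustration.

```shell
# Write "device_specific" to a PCI device's reset_method attribute.
# $1: PCI address such as 0000:06:00.0
# SYSFS_PCI: hypothetical override of the sysfs directory, for dry runs only.
use_device_specific_reset() {
    attr="${SYSFS_PCI:-/sys/bus/pci/devices}/$1/reset_method"
    if [ ! -e "$attr" ]; then
        echo "no reset_method attribute at $attr" >&2
        return 1
    fi
    echo device_specific > "$attr"
}
```

As root, something like modprobe vendor_reset && use_device_specific_reset 0000:06:00.0 would then do the same as the rule, with the PCI address replaced by your own.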

I can power the VM on and off now thanks to vendor-reset, without being required to restart my entire PC because of this error:

qemu-system-x86_64: ../qemu-9.1.2/hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.

I also use Virtual-Display-Driver so I don't have to connect a monitor for Looking Glass to produce output, and this also simultaneously gets rid of another bug.

ballerburg9005 avatar Dec 04 '24 19:12 ballerburg9005

https://github.com/gnif/vendor-reset/pull/89

ballerburg9005 avatar Dec 06 '24 21:12 ballerburg9005

@ballerburg9005 Thanks a ton for this information!!!

I have some additions, as it was not completely clear to me what I needed to do to get that hook running.

Everything below should be done as root.

  1. Add the device ID (in my case 1636 for Vega6/Ryzen 3 4350G) to vendor-reset/src/device-db.h. The ID can be found with lspci -nnk | grep VGA. The entry should look like this: {PCI_VENDOR_ID_ATI, 0x1636, op, DEVICE_INFO(AMD_NAVI14)}, \
  2. Go to vendor-reset and run make. I think the message "Skipping BTF generation for /home/nasadmin/vendor-reset/vendor-reset.ko due to unavailability of vmlinux" can be ignored. I tried to fix it, which made things worse; in the end I had more errors that I didn't know how to fix. If more errors occur, you are most likely not root, are missing kernel headers, or have some unusual kernel config. I also updated my kernel in the process, which was not a good idea. After a restart everything was fine again.
  3. Install the vendor_reset module with dkms install .
  4. echo "vendor-reset" >> /etc/modules
  5. To test whether vendor_reset is loaded, you can run lsmod | grep vendor_reset (lsmod takes no module argument); rmmod vendor_reset also only succeeds if the module was loaded.
  6. Reboot.
  7. Copy vendor-reset/udev/99-vendor-reset.rules to /etc/udev/rules.d/99-vendor-reset.rules and add the device ID there as well. It should look like this:
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x1002", ATTR{device}=="0x1636", RUN+="/bin/sh -c '/sbin/modprobe vendor-reset; echo device_specific > /sys$env{DEVPATH}/reset_method'"
  8. Run udevadm control --reload-rules && udevadm trigger
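Since a typo in the ID is easy to make in both places, here is a rough sketch that prints the device-db.h entry and the udev rule for a given ID so they can be pasted instead of typed. Both function names are hypothetical helpers; the output formats are copied from the entry and the rule quoted in the steps above.

```shell
# Print a device-db.h entry.
# $1: device ID as four hex digits (e.g. 1636)
# $2: family name as used in device-db.h (e.g. AMD_NAVI14)
device_db_entry() {
    printf '{PCI_VENDOR_ID_ATI, 0x%s, op, DEVICE_INFO(%s)}, \\\n' "$1" "$2"
}

# Print the matching udev rule for an AMD/ATI device (vendor 0x1002).
# The quoted heredoc keeps $env{DEVPATH} literal; sed fills in the device ID.
vendor_reset_udev_rule() {
    sed "s/@DEVICE@/$1/" <<'EOF'
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x1002", ATTR{device}=="0x@DEVICE@", RUN+="/bin/sh -c '/sbin/modprobe vendor-reset; echo device_specific > /sys$env{DEVPATH}/reset_method'"
EOF
}
```

For example, device_db_entry 1636 AMD_NAVI14 and vendor_reset_udev_rule 1636 reproduce the two lines shown in the steps; the family name still has to be chosen by trial as described earlier in the thread.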

Now journalctl -xe should show something like:

Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: version 1.1
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: performing pre-reset
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: performing reset
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 01:26:36 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 01:26:36 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: bus reset disabled? yes
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: SMU response reg: 1, sol reg: 2ba060, mp1 intr enabled? yes, bl ready? yes
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: Clearing scratch regs 6 and 7
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: begin psp mode 1 reset
Feb 10 01:26:37 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: mode1 reset succeeded
Feb 10 01:26:37 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: PSP mode1 reset successful
Feb 10 01:26:37 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: performing post-reset
Feb 10 01:26:37 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: reset result = 0

which means the hook is working.

If not, then you most likely have a typo in the device ID :D (that was the case for me).

BUT I have had no luck with NAVI so far. Still Code 43 in Windows and no output in Fedora. I will test the other methods as well. Let's see if I get some output. Hopefully.

Funnily enough, when I use Windows alone, reboots and shutdowns work fine as long as I use an enable and disable script.

But I want Mac and Linux VMs to use my iGPU as well :) So hopefully I get a hook like that running.

I don't like the other hooks that put the host into sleep mode...

No luck with Vega20:

Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: version 1.0
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: performing pre-reset
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: performing reset
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:11:58 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:11:58 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: bus reset disabled? yes
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: SMU response reg: fe, sol reg: 342b49, mp1 intr enabled? no, bl ready? no, baco? off
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: could not get enabled SMU features, trying BACO reset anyway [ret -110]
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: entering BACO
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: failed to reset device
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: performing post-reset
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: reset result = 0
Feb 10 02:11:59 omv kernel: kauditd_printk_skb: 3 callbacks suppressed

And Vega10:

Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: version 1.0
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: performing pre-reset
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: performing reset
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:10:49 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:10:49 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: bus reset disabled? yes
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: SMU response reg: 1, sol reg: 33f213, mp1 intr enabled? no, bl ready? no, baco? off
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: enabled features: 0
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: disabling features
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: Driver reset
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: Failed to send message 0x45: return 0xfe
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: Could not reset w/ PPSMC_MSG_GfxDeviceDriverReset: 254
Feb 10 02:10:50 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: entering BACO
Feb 10 02:10:50 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:10:50 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: failed to reset device
Feb 10 02:10:50 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: performing post-reset
Feb 10 02:10:50 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: reset result = 0

Although "failed" doesn't sound good.

POLARIS10, same:
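In the Vega logs above, "reset result = 0" is printed even though "failed to reset device" appears two lines earlier, so the result line alone does not say much. A small sketch that classifies a kernel log by looking for the failure message first (the function name and the three output labels are made up; the matched strings are taken from the logs in this thread):

```shell
# Read kernel log lines on stdin and print one of:
#   failed             - a "failed to reset device" line was seen
#   no-failure-logged  - a "reset result = 0" line was seen, and no failure line
#   no-reset-seen      - neither message appeared
reset_outcome() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'failed to reset device'; then
        echo failed
    elif printf '%s\n' "$log" | grep -q 'reset result = 0'; then
        echo no-failure-logged
    else
        echo no-reset-seen
    fi
}
```

It could be fed with something like journalctl -k -b | reset_outcome. Note that "no-failure-logged" only means the module did not report a failure; it does not prove the guest will get a working GPU.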

Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: version 1.0
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: performing pre-reset
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: performing reset
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:19:13 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:19:13 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: bus reset disabled? yes
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: SMU response reg: fe, sol reg: 3585c0, mp1 intr enabled? no, bl ready? no, baco? off
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: could not get enabled SMU features, trying BACO reset anyway [ret -110]
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: entering BACO
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: failed to reset device
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: performing post-reset
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: reset result = 0
Feb 10 02:19:13 omv kernel: kauditd_printk_skb: 3 callbacks suppressed

Polaris12:

Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: version 1.0
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing pre-reset
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing reset
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:23:26 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:23:26 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: bus reset disabled? yes
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: SMU response reg: fe, sol reg: 3650d2, mp1 intr enabled? no, bl ready? no, baco? off
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: could not get enabled SMU features, trying BACO reset anyway [ret -110]
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: entering BACO
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: failed to reset device
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing post-reset
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: reset result = 0
Feb 10 02:23:26 omv kernel: kauditd_printk_skb: 3 callbacks suppressed

Shutdown brought another weird error:

Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: version 1.0
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing pre-reset
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing reset
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:25:37 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:25:37 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: bus reset disabled? yes
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: SMU response reg: 1, sol reg: 36ba48, mp1 intr enabled? no, bl ready? no, baco? off
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: enabled features: 0
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: disabling features
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: Driver reset
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: Failed to send message 0x45: return 0xfe
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: Could not reset w/ PPSMC_MSG_GfxDeviceDriverReset: 254
Feb 10 02:25:38 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: entering BACO
Feb 10 02:25:38 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:25:38 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: failed to reset device
Feb 10 02:25:38 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing post-reset
Feb 10 02:25:38 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: reset result = 0

No luck... too bad, so for my card it's not working.

Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: version 1.1
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: performing pre-reset
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: performing reset
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:42:22 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:42:22 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: bus reset disabled? yes
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: SMU response reg: 1, sol reg: 1b3e78, mp1 intr enabled? yes, bl ready? yes
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: Clearing scratch regs 6 and 7
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: begin psp mode 1 reset
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: mode1 reset succeeded
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: PSP mode1 reset successful
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: performing post-reset
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: reset result = 0
Feb 10 02:42:23 omv sudo[9137]: pam_unix(sudo:session): session closed for user root
Feb 10 02:42:23 omv kernel: kauditd_printk_skb: 3 callbacks suppressed

Same output. But my card has error 43 in Windows and no output...

dmuiX avatar Feb 10 '25 00:02 dmuiX

@ballerburg9005 "Though there are basically only vega10, vega20, navi10 and polaris10"

Why not Navi14? That exists as well. If it's the same as navi10, how do you know that?

Thanks

dmuiX avatar Feb 10 '25 00:02 dmuiX

@dmuiX where have you stored your vbios, did you extract it as mentioned in isc30's repo? And how does vendor-reset know the location for the vbios, only in the vm configuration file,

hostpci0: 0000:75:00.0,pcie=1,romfile=vbios_6600H.bin

like so?

Thanks

kichappa avatar May 13 '25 20:05 kichappa

Actually, the ROM of the HDMI audio device was the problem. When I imported the correct one, the error 43 was gone and I had a working iGPU passthrough with reset, which is amazing!! Shutdown works just great. If I need to reset the VM, it breaks, but this is expected. I haven't tested putting the GPU on another VM.

dmuiX avatar Sep 15 '25 17:09 dmuiX

I extracted the vBIOS and the audio device ROM from a BIOS update for my mainboard using UBU: https://winraid.level1techs.com/t/tool-guide-news-uefi-bios-updater-ubu/3035

After extraction it seems I converted it as well... so extraction alone is not enough.

See here for more info: https://gist.github.com/matt22207/bb1ba1811a08a715e32f106450b0418a?permalink_comment_id=4955044#gistcomment-4955044

I suppose vendor-reset doesn't need to know about the vBIOS. Only KVM needs to know where to look for it, and it's automatically applied by vendor-reset when you boot the VM.

The location of the vBIOS in Debian is /usr/share/vgabios. As far as I remember, KVM doesn't accept any other location.

dmuiX avatar Sep 15 '25 17:09 dmuiX