How to add new GPUs/devices?
Do any of you understand the code? For me it's a little too close to the hardware, so I don't get the implementation.
- Is it enough to add my IDs to "device-db.h"? or..
- Do I need to create a new implementation for a new device?
In my case I'd like to reset a Cezanne iGPU.
The file amdgpu.h in the Linux kernel source has a useful explanation of the different reset strategies. You can find out which device ID belongs to which chipset name in the driver in amdgpu_drv.c.
/**
 * enum amd_reset_method - Methods for resetting AMD GPU devices
 *
 * @AMD_RESET_METHOD_NONE: The device will not be reset.
 * @AMD_RESET_LEGACY: Method reserved for SI, CIK and VI ASICs.
 * @AMD_RESET_MODE0: Reset the entire ASIC. Not currently available for the
 *                   any device.
 * @AMD_RESET_MODE1: Resets all IP blocks on the ASIC (SDMA, GFX, VCN, etc.)
 *                   individually. Suitable only for some discrete GPU, not
 *                   available for all ASICs.
 * @AMD_RESET_MODE2: Resets a lesser level of IPs compared to MODE1. Which IPs
 *                   are reset depends on the ASIC. Notably doesn't reset IPs
 *                   shared with the CPU on APUs or the memory controllers (so
 *                   VRAM is not lost). Not available on all ASICs.
 * @AMD_RESET_BACO: BACO (Bus Alive, Chip Off) method powers off and on the card
 *                  but without powering off the PCI bus. Suitable only for
 *                  discrete GPUs.
 * @AMD_RESET_PCI: Does a full bus reset using core Linux subsystem PCI reset
 *                 and does a secondary bus reset or FLR, depending on what the
 *                 underlying hardware supports.
 *
 * Methods available for AMD GPU driver for resetting the device. Not all
 * methods are suitable for every device. User can override the method using
 * module parameter `reset_method`.
 */
This led me to believe that for most GPUs there is a reasonable chance that you can simply add the device ID from lspci -nnk | grep VGA to device-db.h. In my case I added 1638 for the Ryzen 5 5600G under _AMD_NAVI14(op) and it actually worked. I picked the Navi10 code path because the source of navi10.c seemed to fit the explanation above best. Though there are basically only vega10, vega20, navi10 and polaris10 that you could try at random.
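As a rough sketch of that step, the helper below turns lspci -nn output into a candidate device-db.h line. The function name db_entry_from_lspci is made up for this example and is not part of vendor-reset; it assumes the AMD vendor ID 1002 and the entry layout quoted in this thread, and the family name you pass in is still a guess you have to test.

```shell
#!/bin/sh
# Hypothetical helper (not part of vendor-reset): read `lspci -nn` lines on
# stdin and print a candidate device-db.h entry for every AMD device found.
# $1 = reset implementation to try, e.g. AMD_NAVI14 (an assumption to test).
db_entry_from_lspci() {
    family="$1"
    # lspci -nn prints IDs as [vendor:device], e.g. [1002:1638];
    # keep only the 4-digit device ID that follows vendor 1002.
    sed -n 's/.*\[1002:\([0-9a-fA-F]\{4\}\)\].*/\1/p' | while read -r dev; do
        printf '{PCI_VENDOR_ID_ATI, 0x%s, op, DEVICE_INFO(%s)}, \\\n' "$dev" "$family"
    done
}
```

Possible usage: lspci -nn | grep VGA | db_entry_from_lspci AMD_NAVI14, then paste the printed line into the matching block of src/device-db.h and rebuild.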
Don't forget that you need to modprobe vendor_reset and change the reset_method of your PCIe device to "device_specific", like it is done in udev/99-vendor-reset.rules.
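If you want to try this by hand before relying on the udev rule, something like the following might do. It is only a sketch: use_device_specific_reset is a name invented here, and it assumes your kernel exposes the reset_method attribute under /sys/bus/pci/devices/. The second argument exists only so the function can be pointed at a scratch directory instead of the real sysfs.

```shell
#!/bin/sh
# Hypothetical helper: select the device_specific reset for one PCI device.
# $1 = PCI address such as 0000:06:00.0
# $2 = sysfs root (defaults to /sys; overridable only for dry runs)
use_device_specific_reset() {
    bdf="$1"
    sysfs="${2:-/sys}"
    attr="$sysfs/bus/pci/devices/$bdf/reset_method"
    if [ ! -e "$attr" ]; then
        echo "no reset_method attribute at $attr" >&2
        return 1
    fi
    # Same write the udev rule performs
    echo device_specific > "$attr" || return 1
    # Print what the attribute now contains
    cat "$attr"
}
```

As root this could be run as modprobe vendor-reset && use_device_specific_reset 0000:06:00.0, with the address replaced by your own GPU's.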
I can power the VM on and off now thanks to vendor-reset, without being required to restart my entire PC because of this error:
qemu-system-x86_64: ../qemu-9.1.2/hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
I also use Virtual-Display-Driver so I don't have to connect a monitor for Looking Glass to produce output, and this also simultaneously gets rid of another bug.
https://github.com/gnif/vendor-reset/pull/89
@ballerburg9005 Thanks a ton for this information!!!
I have some additions, as it was not completely clear to me what I needed to do to get that hook running:
Everything should be done as root.
- Add the device ID (in my case 1636 for Vega6/Ryzen 3 4350G) to vendor-reset/src/device-db.h. That ID can be found with
lspci -nnk | grep VGA
The entry should look like this:
{PCI_VENDOR_ID_ATI, 0x1636, op, DEVICE_INFO(AMD_NAVI14)}, \
- Go to vendor-reset and run
make
I think this error can be ignored:
Skipping BTF generation for /home/nasadmin/vendor-reset/vendor-reset.ko due to unavailability of vmlinux
I tried to fix it, which made it worse; in the end I had more errors that I didn't know how to fix. If more errors occur, you are most likely not root, are missing the kernel headers, or have some unusual kernel config. I actually updated my kernel in the process, which was not a good idea. After a restart everything was fine again.
- Install the vendor_reset module with
dkms install .
- echo "vendor-reset" >> /etc/modules
- To test whether vendor_reset is there, you can also run rmmod vendor_reset and lsmod | grep vendor_reset
- reboot
- Copy vendor-reset/udev/99-vendor-reset.rules to /etc/udev/rules.d/99-vendor-reset.rules and add the device ID there as well. It should look like this:
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x1002", ATTR{device}=="0x1636", RUN+="/bin/sh -c '/sbin/modprobe vendor-reset; echo device_specific > /sys$env{DEVPATH}/reset_method'"
- run
udevadm control --reload-rules && udevadm trigger
Now journalctl -xe should show something like:
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: version 1.1
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: performing pre-reset
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: performing reset
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 01:26:36 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 01:26:36 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: bus reset disabled? yes
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: SMU response reg: 1, sol reg: 2ba060, mp1 intr enabled? yes, bl ready? yes
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: Clearing scratch regs 6 and 7
Feb 10 01:26:36 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: begin psp mode 1 reset
Feb 10 01:26:37 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: mode1 reset succeeded
Feb 10 01:26:37 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: PSP mode1 reset successful
Feb 10 01:26:37 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: performing post-reset
Feb 10 01:26:37 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI14: reset result = 0
which means the hook is working.
If not, you most likely have a typo in the device ID :D (that was the case for me)
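A quick way to rule out that typo is to grep the installed rules file for the ID you expect. This is just a sketch; rule_has_device is a hypothetical name, and it only checks for the ATTR{device} match used in the rule shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the udev rules file contains a rule
# matching the given PCI device ID.
# $1 = path to the rules file
# $2 = device ID without the 0x prefix, e.g. 1636
rule_has_device() {
    rules="$1"
    dev="$2"
    # -i because hex digits may be written in either case
    grep -qi "ATTR{device}==\"0x$dev\"" "$rules"
}
```

For example: rule_has_device /etc/udev/rules.d/99-vendor-reset.rules 1636 && echo "rule present". You can compare against what the kernel reports with cat /sys/bus/pci/devices/0000:06:00.0/device (again with your own address).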
BUT I have had no luck with NAVI so far. Still Code 43 in Windows and no output in Fedora. I will test the other methods as well and see whether I get some output. Hopefully.
Funnily enough, when I use Windows alone, reboots and shutdowns work fine if I use an enable and disable script.
But I want Mac and Linux VMs to use my iGPU as well :) So hopefully I can get a hook like that running.
I don't like the other hooks that put the host into sleep mode...
No Luck with Vega20
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: version 1.0
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: performing pre-reset
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: performing reset
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:11:58 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:11:58 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: bus reset disabled? yes
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: SMU response reg: fe, sol reg: 342b49, mp1 intr enabled? no, bl ready? no, baco? off
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: could not get enabled SMU features, trying BACO reset anyway [ret -110]
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: entering BACO
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: failed to reset device
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: performing post-reset
Feb 10 02:11:58 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA20: reset result = 0
Feb 10 02:11:59 omv kernel: kauditd_printk_skb: 3 callbacks suppressed
and Vega10
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: version 1.0
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: performing pre-reset
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: performing reset
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:10:49 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:10:49 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: bus reset disabled? yes
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: SMU response reg: 1, sol reg: 33f213, mp1 intr enabled? no, bl ready? no, baco? off
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: enabled features: 0
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: disabling features
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: Driver reset
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: Failed to send message 0x45: return 0xfe
Feb 10 02:10:49 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: Could not reset w/ PPSMC_MSG_GfxDeviceDriverReset: 254
Feb 10 02:10:50 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: entering BACO
Feb 10 02:10:50 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:10:50 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: failed to reset device
Feb 10 02:10:50 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: performing post-reset
Feb 10 02:10:50 omv kernel: vfio-pci 0000:06:00.0: AMD_VEGA10: reset result = 0
Although "failed to reset device" doesn't sound good.
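Note that the logs above still end with reset result = 0 even when they contain failed to reset device, so the result line alone says little. A small filter like the one below could make that easier to spot. It is a sketch that only looks for the message strings visible in the logs in this thread; reset_outcome is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: classify vendor-reset kernel log lines read on stdin.
# Prints "failed" if any line reports a failed reset, "succeeded" if a
# success message is seen and no failure, otherwise "unknown".
reset_outcome() {
    awk '
        /failed to reset device/            { failed = 1 }
        /reset succeeded|reset successful/  { ok = 1 }
        END {
            if (failed)   print "failed"
            else if (ok)  print "succeeded"
            else          print "unknown"
        }'
}
```

Possible usage: journalctl -k | grep 0000:06:00.0 | reset_outcome. Since it scans everything it is given, one failed attempt anywhere in the input makes it report failed, so limit the input to the attempt you care about.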
POLARIS10 same
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: version 1.0
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: performing pre-reset
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: performing reset
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:19:13 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:19:13 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: bus reset disabled? yes
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: SMU response reg: fe, sol reg: 3585c0, mp1 intr enabled? no, bl ready? no, baco? off
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: could not get enabled SMU features, trying BACO reset anyway [ret -110]
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: entering BACO
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: failed to reset device
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: performing post-reset
Feb 10 02:19:13 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS10: reset result = 0
Feb 10 02:19:13 omv kernel: kauditd_printk_skb: 3 callbacks suppressed
Polaris12
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: version 1.0
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing pre-reset
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing reset
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:23:26 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:23:26 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: bus reset disabled? yes
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: SMU response reg: fe, sol reg: 3650d2, mp1 intr enabled? no, bl ready? no, baco? off
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: could not get enabled SMU features, trying BACO reset anyway [ret -110]
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: entering BACO
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: failed to reset device
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing post-reset
Feb 10 02:23:26 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: reset result = 0
Feb 10 02:23:26 omv kernel: kauditd_printk_skb: 3 callbacks suppressed
Shutdown brought another weird error:
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: version 1.0
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing pre-reset
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing reset
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:25:37 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:25:37 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: bus reset disabled? yes
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: SMU response reg: 1, sol reg: 36ba48, mp1 intr enabled? no, bl ready? no, baco? off
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: enabled features: 0
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: disabling features
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: Driver reset
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: Failed to send message 0x45: return 0xfe
Feb 10 02:25:37 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: Could not reset w/ PPSMC_MSG_GfxDeviceDriverReset: 254
Feb 10 02:25:38 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: entering BACO
Feb 10 02:25:38 omv kernel: vfio-pci 0000:06:00.0: SMU error 0xfe
Feb 10 02:25:38 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: failed to reset device
Feb 10 02:25:38 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: performing post-reset
Feb 10 02:25:38 omv kernel: vfio-pci 0000:06:00.0: AMD_POLARIS12: reset result = 0
No luck... too bad, so for my card it's not working.
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: version 1.1
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: performing pre-reset
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: performing reset
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
Feb 10 02:42:22 omv kernel: ATOM BIOS: 113-RENOIR-035
Feb 10 02:42:22 omv kernel: vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: bus reset disabled? yes
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: SMU response reg: 1, sol reg: 1b3e78, mp1 intr enabled? yes, bl ready? yes
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: Clearing scratch regs 6 and 7
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: begin psp mode 1 reset
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: mode1 reset succeeded
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: PSP mode1 reset successful
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: performing post-reset
Feb 10 02:42:22 omv kernel: vfio-pci 0000:06:00.0: AMD_NAVI12: reset result = 0
Feb 10 02:42:23 omv sudo[9137]: pam_unix(sudo:session): session closed for user root
Feb 10 02:42:23 omv kernel: kauditd_printk_skb: 3 callbacks suppressed
Same output. But my card has error 43 in Windows and no output...
@ballerburg9005
Though there are basically only vega10, vega20, navi10 and polaris10
Why not Navi14? That exists as well. If it's the same as navi10, how do you know that?
Thanks
@dmuiX where have you stored your vbios, did you extract it as mentioned in isc30's repo? And how does vendor-reset know the location for the vbios, only in the vm configuration file,
hostpci0: 0000:75:00.0,pcie=1,romfile=vbios_6600H.bin
like so?
Thanks
Actually, the ROM of the HDMI audio device was the problem. When I imported the correct one, error 43 was gone and I had a working iGPU passthrough with reset, which is amazing!! Shutdown works just great. If I need to reset the VM, it breaks, but this is expected. I haven't tested putting the GPU on another VM.
@dmuiX where have you stored your vbios, did you extract it as mentioned in isc30's repo? And how does vendor-reset know the location for the vbios, only in the vm configuration file,
hostpci0: 0000:75:00.0,pcie=1,romfile=vbios_6600H.bin
like so?
Thanks
I extracted the vbios and the audio device ROM from a BIOS update for my mainboard using UBU: https://winraid.level1techs.com/t/tool-guide-news-uefi-bios-updater-ubu/3035
After extraction it seems I converted it as well... so extraction alone is not enough.
See here for more info: https://gist.github.com/matt22207/bb1ba1811a08a715e32f106450b0418a?permalink_comment_id=4955044#gistcomment-4955044
I suppose vendor-reset doesn't need to know the vbios. Only KVM needs to know where to look for it, and vendor-reset kicks in automatically when you boot the VM.
The location of the vbios in Debian is /usr/share/vgabios. As far as I remember, KVM doesn't accept any other location.