Instinct MI100 cluster fails to reset on restart

Open TNT3530 opened this issue 1 year ago • 2 comments

ProxMox 7.3-3, Kernel 5.15.53-1-pve

I applied the changes here to get it functioning with this kernel, and double-checked that all PCIe device reset_method values are correctly set to device_specific.
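For anyone wanting to repeat that check, here is a minimal sketch that reads the reset_method attribute the kernel exposes under /sys/bus/pci/devices. The function names and the optional vendor filter are hypothetical helpers, not part of vendor-reset; 0x1002 is AMD's PCI vendor ID. It assumes a kernel new enough to expose reset_method (5.15 or later).

```python
from pathlib import Path


def read_reset_methods(sysfs_root="/sys/bus/pci/devices", vendor=None):
    """Map each PCI address to the list of methods in its reset_method file.

    Devices without a readable reset_method attribute are skipped.
    If vendor is given (e.g. "0x1002"), only matching devices are kept.
    """
    methods = {}
    for dev in sorted(Path(sysfs_root).iterdir()):
        try:
            if vendor is not None and (dev / "vendor").read_text().strip() != vendor:
                continue
            methods[dev.name] = (dev / "reset_method").read_text().split()
        except OSError:
            continue  # attribute missing or unreadable; skip this device
    return methods


def not_device_specific(methods, wanted="device_specific"):
    """Return the addresses whose reset_method is not exactly [wanted]."""
    return [addr for addr, m in methods.items() if m != [wanted]]


if __name__ == "__main__":
    root = Path("/sys/bus/pci/devices")
    if root.is_dir():
        found = read_reset_methods(root, vendor="0x1002")
        for addr, m in found.items():
            print(addr, " ".join(m) or "(none)")
        print("not device_specific:", not_device_specific(found))
```

Note that this only reports the value at the moment it is run; it says nothing about which reset method was used earlier during boot.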

First guest boot shows [image], but all GPUs pass through fine.

Attempting to shut down and restart the guest causes this: [image], ending in the guest failing to boot with "atombios stuck in loop for more than 20secs aborting": [image]

TNT3530 avatar May 03 '24 18:05 TNT3530

Your method of setting the reset to device specific is not supported; you are supposed to use the udev rules as provided in the project. Your service may be running too late, and the built-in reset may have already been used at some point during boot.

If this does not solve the problem, I am sorry but there is not much else we can do here.

gnif avatar May 08 '24 00:05 gnif

I have the dkms module loaded on the Proxmox host: [image]

and activated in my /etc/modules: [image]

With the service disabled, here is the initial boot: [image]

And all GPUs pass through fine.
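The module setup described above (listed in /etc/modules and loaded on the host) can also be confirmed without screenshots. The following is a rough sketch with hypothetical helper names; it assumes the usual one-module-per-line /etc/modules format with # comments, and relies on the kernel reporting module names with underscores in place of hyphens in /proc/modules.

```python
from pathlib import Path


def listed_modules(modules_text):
    """Return module names from /etc/modules-style text, in file order."""
    names = []
    for line in modules_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            names.append(line.split()[0])  # ignore any module parameters
    return names


def normalize(name):
    """Loaded modules are reported with underscores instead of hyphens."""
    return name.replace("-", "_")


def is_loaded(proc_modules_text, name):
    """Check whether a module appears in /proc/modules-style text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return normalize(name) in loaded


if __name__ == "__main__":
    etc, proc = Path("/etc/modules"), Path("/proc/modules")
    if etc.is_file() and proc.is_file():
        names = listed_modules(etc.read_text())
        print("order in /etc/modules:", names)
        print("vendor-reset loaded:", is_loaded(proc.read_text(), "vendor-reset"))
```

The list order is printed because the position of vendor-reset in /etc/modules is one of the things being varied in this thread.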

Upon restarting the guest, this is what it spits out: [image]

Searching dmesg | grep reset returns nothing other than the above and a few USB devices, and dmesg | grep vfio has no new lines, so I assume it isn't running.
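To run both searches in one pass, something like the sketch below could be used. The helper name is hypothetical, and unlike plain grep it matches case-insensitively, which is an assumption made here so that differently capitalized messages are not missed. Reading the kernel log via dmesg may require root.

```python
import re
import subprocess


def grep_log(log_text, *keywords):
    """Return log lines containing any of the keywords (case-insensitive)."""
    if not keywords:
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
        for line in grep_log(result.stdout, "reset", "vfio"):
            print(line)
    except FileNotFoundError:
        print("dmesg not available on this system")
```

Running it once before and once after the guest restart and comparing the output would show whether any new reset or vfio lines were logged.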

Moving vendor-reset to the first line of /etc/modules does the same thing as above, but with the bonus of: [image]

TNT3530 avatar May 08 '24 18:05 TNT3530