
Code 43 in guest when passing through NVIDIA GPU

Open AnErrupTion opened this issue 1 year ago • 13 comments

Bug Description

When following the guide over here, adapted to pass through a dedicated GPU, a Code 43 error shows up in Device Manager after installing the GPU drivers in the guest system.

How to Reproduce

  1. Follow the previously linked guide, ensuring that:
  • vfio-pci is correctly bound to the GPU
  • The proper memlock modifications are made in /etc/security/limits.conf
  • The proper permissions are set on /dev/vfio/*
  • The VFIO device is attached to the guest using --attachvfio
  2. Install the GPU drivers, in this case the latest NVIDIA 560.81 drivers
  3. Reboot and observe the Code 43 error (NOTE: a fairly long freeze of ~5-6 seconds can also be observed when booting the VM; I assume it tries and fails to load the NVIDIA driver)
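For reference, the host-side steps above can be sketched roughly as follows. This is a hedged sketch, not the guide itself: the PCI address 0000:01:00.0 and VM name "Windows 11" are taken from this report, driverctl is just one convenient way to bind vfio-pci, and the exact --attachvfio syntax is whatever the virtualbox-kvm documentation specifies.

```shell
# 1. Bind the GPU to vfio-pci (driverctl is one way; run as root)
driverctl set-override 0000:01:00.0 vfio-pci

# 2. Raise the memlock limit in /etc/security/limits.conf, e.g. for
#    members of the vboxusers group (group name is an assumption):
#      @vboxusers soft memlock unlimited
#      @vboxusers hard memlock unlimited

# 3. Make the VFIO device nodes accessible to the VM user
chown "$USER" /dev/vfio/*

# 4. Attach the VFIO device to the guest (option syntax assumed;
#    consult the virtualbox-kvm README for the exact form)
VBoxManage modifyvm "Windows 11" --attachvfio 0000:01:00.0
```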

VM configuration

Guest OS configuration details:

  • Guest OS type and version (e.g. Windows 10 22H2): Windows 11 23H2
  • Attach guest VM configuration file from VirtualBox VMs/<guest VM name>/<guest VM name>.vbox: Windows 11.vbox.zip

Host OS details:

  • Host OS distribution: Arch Linux
  • Host OS kernel version: Linux shininglea 6.10.4-arch2-1 #1 SMP PREEMPT_DYNAMIC Sun, 11 Aug 2024 16:19:06 +0000 x86_64 GNU/Linux

Logs

AnErrupTion avatar Aug 14 '24 14:08 AnErrupTion

I see the split lock detection triggers in your dmesg log. That will cause issues for the VM, up to the point where it may not make any progress. I am not sure whether that is the root cause of your issue, but please try the recommendation from the README and see if it helps:

Starting with Intel Tiger Lake (11th Gen Core processors) or newer, split lock detection must be turned off in the host system. This can be achieved using the Linux kernel command line parameter split_lock_detect=off or using the split_lock_mitigate sysctl.
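Concretely, the two options from the README look roughly like this on a GRUB-based host (paths assume GRUB; adjust for your bootloader, and note the sysctl is only present on kernels that expose it):

```shell
# a) Kernel command line: add split_lock_detect=off to
#    GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then regenerate
#    the config and reboot:
grub-mkconfig -o /boot/grub/grub.cfg

# b) At runtime, via the sysctl mentioned in the README:
sysctl kernel.split_lock_mitigate=0
```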

snue avatar Aug 14 '24 15:08 snue

I see the split lock detection triggers in your dmesg log. That will cause issues for the VM, up to the point where it may not make any progress. I am not sure whether that is the root cause of your issue, but please try the recommendation from the README and see if it helps:

Starting with Intel Tiger Lake (11th Gen Core processors) or newer, split lock detection must be turned off in the host system. This can be achieved using the Linux kernel command line parameter split_lock_detect=off or using the split_lock_mitigate sysctl.

I was pretty sure I had already disabled it. But, either way, adding the command line parameter didn't do anything, although I now see this in dmesg:

 Unknown kernel command line parameters "split_lock_detect=off", will be passed to user space.

But I also see x86/split lock detection: disabled earlier in the log, so I'm assuming it's actually disabled now.

AnErrupTion avatar Aug 14 '24 15:08 AnErrupTion

@snue is correct.

Here we have it

[ 2109.050169] x86/split lock detection: #AC: EMT-0/4675 took a split_lock trap at address: 0xfffff8021f251f4f

Unknown kernel command line parameters "split_lock_detect=off", will be passed to user space.

Yes, this is expected.

But I also see x86/split lock detection: disabled earlier in the log, so I'm assuming it's actually disabled now.

Sounds about right. Did it solve your issue?

tpressure avatar Aug 14 '24 15:08 tpressure

@snue is correct.

Here we have it

[ 2109.050169] x86/split lock detection: #AC: EMT-0/4675 took a split_lock trap at address: 0xfffff8021f251f4f

Unknown kernel command line parameters "split_lock_detect=off", will be passed to user space.

Yes, this is expected.

But I also see x86/split lock detection: disabled earlier in the log, so I'm assuming it's actually disabled now.

Sounds about right. Did it solve your issue?

Unfortunately, it didn't solve the issue.

AnErrupTion avatar Aug 14 '24 15:08 AnErrupTion

@AnErrupTion can you post new logs with split lock disabled?

tpressure avatar Aug 14 '24 15:08 tpressure

Ah yes, my bad. Here they are:

dmesg.log Windows 11-2024-08-14-17-15-07.log

AnErrupTion avatar Aug 14 '24 15:08 AnErrupTion

It looks a little better, and the guest is definitely trying to use the GPU:

00:00:07.099476 VFIO: RegisterBar 0xf0000000 
00:00:07.099500 VFIO: RegisterBar 0x800000000 
00:00:07.099501 VFIO: RegisterBar 0x900000000 
00:00:07.099503 VFIO: RegisterBar 0x6000 
00:00:07.099809 VFIO: Activate MSI count: 1

and

[   43.766761] vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)

I assume this card needs some kind of quirk. I can maybe look into this in a couple of weeks.

Can you upload the output of lspci -vvvn please?

tpressure avatar Aug 14 '24 15:08 tpressure

I assume this card needs some kind of quirk. I can maybe look into this in a couple of weeks.

I'm not sure it does, since passing through the same GPU with QEMU works just fine (no additional quirks or shenanigans needed).

Can you upload the output of lspci -vvvn please?

Alright, here's the output (when run as root): lspci.log
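As a hedged illustration of what to look for in that lspci output, here is a hypothetical excerpt (device IDs modeled on the VEN_10DE&DEV_25A2 path quoted later in this thread, other values invented) and a grep for the MSI capability state:

```shell
# Hypothetical excerpt of `lspci -vvvn -s 01:00.0`; the 10de:25a2 IDs
# match this report, the rest of the line is illustrative only.
cat > /tmp/lspci_sample.txt <<'EOF'
01:00.0 0300: 10de:25a2 (rev a1)
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
EOF

# The interesting bit for the MSI discussion is whether the MSI
# capability is present and currently enabled (Enable+):
grep -o 'MSI: Enable[+-]' /tmp/lspci_sample.txt
```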

AnErrupTion avatar Aug 14 '24 15:08 AnErrupTion

I'm not sure it does, since passing through the same GPU with QEMU works just fine (no additional quirks or shenanigans needed).

QEMU automatically applies the necessary quirks when it detects a card that needs them.

tpressure avatar Aug 14 '24 16:08 tpressure

I'm not sure it does, since passing through the same GPU with QEMU works just fine (no additional quirks or shenanigans needed).

QEMU automatically applies the necessary quirks when it detects a card that needs them.

Is there a way to know which ones it applies? I can fire up a QEMU VM if needed.

AnErrupTion avatar Aug 14 '24 16:08 AnErrupTion

Also, I forgot to mention one interesting bit: when I checked for updates in the VM, Windows Update did not download the NVIDIA driver, so I had to download it manually (it then installed fine). And in Device Manager it said that the driver in use is not the same as the POSTed graphics driver, or something along those lines. Neither of these happened with QEMU.

AnErrupTion avatar Aug 14 '24 16:08 AnErrupTion

There are quite a few NVIDIA quirks in QEMU. The quirky MSI handling is an obvious suspect, but so is the mirrored config space access in general. See this background discussion: https://patchwork.kernel.org/project/qemu-devel/patch/[email protected]/

Perhaps you can force the GPU into legacy interrupt mode instead of MSI in the Windows VM to try and work around that?

snue avatar Aug 14 '24 17:08 snue

There are quite a few NVIDIA quirks in QEMU. The quirky MSI handling is an obvious suspect, but so is the mirrored config space access in general. See this background discussion: https://patchwork.kernel.org/project/qemu-devel/patch/[email protected]/

Perhaps you can force the GPU into legacy interrupt mode instead of MSI in the Windows VM to try and work around that?

I have tried disabling MSI by setting MSISupported to 0 (instead of 1) under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\VEN_10DE&DEV_25A2&SUBSYS_13FC1043&REV_A1\3&267a616a&0&80\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties, but unfortunately the problem still persists. One interesting thing, though: in the utility I was using (MSI mode utility v3.1), my GPU doesn't appear in the list of devices at all, even though it's present in the registry and supports MSI (that last part shouldn't matter, since devices without MSI support also appear in the program's list):

[screenshot: MSI mode utility v3.1 device list]
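For reference, the registry change described above can also be expressed as a .reg file. This only mirrors the manual edit from the comment (the device instance path is copied verbatim from this report); it is not a verified fix, and it had no effect here:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\VEN_10DE&DEV_25A2&SUBSYS_13FC1043&REV_A1\3&267a616a&0&80\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties]
"MSISupported"=dword:00000000
```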

AnErrupTion avatar Aug 14 '24 18:08 AnErrupTion