kvm-guest-drivers-windows icon indicating copy to clipboard operation
kvm-guest-drivers-windows copied to clipboard

Driver IRQL Not Less or Equal BSOD Error - netkvm.sys - Windows 2016

Open as1988 opened this issue 2 years ago • 16 comments

Issue: Seamingly introducing any network load triggers a BSOD with error: Stop code: DRIVER IRQL NOT LESS OR EQUAL - netkvm.sys

Steps to reproduce the behaviour: Produce network load. It is randomly triggered so it doesn't seem to relate to the volume of the network traffic. Using speedtest.net a few times seems to trigger the problem on the download phase, but I have even seen this replicated when in the middle of a Windows Update download.

image

mini-dump file (renamed ext. from .dmp to .txt): 021722-4953-01.txt

Host:

  • Disto: Centos 7

  • Kernel version: 3.10.0-1160.49.1.el7.x86_64

  • QEMU version - QEMU emulator version 2.12.0 (qemu-kvm-ev-2.12.0-44.1.el7_8.1)

  • QEMU command line: COMMAND /usr/libexec/qemu-kvm -name guest=v10805,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-91-v10805/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off -cpu host,host-cache-info=on,l3-cache=off -m 6144 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 739e0697-4422-41d8-8b9b-b17f70fa5b08 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=37,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/vg2/vsv10805-dtoqe9shgmimiyri-ghy87w0aodnookms,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on -netdev tap,fd=47,id=hostnet0,vhost=on,vhostfd=48 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:16:3e:af:23:95,bus=pci.0,addr=0x3 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 188.227.170.27:121,password -k en-gb -device VGA,id=video0,vgamem_mb=8,bus=pci.0,addr=0x2 -device AC97,id=sound0,bus=pci.0,addr=0x4 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on

  • libvirt version: libvirt 4.5.0

  • libvirt XML file v10805.txt

VM:

  • Windows version or commit hash that was used to build the driver: Windows 2016 Server version 1607 build 14393.1946
  • Which driver has a problem: netkvm
  • Driver version: 100.90.104.21500 but the same issue is also observed on 100.80.104.17300

All the virtio drivers from the latest stable are installed: https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/latest-virtio/virtio-win.iso

as1988 avatar Feb 17 '22 23:02 as1988

@as1988 Thank you for reporting the crash.

By any chance do you have also kernel crash dump? Also if you have several mini-dump, it might help as well.

YanVugenfirer avatar Feb 20 '22 08:02 YanVugenfirer

Another mini-dump here: 022022-6750-01.dmp.txt

The kernel memory dump for the above mini-dump is here: https://www.dropbox.com/s/rcnc5i3azg5qt5d/MEMORY.zip?dl=1

as1988 avatar Feb 20 '22 12:02 as1988

@as1988 Sorry, which version of the driver is in the last dump?

YanVugenfirer avatar Feb 20 '22 13:02 YanVugenfirer

@ybendito Looks like bad resource enumeration in SetupInterrruptAffinity

From first minidump:

STACK_TEXT:  
fffff802`d4a85648 fffff802`d2deba29 : 00000000`0000000a 00000000`00000000 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
fffff802`d4a85650 fffff802`d2de8b6c : 00000000`00001f40 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x69
fffff802`d4a85790 fffff806`64b85378 : ffff8404`e7fdab00 fffff802`d4a85990 00000000`00000000 00000000`00000000 : nt!KiPageFault+0x42c
fffff802`d4a85920 ffff8404`e80b1010 : 00000000`00000000 ffff8404`e7fda000 00000000`00000040 fffff806`64b740cc : netkvm!SetupInterrruptAffinity+0x198 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\wlh\ParaNdis6_Driver.cpp @ 809]
fffff802`d4a859b0 00000000`00000000 : ffff8404`e7fda000 00000000`00000040 fffff806`64b740cc 00000000`00000000 : 0xffff8404`e80b1010

YanVugenfirer avatar Feb 20 '22 13:02 YanVugenfirer

@as1988 Sorry, which version of the driver is in the last dump?

It's on 100.80.104.17300 but the same happens on the latest (stable - 100.90.104.21500) drivers too.

If you need a mini-dump and kernel memory dump when using that driver too I can provide it.

as1988 avatar Feb 20 '22 13:02 as1988

@as1988 No need yet.

YanVugenfirer avatar Feb 20 '22 13:02 YanVugenfirer

@as1988 No need yet.

Here they are anyway: 022022-5328-01.dmp.txt

Kernel memory dump: https://www.dropbox.com/s/hkaaqscwwxeogd0/MEMORY-latestdriver.zip?dl=1

as1988 avatar Feb 20 '22 14:02 as1988

STACK_TEXT:  
fffff800`4f47e2e8 fffff800`4d7faa29 : 00000000`0000000a 00000000`00000000 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
fffff800`4f47e2f0 fffff800`4d7f7b6c : 00000000`00000000 00000000`ffffffff 00000000`0000003c 00000000`00000002 : nt!KiBugCheckDispatch+0x69
fffff800`4f47e430 fffff80c`dce3ab7b : ffff880f`c2261e80 fffff800`4f47e638 00000000`00000000 00000000`00000000 : nt!KiPageFault+0x42c
fffff800`4f47e5c0 fffff80c`dce255cc : ffff880f`c49d5c00 fffff80c`dce2eec9 ffff880f`c49d5d80 fffff800`4f47e660 : netkvm!ParaNdis6_RSSGetCurrentCpuReceiveQueue+0x4b [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\wlh\ParaNdis6_RSS.cpp @ 936]
fffff800`4f47e600 fffff80c`dce242d4 : ffff880f`c2261000 00000000`00000000 ffff880f`00000040 fffff80c`dce36b50 : netkvm!RxDPCWorkBody+0x30 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_Common.cpp @ 1674]
fffff800`4f47e680 fffff80c`dce36b7c : ffff880f`c2261000 fffff800`4f47e810 00000000`00000000 00000000`00000001 : netkvm!ParaNdis_RXTXDPCWorkBody+0xb4 [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\Common\ParaNdis_Common.cpp @ 1806]
fffff800`4f47e6b0 fffff80c`da824e69 : 00000000`00000000 00000000`000006c0 af38eac3`00dbbeff 6dcff155`b4e379f1 : netkvm!MiniportMSIInterruptDpc+0x2c [C:\cygwin64\tmp\build\source\internal-kvm-guest-drivers-windows\NetKVM\wlh\ParaNdis6_Impl.cpp @ 450]
fffff800`4f47e710 fffff800`4d6f88a5 : ffff880f`c23ec690 00000000`00000217 ffff880f`c1959320 ffff880f`c1959320 : NDIS!ndisInterruptDpc+0x1c9
fffff800`4f47e890 fffff800`4d6f7e30 : ffff880f`c3578900 ffff880f`c3578900 00000000`00140001 00000000`00000000 : nt!KiExecuteAllDpcs+0x335
fffff800`4f47e9e0 fffff800`4d7ee00a : 00000000`00000000 fffff800`4d9ce180 fffff800`4da4a980 ffff880f`c4a0d040 : nt!KiRetireDpcList+0x910
fffff800`4f47ec60 00000000`00000000 : fffff800`4f47f000 fffff800`4f479000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x5a

YanVugenfirer avatar Feb 20 '22 14:02 YanVugenfirer

@as1988 Can you please reproduce the BSOD problem one more time with active logging from the driver:

  1. Please attached batch file into any writeable directory on the VM, rename it to trace.cmd
  2. Run it as an administrator (using right mouse click it is also OK), it will open Trace.cmd.txt a console window, no need to type anything
  3. Reproduce the problem and share the memory.dmp file with us (it will contain the latest logs from the driver) Thank you in advance

ybendito avatar Feb 21 '22 10:02 ybendito

@ybendito No problem, find the memory dump here: https://www.dropbox.com/s/kut29cl0tslfk96/MEMORY-afterbatchfile.zip?dl=1

as1988 avatar Feb 21 '22 10:02 as1988

@ybendito just to let you know this also applies to Windows 2019 I have discovered today.

as1988 avatar Feb 21 '22 16:02 as1988

I have had a quick look and from the etl trace we can see that right before the crash we see set OID_GEN_RECEIVE_SCALE_PARAMETERS:

0: kd> !wmitrace.logdump 0x10
(WmiTrace) LogDump for Logger Id 0x10

DBGHELP: error 0x20 opening C:\dbg\My\DbgHelp.txt
DBGHELP: error 0x20 opening C:\dbg\My\DbgHelp.txt
Found Buffers: 8 Messages: 5, sorting entries
[0]0004.00BC::02/21/2022-11:48:50.659 [netkvm.sys] ParaNdis6_OidRequest[ParaNdis6_OidRequest] OID type 1, id 0x10204(OID_GEN_RECEIVE_SCALE_PARAMETERS) of 612
[0]0004.00BC::02/21/2022-11:48:50.659 [netkvm.sys] ParaNdis6_RSSSetParameters[ParaNdis6_RSSSetParameters] RSS Params: flags 0x0007, hash information 0x1f01
[0]0004.00BC::02/21/2022-11:48:50.659 [netkvm.sys][ParaNdis6_RSSSetParameters] default queue -> -1
[0]0004.00BC::02/21/2022-11:48:50.659 [netkvm.sys] ParaNdis_SetupRSSQueueMap[ParaNdis_SetupRSSQueueMap] Entering, RSS table size = 128, # of path bundles = 1. RSS2QueueLength = 128, RSS2QueueMap =0xFFFF9785332AFC00
[0]0004.00BC::02/21/2022-11:48:50.659 [netkvm.sys] ParaNdis6_OidRequest[ParaNdis6_OidRequest] OID type 1, id 0x10204(OID_GEN_RECEIVE_SCALE_PARAMETERS) (0)

The ParaNdis6_RSSSetParameters function will call ApplySettings to update ActiveRSSScalingSettings from current adapter's RSSSettings->RSSScalingSettings: https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/NetKVM/wlh/ParaNdis6_RSS.cpp#L36

However, at the initial initialization (previous run), the same code (see line 38 https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/NetKVM/wlh/ParaNdis6_RSS.cpp#L38) nullifies CPUIndexMapping.

Here is the state of (Active)RSSScalingSettings:

0: kd> dt netkvm!PARANDIS_ADAPTER ffff978531d03000 RSSParameters.RSSScalingSettings.
   +0xe80 RSSParameters                     : 
      +0x068 RSSScalingSettings                : 
         +0x000 IndirectionTable                  : [128] _PROCESSOR_NUMBER
         +0x200 IndirectionTableSize              : 0x200
         +0x202 QueueIndirectionTable             : [128]  ""
         +0x284 RSSHashMask                       : 0x7f
         +0x288 CPUIndexMapping                   : (null) <<<<<<<<<<<<<<<
         +0x290 CPUIndexMappingSize               : 4
         +0x294 FirstQueueIndirectionIndex        : 0n0
         +0x298 DefaultProcessor                  : _PROCESSOR_NUMBER
         +0x29c DefaultQueue                      : -1 ''
DBGHELP: netkvm is not source indexed
0: kd> dt netkvm!PARANDIS_ADAPTER ffff978531d03000 RSSParameters.ActiveRSSScalingSettings.
   +0xe80 RSSParameters                           : 
      +0x338 ActiveRSSScalingSettings                : 
         +0x000 IndirectionTable                        : [128] _PROCESSOR_NUMBER
         +0x200 IndirectionTableSize                    : 0x200
         +0x202 QueueIndirectionTable                   : [128]  ""
         +0x284 RSSHashMask                             : 0x7f
         +0x288 CPUIndexMapping                         : (null) <<<<<<<<<<<<<
         +0x290 CPUIndexMappingSize                     : 4
         +0x294 FirstQueueIndirectionIndex              : 0n0
         +0x298 DefaultProcessor                        : _PROCESSOR_NUMBER
         +0x29c DefaultQueue                            : -1 ''
DBGHELP: netkvm is not source indexed

I believe that the very primitive fix would be to check whether CPUIndexMapping is NULL and repopulate it:

diff --git a/NetKVM/wlh/ParaNdis6_RSS.cpp b/NetKVM/wlh/ParaNdis6_RSS.cpp
index 535c3835..5904fdf7 100644
--- a/NetKVM/wlh/ParaNdis6_RSS.cpp
+++ b/NetKVM/wlh/ParaNdis6_RSS.cpp
@@ -549,6 +549,12 @@ NDIS_STATUS ParaNdis6_RSSSetParameters( PARANDIS_ADAPTER *pContext,
             *ParamsBytesRead += Params->HashSecretKeySize;
         }

+        if (RSSParameters->RSSScalingSettings.CPUIndexMapping == NULL)
+        {
+            if(!AllocateCPUMappingArray(pContext, &RSSParameters->RSSScalingSettings))
+                return NDIS_STATUS_RESOURCES;
+        }
+
         ApplySettings(RSSParameters,
                     PARANDIS_RSS_MODE::PARANDIS_RSS_FULL,
                     &RSSParameters->RSSHashingSettings,

As a proof of concept, here is a new netkvm binary with the change (you just need to install the driver manually on via Device Manager): https://1drv.ms/u/s!Alp51rbA-o0kmv89c3UVxR3EUQFwoQ?e=QBE8YE

@as1988, could you verify whether you hit the same issue with the above binary? It is signed, but you will have to accept the certificate installation.

Anyway, I'd defer the authoritative feedback to Yuri and the team. I do not know what triggers/can send OID_GEN_RECEIVE_SCALE_PARAMETERS without tearing down the adapter context first to replicate the problem in-house and the code seems to have been there for ages.

sb-ntnx avatar Feb 21 '22 19:02 sb-ntnx

@sb-ntnx I've installed that driver and attempted to replicate the issue. When I do, I get disconnected from RDP as network connectivity is lost. I can then only access the VM again via VNC. A screenshot of what I see when I reconnect: connectivitylost

I'm not sure if it's worth raising but I think I will just in case it is of relevance. When the VM is booted up a script runs to detect the quantity of CPUs and the following is applied via a powershell script to set RSS queue and max processor quantity. This is required as core counts can increase/decrease so we want the queues to be utilised fully to match the quantity of CPU cores present on the VM at that time: "Set-NetAdapterrss -name Ethernet -NumberOfReceiveQueues $corecount -MaxProcessors $corecount -Profile Conservative" Obviously there is no problem with this code but it's potentially of some help to you.

as1988 avatar Feb 21 '22 20:02 as1988

@as1988, thank you. I immediately hit the issue after running and then generating a bit of network traffic. Might be due to Conservative profile.

sb-ntnx avatar Feb 22 '22 08:02 sb-ntnx

@as1988, can you try this binary https://1drv.ms/u/s!Alp51rbA-o0kmv8-FpvMWevJIbrYUQ? It fixed the issue for me in the lab.

It has a bit different approach not to touch RSSScalingSettings if not requested:

diff --git a/NetKVM/wlh/ParaNdis6_RSS.cpp b/NetKVM/wlh/ParaNdis6_RSS.cpp
index 535c3835..c16938ee 100644
--- a/NetKVM/wlh/ParaNdis6_RSS.cpp
+++ b/NetKVM/wlh/ParaNdis6_RSS.cpp
@@ -28,7 +28,7 @@ static VOID ApplySettings(PPARANDIS_RSS_PARAMS RSSParameters,
     {
         RSSParameters->ActiveHashingSettings = *ReceiveHashingSettings;

-        if(NewRSSMode == PARANDIS_RSS_MODE::PARANDIS_RSS_FULL)
+        if(NewRSSMode == PARANDIS_RSS_MODE::PARANDIS_RSS_FULL && ReceiveScalingSettings != NULL)
         {
             if(RSSParameters->ActiveRSSScalingSettings.CPUIndexMapping != NULL)
                 NdisFreeMemory(RSSParameters->ActiveRSSScalingSettings.CPUIndexMapping, 0, 0);
@@ -532,7 +532,7 @@ NDIS_STATUS ParaNdis6_RSSSetParameters( PARANDIS_ADAPTER *pContext,
         }
         else
         {
-            SetDefaultQueue(&RSSParameters->RSSScalingSettings, defaultCPUIndex);
+            SetDefaultQueue(&RSSParameters->ActiveRSSScalingSettings, defaultCPUIndex);
             TraceNoPrefix(0, "[%s] default queue -> %d\n", __FUNCTION__, RSSParameters->ActiveRSSScalingSettings.DefaultQueue);
         }

@@ -552,7 +552,7 @@ NDIS_STATUS ParaNdis6_RSSSetParameters( PARANDIS_ADAPTER *pContext,
         ApplySettings(RSSParameters,
                     PARANDIS_RSS_MODE::PARANDIS_RSS_FULL,
                     &RSSParameters->RSSHashingSettings,
-                    &RSSParameters->RSSScalingSettings);
+                    (!(Params->Flags & NDIS_RSS_PARAM_FLAG_ITABLE_UNCHANGED)) ? &RSSParameters->RSSScalingSettings : NULL);
     }

     *ParamsBytesRead += minimalLength;

sb-ntnx avatar Feb 22 '22 10:02 sb-ntnx

@sb-ntnx Great, this appears to fix the problem. When I initially installed it I saw really low network throughput on the RX side when tested, however after a reboot the throughput is where I expect things to be, not sure if that's just a coincidence though. What are the next steps before classifying this as a safe fix? Apologies I'm not familiar with the process on these sorts of things :)

as1988 avatar Feb 22 '22 11:02 as1988