KoboldAI-Client

Latest stable release can cause a BSOD on Windows 10 if KB5022834 is installed

Open levicki opened this issue 2 years ago • 3 comments

I have been using KoboldAI for a couple of months without issues. I have been updating it occasionally (option 1, stable release) and have never had any problems so far.

Today, after updating KoboldAI, I was running a chat using the KoboldAI_OPT-6.7B-Erebus model on my RTX 4090, and out of nowhere I got a BSOD.

Bugcheck code is 0x1A (MEMORY_MANAGEMENT) and the culprit seems to be dxgmms2.sys. Here is the memory dump analysis:

BUGCHECK_CODE:  1a
BUGCHECK_P1: 41790
BUGCHECK_P2: ffff950010cebb80
BUGCHECK_P3: 0
BUGCHECK_P4: 1

PROCESS_NAME:  python.exe

STACK_TEXT:  
ffff858a`88b1d358 fffff804`21633bf7     : 00000000`0000001a 00000000`00041790 ffff9500`10cebb80 00000000`00000000 : nt!KeBugCheckEx
ffff858a`88b1d360 fffff804`215f14d8     : 00000000`00000001 00002000`00000000 00000000`00000001 000001c4`1e50f000 : nt!MiDecreaseUsedPtesCount+0x1f08f3
ffff858a`88b1d3a0 fffff804`214b6233     : ffff988f`354dd700 ffff988f`354dd700 ffff988f`00000002 ffffb9dc`c0710740 : nt!MiReducePteUseCount+0x30
ffff858a`88b1d3d0 fffff804`218fb31d     : ffff988f`37daa080 ffff988f`3ebe19e8 ffffe200`00000000 fffff804`00000000 : nt!MiDecommitPages+0xf93
ffff858a`88b1dfa0 fffff804`218fa983     : ffff858a`88b1e120 00000000`00000000 ffff858a`88b1e120 ffff988f`3ebe19c0 : nt!MiDecommitRegion+0x7d
ffff858a`88b1e020 fffff804`218fa275     : 00000000`00000000 00000000`00000090 00000000`00000103 00000000`00000000 : nt!MmFreeVirtualMemory+0x6d3
ffff858a`88b1e170 fffff804`2160d8f5     : ffff988f`37daa080 00000000`00000000 00000000`00000000 ffff988f`3d5b4f90 : nt!NtFreeVirtualMemory+0x95
ffff858a`88b1e1d0 fffff804`215fec70     : fffff804`3ab60e9c ffff988f`37daa080 ffff988f`3e89b730 ffff988f`37daa8b0 : nt!KiSystemServiceCopyEnd+0x25
ffff858a`88b1e368 fffff804`3ab60e9c     : ffff988f`37daa080 ffff988f`3e89b730 ffff988f`37daa8b0 fffff804`3ab63707 : nt!KiServiceLinkage
ffff858a`88b1e370 fffff804`3ab6434a     : 00000000`04000000 ffffa889`d77ccc88 ffffa889`d77cc0e0 ffff988f`35e468e0 : dxgmms2!VIDMM_RECYCLE_RANGE::DebouncedDecommit+0x160
ffff858a`88b1e3c0 fffff804`3abc26a8     : 00000000`00000000 ffff988f`35e46f18 ffffa889`d0424101 ffffa889`d0422dd8 : dxgmms2!VIDMM_RECYCLE_HEAP_MGR::ProcessDebounceList+0x12a
ffff858a`88b1e430 fffff804`3ab65b11     : ffff988f`35e46e10 ffffa889`d0423010 ffff988f`3d6dc8a0 ffffa889`d7598b01 : dxgmms2!VIDMM_RECYCLE_HEAP_MGR::ProcessDebounceListsGlobally+0xd0
ffff858a`88b1e4c0 fffff804`3ab886ee     : ffff988f`3d794d10 ffffa889`cfea12a0 ffffa889`d7598b70 00000000`00000000 : dxgmms2!VIDMM_RECYCLE_HEAP_MGR::Free+0x211
ffff858a`88b1e500 fffff804`3ab73f0a     : ffff988f`3d794d10 00000000`00000003 ffff988f`3d794d10 ffffa889`d7598b70 : dxgmms2!VIDMM_GLOBAL::UncommitLocalBackingStore+0x172
ffff858a`88b1e590 fffff804`3ab73bf5     : 00000000`00000000 ffffa889`80017740 ffff858a`80017780 ffff988f`3d794d10 : dxgmms2!VIDMM_GLOBAL::CloseOneAllocation+0x27a
ffff858a`88b1e6d0 fffff804`3ab11eaa     : 00000000`00000003 ffffa889`cf5dfb30 ffff858a`88b1e860 00000000`00000001 : dxgmms2!VIDMM_GLOBAL::CloseAllocation+0xc5
ffff858a`88b1e720 fffff804`3aed62ba     : ffff988f`3d794d58 ffff858a`00000000 ffffa889`d464fc00 00000000`00000001 : dxgmms2!VidMmCloseAllocation+0x1a
ffff858a`88b1e760 fffff804`3aed5d8e     : 00000000`00000003 ffffa889`d47f57e0 00000000`00000003 00000000`00000000 : dxgkrnl!DXGDEVICE::DestroyAllocations+0x4c6
ffff858a`88b1e900 fffff804`3aeb8cd3     : 00000000`00000003 00000000`00000000 ffffa889`d47f57e0 ffffa889`d47f57e0 : dxgkrnl!DXGDEVICE::DestroyResource+0x5e
ffff858a`88b1e940 fffff804`3aeb8131     : 00000000`04000000 ffffa889`d464fc80 ffffa889`00000001 ffffa889`d48993a0 : dxgkrnl!DXGDEVICE::TerminateAllocations+0x8c3
ffff858a`88b1e9e0 fffff804`3aeb989a     : ffff988f`38502598 ffff858a`88b1eef0 00000000`00000000 ffffa889`d47f57e0 : dxgkrnl!DxgkDestroyAllocationInternal+0xe41
ffff858a`88b1edf0 fffff804`3aeb9c3b     : 00000038`8ade9598 000001c0`52d36280 00000000`00000000 00000000`00000000 : dxgkrnl!DxgkDestroyAllocationHelper+0x9ba
ffff858a`88b1f2e0 fffff804`2160d8f5     : 00007ffa`396c0450 ffff988f`37daa080 00000000`00000002 00000000`00000001 : dxgkrnl!DxgkDestroyAllocation2+0x21b
ffff858a`88b1f3c0 00007ffa`611449c4     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x25
00000038`8ade9558 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007ffa`611449c4

SYMBOL_NAME:  dxgmms2!VIDMM_RECYCLE_RANGE::DebouncedDecommit+160
MODULE_NAME: dxgmms2
IMAGE_NAME:  dxgmms2.sys
IMAGE_VERSION:  10.0.19041.2311
STACK_COMMAND:  .cxr; .ecxr ; kb
BUCKET_ID_FUNC_OFFSET:  160
FAILURE_BUCKET_ID:  0x1a_41790_dxgmms2!VIDMM_RECYCLE_RANGE::DebouncedDecommit
OS_VERSION:  10.0.19041.1
BUILDLAB_STR:  vb_release
OSPLATFORM_TYPE:  x64
OSNAME:  Windows 10

According to the Microsoft article linked above, the P1 value of 41790 means:

A page table page has been corrupted. On a 64-bit version of Windows, parameter 2 contains the address of the PFN for the corrupted page table page.
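For anyone comparing their own dump against this one, here is a minimal Python sketch that pulls the bug check fields out of the analysis text and prints them in the same `0x1a_41790` form that appears in FAILURE_BUCKET_ID. The function names (`parse_bugcheck`, `describe`) and the lookup table are made up for illustration; the table only contains the single P1 value quoted in this report, so any other value is reported as unknown rather than guessed.

```python
import re

# Meaning of parameter 1 for bug check 0x1A. Only the value quoted in this
# report is listed; this is NOT a complete table of MEMORY_MANAGEMENT subtypes.
MEMORY_MANAGEMENT_P1 = {
    0x41790: "A page table page has been corrupted.",
}

# Matches lines such as "BUGCHECK_CODE:  1a" or "BUGCHECK_P2: ffff950010cebb80".
FIELD_RE = re.compile(
    r"^(BUGCHECK_(?:CODE|P[1-4])):[ \t]*([0-9a-fA-F]+)[ \t]*$", re.MULTILINE
)


def parse_bugcheck(analysis: str) -> dict:
    """Return the BUGCHECK_* fields as integers (the dump prints them in hex)."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(analysis)}


def describe(fields: dict) -> str:
    """Format code and P1 like the prefix of FAILURE_BUCKET_ID, plus a meaning."""
    code = fields["BUGCHECK_CODE"]
    p1 = fields["BUGCHECK_P1"]
    if code == 0x1A:
        meaning = MEMORY_MANAGEMENT_P1.get(p1, "unknown P1 value (not in this table)")
    else:
        meaning = "not a MEMORY_MANAGEMENT (0x1A) bug check"
    return f"0x{code:x}_{p1:x}: {meaning}"


# The field block copied from the dump in this report.
SAMPLE = """\
BUGCHECK_CODE:  1a
BUGCHECK_P1: 41790
BUGCHECK_P2: ffff950010cebb80
BUGCHECK_P3: 0
BUGCHECK_P4: 1
"""

fields = parse_bugcheck(SAMPLE)
print(describe(fields))
print("P2 (address reported for the corrupted page table page):", hex(fields["BUGCHECK_P2"]))
```

If your dump shows the same code and P1 value with dxgmms2.sys on the stack, you are probably looking at the same failure.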

I understand that this is most likely a bug in either Microsoft's DirectX graphics video memory manager (which is what dxgmms2.sys is part of) or in the NVIDIA drivers.

I have not updated NVIDIA drivers on my system recently (I am using NVIDIA Studio Driver 528.24 without issues for quite a while).

However, the latest Windows Update KB5022834 from February 14, 2023 has apparently updated dxgmms2.sys (CSV with the list of files changed by the update can be downloaded here).

Note that the dxgmms2.sys file installed by KB5022834 has the same version and build number (!) as the previous one, so the only way to tell which one you have is the file size: the updated file is 831,872 bytes and the previous file is 902,992 bytes. You can also compare the counter-signature timestamp of the file's digital signature; the newer file should have a newer date.
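The size check can be scripted. Below is a minimal Python sketch, assuming the driver lives at the usual `C:\Windows\System32\drivers\dxgmms2.sys` location (that path and the helper names `classify_size` and `classify_file` are my assumptions, not something taken from the update's file list). The two sizes are the ones given in this report and only apply to file version 10.0.19041.2311; any other size is reported as unknown.

```python
import os

# File sizes in bytes as reported in this issue for dxgmms2.sys 10.0.19041.2311.
KNOWN_SIZES = {
    831_872: "updated by KB5022834",
    902_992: "previous (before KB5022834)",
}

# Assumed default driver location on a standard Windows installation.
DEFAULT_PATH = r"C:\Windows\System32\drivers\dxgmms2.sys"


def classify_size(size_bytes: int) -> str:
    """Map a file size to the variant named in this report, or 'unknown'."""
    return KNOWN_SIZES.get(size_bytes, "unknown")


def classify_file(path: str = DEFAULT_PATH) -> str:
    """Classify the file at `path` by its size on disk."""
    return classify_size(os.path.getsize(path))


if __name__ == "__main__":
    # Only touch the file system if the driver is actually present.
    if os.path.exists(DEFAULT_PATH):
        size = os.path.getsize(DEFAULT_PATH)
        print(f"{DEFAULT_PATH}: {size:,} bytes -> {classify_size(size)}")
    else:
        print(f"{DEFAULT_PATH} not found (not running on Windows?)")
```

An "unknown" result most likely means a different Windows build or a later update replaced the file, so it says nothing about whether the bug is present.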

I have uninstalled the update, and I will now test again to see if the BSOD happens again.

Hopefully this helps others as well.

levicki • Feb 18 '23 19:02

After uninstalling KB5022834, I played for 3 hours straight without encountering a BSOD again.

It is therefore very likely that the Windows Update is to blame.

levicki • Feb 18 '23 23:02

Here is a VBScript to hide the update to stop it from installing until Microsoft devs get off their a... I mean hands, and fix the issue:

' Run with cscript.exe: WScript.StdOut is not available under wscript.exe.
If WScript.Arguments.Count = 0 Then
    WScript.Echo "Syntax: HideWindowsUpdateById.vbs [Update ID]" & vbCRLF & _
                 "Examples:" & vbCRLF & _
                 "  - Hide KB940157: HideWindowsUpdateById.vbs 2ba85467-deaf-44a1-a035-697742efab0f"
    WScript.Quit 1
End If

Dim updateId
updateId = WScript.Arguments(0)

' Create a Windows Update Agent session and a searcher for it.
Dim updateSession, updateSearcher
Set updateSession = CreateObject("Microsoft.Update.Session")
Set updateSearcher = updateSession.CreateUpdateSearcher()

' Look up the update by its UpdateID (a GUID, not the KB number).
WScript.StdOut.Write "Searching for pending updates..."
Dim searchResult
Set searchResult = updateSearcher.Search("UpdateID = '" & updateId & "'")

' Mark every matching update as hidden so it is no longer offered.
Dim update, index
WScript.Echo CStr(searchResult.Updates.Count) & " found."
For index = 0 To searchResult.Updates.Count - 1
    Set update = searchResult.Updates.Item(index)
    WScript.Echo "Hiding update: " & update.Title
    update.IsHidden = True
Next

The UpdateID for KB5022834 is 6cc9192c-bb72-45fb-b141-bc2b01aa35aa.

levicki • Feb 19 '23 17:02

I submitted a question on Microsoft forums for this issue:

https://learn.microsoft.com/en-us/answers/questions/1187297/kb5022834-causing-bsod-memory-management-in-dxgmms

levicki • Mar 07 '23 12:03