The Windows (preview) version causes Windows 11 crash with DPC_WATCHDOG_VIOLATION (133)

Open binxie33 opened this issue 2 years ago • 4 comments

What is the issue?

I am running the Windows (preview) version on Windows 11 with an NVIDIA 4070 Ti (12 GB of GPU memory).

The NVIDIA driver is the latest version, 552.22, and CUDA is the latest version, 12.4.1. When answering some questions with relatively lengthy outputs, the whole computer hangs and then crashes with the error DPC_WATCHDOG_VIOLATION (133). I have tried different models, such as deepseek-coder and wizardlm2, and encountered the same problem.

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.1.32

binxie33, Apr 19 '24 02:04

This sounds like it may be an NVIDIA driver bug, or possibly a hardware fault. Did you get a BSOD? Did it report which driver was hung?

dhiltgen, Apr 19 '24 21:04

Yes, it hit a BSOD after hanging for a while. The bugcheck details follow. They point to nvlddmkm.sys, which is the NVIDIA display driver. I had already installed the latest NVIDIA driver and it did not help.
`10: kd> !analyze -v


*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of
      time at DISPATCH_LEVEL or above.
Arg2: 0000000000001e00, The watchdog period (in ticks).
Arg3: fffff80069f1c340, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
      additional information regarding the cumulative timeout
Arg4: 0000000000000000

Debugging Details:

BUGCHECK_CODE: 133

BUGCHECK_P1: 1

BUGCHECK_P2: 1e00

BUGCHECK_P3: fffff80069f1c340

BUGCHECK_P4: 0

FILE_IN_CAB: 041824-8484-01.dmp

DUMP_FILE_ATTRIBUTES: 0x1808 Kernel Generated Triage Dump

DPC_TIMEOUT_TYPE: DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED

BLACKBOXBSD: 1 (!blackboxbsd)

BLACKBOXNTFS: 1 (!blackboxntfs)

BLACKBOXPNP: 1 (!blackboxpnp)

BLACKBOXWINLOGON: 1

CUSTOMER_CRASH_COUNT: 1

PROCESS_NAME: Code.exe

STACK_TEXT:
ffffd2013256b9d8 fffff800694e3739 : 0000000000000133 0000000000000001 0000000000001e00 fffff80069f1c340 : nt!KeBugCheckEx
ffffd2013256b9e0 fffff800694e2884 : 0000be4794caa2e8 ffffd20132551180 00000000003bc48e 0000000000000000 : nt!KeAccumulateTicks+0x239
ffffd2013256ba40 fffff800694e453f : 000000000000001c 0000000000001388 00000000003bc400 0000000000239fd7 : nt!KiUpdateRunTime+0xf4
ffffd2013256bc00 fffff800694e08f8 : 0000000000000000 ffffd2013644da00 ffffd20132551180 0000000000000000 : nt!KiUpdateTime+0x63f
ffffd2013256bea0 fffff800694e01ba : fffff80069e5fe60 ffffd2013644dab0 ffffd2013644dab0 0000000000000002 : nt!KeClockInterruptNotify+0x228
ffffd2013256bf40 fffff80069467e5c : 0000008e7f62ee25 ffffe706d81528a0 ffffe706d8152950 fffff8006961a38b : nt!HalpTimerClockInterrupt+0x10a
ffffd2013256bf70 fffff8006961a5ea : ffffb98693716d10 ffffe706d81528a0 0000000000900494 0000000000000000 : nt!KiCallInterruptServiceRoutine+0x9c
ffffd2013256bfb0 fffff8006961aeb7 : 000000000090047c fffff8006961aec4 0000000000900490 ffffb98693716e38 : nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
ffffb98693716c90 fffff800966dd510 : fffff800966dda4a ffffe706ea9c7000 fffff80096704fde ffffe706ea9c7bf0 : nt!KiInterruptDispatchNoLockNoEtw+0x37
ffffb98693716e28 fffff800966dda4a : ffffe706ea9c7000 fffff80096704fde ffffe706ea9c7bf0 ffffb98693716e60 : nvlddmkm+0xed510
ffffb98693716e30 ffffe706ea9c7000 : fffff80096704fde ffffe706ea9c7bf0 ffffb98693716e60 ffffe70600000000 : nvlddmkm+0xeda4a
ffffb98693716e38 fffff80096704fde : ffffe706ea9c7bf0 ffffb98693716e60 ffffe70600000000 fffff80000000020 : 0xffffe706ea9c7000
ffffb98693716e40 ffffe706ea9c7bf0 : ffffb98693716e60 ffffe70600000000 fffff80000000020 ffffb98693716ea0 : nvlddmkm+0x114fde
ffffb98693716e48 ffffb98693716e60 : ffffe70600000000 fffff80000000020 ffffb98693716ea0 0000000000000000 : 0xffffe706ea9c7bf0
ffffb98693716e50 ffffe70600000000 : fffff80000000020 ffffb98693716ea0 0000000000000000 0000000000000000 : 0xffffb98693716e60
ffffb98693716e58 fffff80000000020 : ffffb98693716ea0 0000000000000000 0000000000000000 0000000000900494 : 0xffffe70600000000
ffffb98693716e60 ffffb98693716ea0 : 0000000000000000 0000000000000000 0000000000900494 0000000000000000 : 0xfffff80000000020
ffffb98693716e68 0000000000000000 : 0000000000000000 0000000000900494 0000000000000000 0000000000000000 : 0xffffb98693716ea0

SYMBOL_NAME: nvlddmkm+ed510

MODULE_NAME: nvlddmkm

IMAGE_NAME: nvlddmkm.sys`

binxie33, Apr 24 '24 01:04

How much system RAM do you have? How big is the model you've been loading into Ollama?

mtavenrath, May 03 '24 07:05

@binxie33 our current theory is that you ran low on system memory while loading a large model, possibly with VRAM paging taking place on the GPU. If you can repro this failure, can you run `nvidia-smi` to see whether the GPU's VRAM is ~full? The output of `systeminfo | find "Virtual Memory"` may also help shed some light on what's going on. We'd also like to understand how much physical RAM you have in the system.
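For anyone gathering the VRAM figures requested here, the following is a minimal sketch (not from the thread) that reads used and total GPU memory through `nvidia-smi`'s CSV query mode and reports how full each GPU is. The helper names `parse_vram_csv` and `query_vram` are hypothetical, and the script assumes `nvidia-smi` is on the PATH; it returns an empty list otherwise.

```python
import shutil
import subprocess


def parse_vram_csv(text):
    """Parse 'used, total' MiB pairs, one GPU per line (hypothetical helper)."""
    gpus = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({
            "used_mib": used,
            "total_mib": total,
            "used_pct": round(100 * used / total, 1),
        })
    return gpus


def query_vram():
    """Ask nvidia-smi for per-GPU memory usage; empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=memory.used,memory.total",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_vram_csv(result.stdout)


if __name__ == "__main__":
    for index, gpu in enumerate(query_vram()):
        print(f"GPU {index}: {gpu['used_mib']} / {gpu['total_mib']} MiB "
              f"({gpu['used_pct']}% used)")
```

Running it while a model is loaded, shortly before the hang would normally occur, gives a figure that can be pasted into the issue alongside the `systeminfo` output.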

dhiltgen, May 21 '24 18:05

If you're still having problems, please share the information above and I'll re-open.

dhiltgen, May 31 '24 21:05