System interrupts consume 100% CPU
Hi,
System interrupts consume 100% CPU when I boot the OS from the BugChecker boot entry with DSE disabled. It happens soon after system start. The OS becomes very slow and freezes when I try to load BugChecker.sys from SymLoader. It also fails to detect the frame buffer location, but I guess that's a different issue... So let's start with the interrupt consumption. I dumped memory using livekd and ran the !dpcs command (no DPCs), then ran !process 0 1f to see whether any threads were servicing ISRs (I only found one nt!HalpInterruptSendIpi call).
What could be the problem here? Feel free to ask me to run commands on the dump.
I ran it on bare-metal Windows 10 22H2, 64-bit.
Hi,
BugChecker is not so well tested on bare metal. However, I guess that this problem is linked to this other issue: https://github.com/vitoplantamura/BugChecker/issues/2 , since you said that the problem happens before loading the main sys driver.
Moreover, since you said that you found a reference to the HalpInterruptSendIpi function, my guess is that the OS sends a "state change" to the kernel debugger, and the debugger doesn't respond accordingly (because the BugChecker main sys driver is not loaded yet).
So, to sum up, the problem is in the KDCOM project: if you look at the KdSendPacket function there, you'll see that it is almost empty (it simply forwards the event to the main BugChecker driver, if it is loaded and attached). So a possible approach to get a better understanding of what's going on in your test case is to log the arguments passed to KdSendPacket in the KDCOM dll (I bet it's a StateChange64 event).
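A minimal sketch of the logging idea above, written as a user-mode model rather than actual KDCOM code. The `KD_STRING` stand-in type, the ring buffer, and the names `log_kd_send_packet` and `count_state_changes` are hypothetical, and the packet type value is an assumption taken from memory of the public KD protocol headers that should be verified against them. It assumes that the first 32-bit field of a state-change header is the new state code.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the counted STRING that KdSendPacket receives; in a real
   KDCOM build the type comes from the Windows headers. */
typedef struct {
    uint16_t Length;
    uint16_t MaximumLength;
    char    *Buffer;
} KD_STRING;

/* Assumed value; verify against the KD protocol headers before use. */
#define PACKET_TYPE_KD_STATE_CHANGE64 7u

#define LOG_CAPACITY 64u

typedef struct {
    uint32_t packet_type;  /* first argument of KdSendPacket */
    uint32_t new_state;    /* first 32 bits of the header, 0 if unavailable */
    uint16_t header_len;
    uint16_t data_len;
} kd_log_entry;

static kd_log_entry g_log[LOG_CAPACITY];
static uint32_t     g_log_count;  /* total packets seen, may exceed capacity */

/* Record one call. No allocation and no I/O: only a fixed ring buffer is
   touched, so the log can be read later from a memory dump. */
void log_kd_send_packet(uint32_t packet_type,
                        const KD_STRING *header,
                        const KD_STRING *data)
{
    kd_log_entry *e = &g_log[g_log_count % LOG_CAPACITY];
    e->packet_type = packet_type;
    e->header_len  = header ? header->Length : 0;
    e->data_len    = data ? data->Length : 0;
    e->new_state   = 0;
    if (header && header->Buffer && header->Length >= sizeof(uint32_t))
        memcpy(&e->new_state, header->Buffer, sizeof(uint32_t));
    g_log_count++;
}

/* Count how many entries currently in the ring are StateChange64 events. */
uint32_t count_state_changes(void)
{
    uint32_t n = g_log_count < LOG_CAPACITY ? g_log_count : LOG_CAPACITY;
    uint32_t hits = 0;
    for (uint32_t i = 0; i < n; i++)
        if (g_log[i].packet_type == PACKET_TYPE_KD_STATE_CHANGE64)
            hits++;
    return hits;
}
```

A ring buffer is used here because the transport may be called while the rest of the system is frozen, so anything that allocates or prints is best avoided. If the storm really is a StateChange64 loop, `g_log_count` would be expected to keep growing with the same packet type.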
--Vito
> BugChecker is not so well tested on bare metal
Oh, that's sad. Local debugging is a very important advantage, it would be nice to have it.
> because the BugChecker main sys driver is not loaded yet
If this is correct, loading the driver ASAP should solve the problem, right?
> So a possible approach to get a better understanding of what's going on in your test case is to log the arguments passed to KdSendPacket in the KDCOM dll (I bet it's a StateChange64 event).
From your intuition/approximation: how hard is it to implement a full-fledged KdSendPacket function? Or at least something that is more or less stable on bare metal?
Yes, if you can, try to load the driver ASAP after system restart.
In order to get the framebuffer details, since the auto detect feature fails, you can get the address from the Device Manager; set width and height to your screen dimensions and set stride to 0. Don't forget to disable the display drivers before doing this test (more details in the main README.md).
For the KDCOM KdSendPacket fix, I'll have some free time next month.
Hi Vito,
I managed to start the driver before the storm, but it didn't help: interrupts started consuming CPU time ~1-2 minutes after system start. I made one more dump and will examine it, but it feels like there is a more fundamental issue here.
-- Michael.
Hi Michael, thank you for your feedback.
A question: after starting the driver, did you manage to enter the BugChecker UI successfully (and to resume system execution afterwards)?
I'm trying to understand if this interrupt problem is triggered by some external event that happens exactly 2-3 mins after boot, regardless of your interaction with BugChecker.
No, I didn't even try to break in. Frame buffer autodetect doesn't work, though I see that NativeUtil detects memory resources correctly (at least they match Device Manager's info). And I am not sure what the resolution is when the driver is disabled. So, no break-in attempt so far.
Looks more like an external event that triggers the slowdown.
Can you try to break into BC (PrintScr key), then to exit from the UI (F5 key) and then to wait until the interrupt problem happens?
In order to configure the framebuffer address, you should try all the addresses returned by NativeUtil.
Thanks, --Vito
Hi Vito,
Got a lot of progress with BugChecker. What I did:
- Disabled video drivers, set framebuffer parameters to: 640 * 480, address from resources.txt, stride = 0
- Rebooted from BC menu
- Immediately after start ran SymLoader and started the driver
- Broke in. The OS has stopped but no BC UI appeared
- Pressed F5. The OS unfroze and continued working
So, no more interrupt storm! During the test I waited a bit and stopped and started the OS a few times, and everything worked fine. It looks like the stop-continue workaround prevents the storm.
After that I tried framebuffer autodetection and it worked with no error messages, though the detected parameters look incorrect. BugChecker detected a resolution of 3840*2400 with a stride of 15360. This is my working resolution, but it's definitely not what I see in VGA mode. Note that I run BugChecker on a laptop with two graphics cards: built-in Intel and NVidia. For some reason NativeUtil only detected the Intel card before today, so I only tried 640*480 with the 2 memory buffers belonging to the Intel card. The UI did not appear. I believe (though I'm not sure) that the laptop uses the Intel card to output to the built-in display.
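As a quick sanity check on those numbers, here is a small sketch of the usual stride arithmetic. It assumes a tightly packed 32-bit-per-pixel framebuffer with no row padding, which the thread does not confirm; the function names are illustrative only.

```c
#include <stdint.h>

/* Bytes per scan line for a tightly packed framebuffer (no row padding). */
uint32_t packed_stride_bytes(uint32_t width_px, uint32_t bits_per_pixel)
{
    return width_px * (bits_per_pixel / 8u);
}

/* Byte offset of pixel (x, y) from the start of the framebuffer. */
uint64_t pixel_offset(uint32_t x, uint32_t y,
                      uint32_t stride_bytes, uint32_t bits_per_pixel)
{
    return (uint64_t)y * stride_bytes + (uint64_t)x * (bits_per_pixel / 8u);
}
```

Under that assumption, `packed_stride_bytes(3840, 32)` gives 15360, so the detected stride is at least consistent with a 3840-pixel-wide, 4-bytes-per-pixel surface.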
-- Michael
Perfect! Thank you Michael!
This is exactly the same problem that occurs in the issue https://github.com/vitoplantamura/BugChecker/issues/2 , i.e. an INT3 is triggered roughly 2-3 minutes after boot. INT3s require special handling by the kernel debugger; otherwise, once one is hit, the system gets stuck in an infinite StateChange64 loop (which manifests itself as our IPI interrupt storm). I'll fix the bug as soon as I have some free time.
Furthermore the "stop-continue" workaround is not necessary: just start the BugChecker driver with "some" framebuffer info, correct or incorrect doesn't matter.
For the framebuffer, I seem to recall that 640*480 is too low to render the BC UI... can you try with 800*600? Even if it is not the correct resolution, we will have confirmation that the problem is that 640*480 is too low. If the framebuffer address is correct, in the worst case garbage will appear on the screen...
thanks, --Vito
> just start the BugChecker driver with "some" framebuffer info, correct or incorrect doesn't matter.
Awesome!
I played with the framebuffer trying different resolutions and strides. Noticed that:
- NativeUtil sometimes detects NVidia, sometimes it doesn't
- Built-in Intel with a resolution of 800*600 or higher and different strides does print garbage on the screen. Usually F5 works, but sometimes the system halts.
Anyway, it looks like Intel is the right lead, but I need to find the right resolution and, most importantly, the right stride. Do you know how I can do that?
It looks like one of the problems is that EnumDisplaySettingsA returns incorrect results when the video drivers are unloaded. On my bare-metal Windows it does not return the actual resolution but the one that I set (4K), even when the drivers are disabled. This is also what I see when opening display settings.
So you are unable to determine your screen resolution when the display drivers are disabled, even with Windows' display properties, right?
PS: for the stride, try with 0
I tried stride 0 too, no UI, just some garbage.
> So you are unable to determine your screen resolution when the display drivers are disabled, even with Windows' display properties, right?
Yes. With the drivers disabled, I tried calling EnumDisplaySettingsW with iModeNum 0, 1, ..., instead of ENUM_CURRENT_SETTINGS. It looks like there is only one mode, "0", and it reports an unbelievably high resolution of 4K.
I have a feeling that calling EnumDisplaySettingsExW with the EDS_RAWMODE flag might solve the issue. I will try it later today.
You are about to hear the dumbest root cause ever. You have been warned.
The 4K resolution returned by EnumDisplaySettings was correct. The stride detected by BugChecker was correct too. What made the screen look weird and the video buffer look trashed was screen scaling set to 300%. Setting it back to 100% fixed the issue; BugChecker's UI is now visible.
When the graphics drivers are disabled, Windows uses a built-in Microsoft driver called BasicDisplay.sys. Surprisingly, it supports pretty high resolutions, like 4K. Also, Windows enables 300% scaling, which I didn't notice. 4K scaled 3 times makes the screen look very weird, like some basic 800*600 VGA. Also, because everything on the screen gets bigger, it's hard to notice that scaling is turned on. And I didn't even think that it could be on by default. I guess this note should be added to the manual. Colored red, bold font.
By the way, BasicDisplay.sys is pretty small, ~150 functions, with symbols available. I'm not sure it can be used by BugChecker, since it seems to export only a DX interface, but maybe its frame buffer detection code can be a source of inspiration.
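To put rough numbers on that effect, here is a small sketch of the scaling arithmetic. It is a simplification of how Windows display scaling behaves, and the function name is illustrative: the framebuffer keeps its physical size while the desktop is laid out as if it were smaller.

```c
#include <stdint.h>

/* Apparent (logical) extent of one desktop dimension at a given display
   scaling percentage, e.g. 300 for 300%. */
uint32_t logical_extent(uint32_t physical_px, uint32_t scale_percent)
{
    return physical_px * 100u / scale_percent;
}
```

With the values from this thread, `logical_extent(3840, 300)` is 1280 and `logical_extent(2400, 300)` is 800, so a 3840*2400 framebuffer at 300% looks like a low-resolution desktop even though the resolution and stride reported for the framebuffer are unchanged.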
Ah ok, thank you Michael.
At the time I did extensive research on the possibility of hooking into WDDK in order to get the characteristics of the video framebuffer. Unfortunately, the current approach of BugChecker, which is a memory scan that attempts to guess the beginning of the framebuffer (more precisely, of the first and second video lines), seems to be the only one possible.
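The following is a toy illustration of the general idea of such a scan, not BugChecker's actual algorithm. It assumes that a known marker sequence has been drawn at the start of the first two video lines, and it searches an ordinary byte buffer instead of mapped physical memory; all names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOT_FOUND ((size_t)-1)

/* Return the offset of the first occurrence of pat in mem at or after
   `from`, or NOT_FOUND. */
size_t find_pattern(const uint8_t *mem, size_t mem_len,
                    const uint8_t *pat, size_t pat_len, size_t from)
{
    if (pat_len == 0 || pat_len > mem_len)
        return NOT_FOUND;
    for (size_t i = from; i + pat_len <= mem_len; i++)
        if (memcmp(mem + i, pat, pat_len) == 0)
            return i;
    return NOT_FOUND;
}

/* Locate the first two video lines by their marker and derive the stride
   as the distance between them. Returns 1 on success and fills base and
   stride, 0 otherwise. */
int guess_framebuffer(const uint8_t *mem, size_t mem_len,
                      const uint8_t *marker, size_t marker_len,
                      size_t *base, size_t *stride)
{
    size_t first = find_pattern(mem, mem_len, marker, marker_len, 0);
    if (first == NOT_FOUND)
        return 0;
    size_t second = find_pattern(mem, mem_len, marker, marker_len,
                                 first + marker_len);
    if (second == NOT_FOUND)
        return 0;
    *base = first;
    *stride = second - first;
    return 1;
}
```

Finding both the first and the second line is what makes the stride recoverable: the start of the first line gives the framebuffer base, and the distance to the second gives the bytes per scan line.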
Thank you, --Vito