rust-hypervisor-firmware
Windows guest bluescreen with hypervisor-fw
A Windows guest running on CH with hypervisor-fw instead of OVMF doesn't shut down correctly and hits a bluescreen:
SAC>
The SAC will become unavailable soon. The computer is shutting down.
SAC><?xml><BP>
<INSTANCE CLASSNAME="BLUESCREEN">
<PROPERTY NAME="STOPCODE" TYPE="string"><VALUE>"0x7E"</VALUE></PROPERTY><machine-info>
<name>WIN-L3C8M6IQS0Q</name>
<guid>00000000-0000-0000-0000-000000000000</guid>
<processor-architecture>AMD64</processor-architecture>
<os-version>10.0</os-version>
<os-build-number>17763</os-build-number>
<os-product>Windows Server 2019</os-product>
<os-service-pack>None</os-service-pack>
</machine-info>
</INSTANCE>
</BP>
!SAC>
SYSTEM_THREAD_EXCEPTION_NOT_HANDLED
0xFFFFFFFF80000003
0xFFFFF80137EB3B7B
0xFFFFDB8C5996F388
0xFFFFDB8C5996EBD0
The Cloud Hypervisor process then hangs and doesn't terminate. To reproduce, boot the guest and hit the shutdown button. This issue doesn't happen with OVMF.
As OVMF is currently used for the tests and seems to be the most stable option, we should first clarify the priority of switching to hypervisor-fw.
The guest will need to be debugged the usual way, in the first place to identify the issue. Any hints for debugging on the firmware side would be helpful, too.
@weltling MSHV or KVM? I know we test RFW against Windows on its CI.
(But we might not test shutdown.)
The description is about KVM. With MSHV it looks the same, stop code 0x7E and exception:
0xFFFFFFFF80000003
0xFFFFF8015DC4FB7B
0xFFFFEF03AB0BD388
0xFFFFEF03AB0BCBD0
We indeed don't explicitly test shutdown in the CH integration tests; they always just wait 1 minute and then kill the guest. I've got at least one similar shutdown issue to report to CH (not hypervisor-fw related), but I'm still digging.
I'll also try to run the integration tests with the latest hypervisor-fw swapped in.
Thanks
I patched the script locally to pick hypervisor-fw instead of OVMF and ran the integration test suite under KVM. It doesn't show any firmware-specific issues. As expected, this issue is not caught by the tests. test_windows_guest_netdev_hotplug might be a bit unstable, but that's not relevant for this particular report.
Given OVMF is currently used, we need to clarify the priority of switching to hypervisor-fw. A work item to split off from here could be adding an explicit shutdown test to the CH integration suite. While shutdown crashes are probably not that bad, they would still be nice to fix.
Thanks
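As a rough illustration of the explicit shutdown test suggested above: rather than waiting a fixed minute and killing the guest, a test could request shutdown and then assert that the VMM process exits on its own within a deadline. The sketch below uses only the Rust standard library. The names `poll_until` and `wait_for_vmm_exit` are hypothetical and not part of the CH test infrastructure, and how the shutdown is requested in the guest is left out.

```rust
use std::process::{Child, ExitStatus};
use std::time::{Duration, Instant};

/// Call `probe` repeatedly until it yields a value or `timeout` elapses.
/// Returns `None` on timeout. (Hypothetical helper, not from the CH test suite.)
fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = probe() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        std::thread::sleep(interval);
    }
}

/// Wait for the VMM child process to exit after a guest shutdown was requested.
/// `None` means the process was still running at the deadline, which is the
/// hang described in this issue.
fn wait_for_vmm_exit(vmm: &mut Child, timeout: Duration) -> Option<ExitStatus> {
    poll_until(timeout, Duration::from_millis(100), || {
        // try_wait() is non-blocking: Ok(None) means "still running".
        vmm.try_wait().ok().flatten()
    })
}
```

A test would then fail whenever `wait_for_vmm_exit` returns `None`, which should catch the case where the Cloud Hypervisor process keeps hanging after the guest bluescreens.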
With debugger attached, I can see two crashes.
- Happens at boot, most likely a timing issue:
0: kd> k
# Child-SP RetAddr Call Site
00 fffff803`18855b78 fffff803`165ec8e8 nt!DbgBreakPointWithStatus
01 fffff803`18855b80 fffff803`1662ed06 nt!KdCheckForDebugBreak+0x928c0
02 fffff803`18855bb0 fffff803`164cb3f4 nt!KeAccumulateTicks+0x1607d6
03 (Inline Function) --------`-------- nt!KiUpdateRunTime+0x43
04 (Inline Function) --------`-------- nt!KiUpdateTime+0x42a
05 fffff803`18855c10 fffff803`16e88332 nt!KeClockInterruptNotify+0x604
06 (Inline Function) --------`-------- hal!HalpTimerClockInterruptEpilogCommon+0xe
07 (Inline Function) --------`-------- hal!HalpTimerClockInterruptCommon+0xdc
08 fffff803`18855f30 fffff803`16425c65 hal!HalpTimerClockInterrupt+0xf2
09 fffff803`18855f60 fffff803`165d03ca nt!KiCallInterruptServiceRoutine+0xa5
0a fffff803`18855fb0 fffff803`165d0917 nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
0b fffff803`18846590 fffff803`16ea09cf nt!KiInterruptDispatchNoLockNoEtw+0x37
0c fffff803`18846728 fffff803`1659c816 hal!HalProcessorIdle+0xf
0d fffff803`18846730 fffff803`164cd1bb nt!PpmIdleDefaultExecute+0x16
0e fffff803`18846760 fffff803`164cc96f nt!PpmIdleExecuteTransition+0x6bb
0f fffff803`18846a80 fffff803`165d23bc nt!PoIdle+0x33f
10 fffff803`18846be0 00000000`00000000 nt!KiIdleLoop+0x2c
This one seems to happen because the boot went too slowly and ticks expire too fast. Continuing through this one seems to get the system going, though.
- This one is at shutdown.
00 fffff803`18859c10 fffff803`16eb5da4 hal!HalpPowerWriteResetCommand+0x10f
01 fffff803`18859c50 fffff803`16eb7381 hal!HalpInterruptResetThisProcessor+0x164
02 fffff803`18859c80 fffff803`16ebef4a hal!HalpInterruptRebootService+0x41
03 fffff803`18859cb0 fffff803`166a21d0 hal!HalpPreprocessNmi+0x2a
04 fffff803`18859ce0 fffff803`165d9c02 nt!KiProcessNMI+0x30
05 fffff803`18859d30 fffff803`165d99c6 nt!KxNmiInterrupt+0x82
06 fffff803`18859e70 fffff803`16eb7bba nt!KiNmiInterrupt+0x206
07 ffff998d`e405f720 fffff803`16eb7863 hal!HalpShutdown+0x2a
08 ffff998d`e405f780 fffff803`16eb7a5e hal!HalReturnToFirmware
09 ffff998d`e405f7b0 fffff803`169805ce hal!HalpLegacyShutdown+0xe
0a ffff998d`e405f7e0 fffff803`1698033a nt!PopHandleNextState+0x1ee
0b ffff998d`e405f830 fffff803`16980030 nt!PopIssueNextState+0x1a
0c ffff998d`e405f860 fffff803`16995010 nt!PopInvokeSystemStateHandler+0x29c
0d ffff998d`e405fa70 fffff803`16993c1a nt!PopShutdownSystem+0x8c
0e ffff998d`e405fab0 fffff803`1650320a nt!PopGracefulShutdown+0x2ea
0f ffff998d`e405faf0 fffff803`164709d5 nt!ExpWorkerThread+0x16a
10 ffff998d`e405fb90 fffff803`165d5e3c nt!PspSystemThreadStartup+0x55
11 ffff998d`e405fbe0 00000000`00000000 nt!KiStartSystemThread+0x1c
Both cases seem to land in a code path invoking DbgBreakPoint(), whereby the second one is conditional on the firmware being EFI. Also, I don't seem to hit the second code path at all with OVMF. Comparing what exactly is provided vs. used by hypervisor-fw and OVMF could help, too, as the issue still looks firmware dependent.
Thanks
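For reference, the four values printed after the 0x7E stop code in the SAC output can be decoded mechanically. For SYSTEM_THREAD_EXCEPTION_NOT_HANDLED the parameters are, in order, the exception code, the address where the exception occurred, the exception record address, and the context record address. The exception code 0x80000003 is STATUS_BREAKPOINT, which is consistent with the DbgBreakPoint() observation. A minimal sketch follows; the function names are made up for illustration and the status table covers only a few well-known codes.

```rust
/// Map a few well-known NTSTATUS values to their names (deliberately incomplete).
fn ntstatus_name(code: u32) -> &'static str {
    match code {
        0x8000_0003 => "STATUS_BREAKPOINT",
        0x8000_0004 => "STATUS_SINGLE_STEP",
        0xC000_0005 => "STATUS_ACCESS_VIOLATION",
        0xC000_001D => "STATUS_ILLEGAL_INSTRUCTION",
        _ => "unknown",
    }
}

/// Describe the four parameters of a 0x7E (SYSTEM_THREAD_EXCEPTION_NOT_HANDLED)
/// bug check as printed in the bluescreen output.
fn describe_0x7e(params: [u64; 4]) -> String {
    // Parameter 1 is the 32-bit NTSTATUS sign-extended to 64 bits,
    // so only the low 32 bits are meaningful.
    let status = params[0] as u32;
    format!(
        "exception {:#010X} ({}) at {:#018X}; exception record {:#018X}; context record {:#018X}",
        status,
        ntstatus_name(status),
        params[1],
        params[2],
        params[3],
    )
}
```

Feeding it the KVM values from the report, `describe_0x7e([0xFFFFFFFF80000003, 0xFFFFF80137EB3B7B, 0xFFFFDB8C5996F388, 0xFFFFDB8C5996EBD0])`, identifies the exception as a breakpoint raised at 0xFFFFF80137EB3B7B.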