Hardware reset during installation and boot of R4.2 on Ryzen 9 7950X

Open Eric678 opened this issue 2 years ago • 22 comments

Qubes OS release

R4.2.0-rc1 + Ryzen 9 7950X + Gigabyte X670E motherboard

Brief summary

Installation proceeds normally until just after "Configure networking", when the hardware resets. Subsequent boots reset just after the disk password is entered.

Steps to reproduce

Run a default installation of R4.2.0-rc1.

Expected behavior

No hardware resets.

Actual behavior

As noted.

The problem appears to be caused by a single USB controller being mapped into sys-usb. There are 5 USB controllers on the CPU and 670 chipset, and only one causes a problem. It is the last one in the devices list, at address 37:00.0.

The workaround is to add the qubes.skip_autostart option to the Linux kernel boot parameters at any boot after installation, then unmap this controller from sys-usb once the system is up.
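A minimal sketch of the unmapping step, assuming R4.2 device identifiers take the form dom0:BB_DD.F and that the controller sits at 37:00.0 as reported above. The helper names `qubes_pci_ident` and `print_detach_cmd` are hypothetical; they only print the command so it can be reviewed before being run in dom0 with sys-usb halted.

```shell
# Hypothetical helper: turn a short lspci-style address (BB:DD.F) into the
# dom0:BB_DD.F form that qvm-pci is assumed to expect on R4.2.
qubes_pci_ident() {
    printf 'dom0:%s\n' "$(printf '%s' "$1" | tr ':' '_')"
}

# Print (do not run) the command that would remove the controller
# from sys-usb.
print_detach_cmd() {
    printf 'qvm-pci detach sys-usb %s\n' "$(qubes_pci_ident "$1")"
}
```

Calling `print_detach_cmd 37:00.0` prints the `qvm-pci detach` line for the controller from this report; substitute the address that `lspci` shows on your own machine.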

I suspect that it is the on-CPU controller used for the mouse and keyboard, as others running different VM systems on the same CPU have reported that mapping a controller with active USB devices causes a hardware reset.

Eric678 avatar Jul 04 '23 19:07 Eric678

I suspect this will need a hardware quirk in the installer.

DemiMarie avatar Jul 05 '23 04:07 DemiMarie

A simpler workaround turns out to be leaving the IOMMU disabled in the firmware during installation (the above motherboard defaults to "auto" and does not know about Qubes). The installation then exits a few seconds before the point where the hardware reset would occur, with a missing IOMMU error while starting sys-firewall (presumably it was actually sys-net). The installer exits cleanly and one can immediately log in and remove the USB controller from sys-usb. I have no idea what I am missing out on in the install, and this technically invalidates all further testing of R4.2. It does seem to work rather well, actually...

Eric678 avatar Jul 06 '23 23:07 Eric678

A quick check on rc3 shows the problem is still there. However, a clean install can be made by adding the "qubes.skip_autostart" option to the vmlinuz line on the 2nd pass of installation. The installer does take notice, though oddly sys-usb is not started while sys-firewall and sys-net are, which is probably a bug. Just take the last USB controller out of sys-usb, start it, and proceed as normal. The only problem I am having with rc3 is USB devices being a bit flaky, which may be related to whatever this problem is.

Eric678 avatar Sep 23 '23 22:09 Eric678

How would one add the needed quirk to Anaconda?

DemiMarie avatar Sep 23 '23 23:09 DemiMarie

Is there a phase during installation where the installer boots sys-usb after assigning all usb devices to it?

0spinboson avatar Sep 24 '23 07:09 0spinboson

How would one add the needed quirk to Anaconda?

I don't think it's the right thing to do, at least with the current info here. It would potentially leave dom0 exposed to some USB devices, while the user would have the impression that they are all isolated in sys-usb (since that was selected during install). The proper solution is, of course, to make it not crash. But as a workaround, the user can choose not to create sys-usb during install, and later create it by hand and remove the device from it. This way they will know some device is excluded, and there is no risk of leaving it in dom0 without the user's knowledge. Such instructions should also explain the risk.

But if the device really should stay in dom0, not as a workaround for a crash but as the intended behavior, then we have a mechanism for that: the rd.qubes.dom0_usb=37:00.0 (example value) kernel option. It will leave this controller in dom0, and salt will also respect this setting when creating sys-usb. It can be added to the kernel command line at the start of installation in the GRUB menu (anaconda will carry the kernel option over to the final system too), or maybe added automatically somewhere within anaconda (which I'm very much not convinced is the right thing to do).
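As a rough illustration of adding such an option on an already installed system, the sketch below appends it to a kernel command line string only if it is not already there. The function name `add_cmdline_opt` is hypothetical, and where the resulting string is stored (for example a GRUB_CMDLINE_LINUX line in /etc/default/grub) is an assumption that depends on how the system boots.

```shell
# Append OPTION to a kernel command line string unless it is already present.
# Usage: add_cmdline_opt "existing cmdline" "option"
add_cmdline_opt() {
    case " $1 " in
        # Option already on the command line: print it unchanged.
        *" $2 "*) printf '%s\n' "$1" ;;
        # Otherwise append it, avoiding a leading space for an empty line.
        *)        printf '%s\n' "${1:+$1 }$2" ;;
    esac
}
```

For example, `add_cmdline_opt "$current" "rd.qubes.dom0_usb=37:00.0"` returns the same string on a second call, so it is safe to run repeatedly. Note marmarek's warning above: this deliberately leaves the controller in dom0.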

marmarek avatar Sep 24 '23 10:09 marmarek

Has this been reported to Gigabyte? I wonder if SMM is getting an interrupt it did not expect to get and crashes as a result.

DemiMarie avatar Sep 24 '23 15:09 DemiMarie

How would one add the needed quirk to Anaconda?

I don't think it's the right thing to do, at least with the current info here. It would potentially leave dom0 exposed to some USB devices, while the user would have the impression that they are all isolated in sys-usb (since that was selected during install). The proper solution is, of course, to make it not crash. But as a workaround, the user can choose not to create sys-usb during install, and later create it by hand and remove the device from it. This way they will know some device is excluded, and there is no risk of leaving it in dom0 without the user's knowledge. Such instructions should also explain the risk.

What if the device was attached to nothing? Don’t assign it to sys-usb, but don’t assign it to any other qube (including dom0) either. Assign it to Xen’s quarantine domain. That might avoid the crash without the security consequences.

Alternatively, what if Linux is told to not reset the device? I wonder if Linux sees that a PM reset is available, but that PM reset winds up resetting the whole system.

DemiMarie avatar Sep 24 '23 15:09 DemiMarie

That's highly unlikely. A much more likely cause is either dom0 or xen panic...

And still, I don't want to waste time on elaborate workarounds (there are already a few simple ones in this thread) until we know for sure that a proper fix is not achievable.

marmarek avatar Sep 24 '23 15:09 marmarek

Is “assign to quarantine domain” simple or elaborate?

DemiMarie avatar Sep 24 '23 16:09 DemiMarie

This is reproducible on my 7950X with an Asus Strix X670E-F, so I don't think it's Gigabyte-specific. I also have a 7900XTX, which may not be helping things.

brxken128 avatar Sep 26 '23 06:09 brxken128

Also happens to me on 7950X with Asrock X670E Steel Legend. I have two USB controllers that cause a reboot -- 16:00.4 and 17:00.0

Tehvan avatar Oct 16 '23 22:10 Tehvan

I have the same issue with my Asus Strix X670E-F. I have one "USB controller" that always causes a reboot: 12:00.0

However, I am not sure what it really is. I tried every USB port on my setup, and everything works without this "USB controller".

(I have two unused internal USB 2.0 ports on my motherboard. I have one USB controller that I can pass through in Qubes OS, but this controller never receives any USB device. I suspect it is the USB controller for my two unused internal USB 2.0 ports.)

For the people having this issue: are you missing any USB port or functionality without the "USB controller" that you cannot pass through?

Result of "sudo lspci -vvs 12:00.0"

12:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b8 (prog-if 30 [XHCI])
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 46
	Region 0: Memory at fc000000 (64-bit, non-prefetchable) [size=1M]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
	Capabilities: [64] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x16
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [a0] MSI: Enable- Count=1/8 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
		Vector table: BAR=0 offset=000fe000
		PBA: BAR=0 offset=000ff000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [2a0 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [450 v1] Lane Margining at the Receiver <?>
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci

The uncommon lines in this:

  • "Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-"
  • No "Latency" line
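To compare these fields across the affected controllers, the lines of interest can be pulled out of the `lspci -vv` text with a small filter. This is only a sketch: `pci_reset_summary` is a hypothetical name, and it reads the output for a single device on stdin. As general background rather than a finding of this thread, `NoSoftRst-` means the function is reset when it leaves D3hot, which is what makes a power-management reset possible, and `FLReset-` means Function Level Reset is not advertised.

```shell
# Read `lspci -vv` output for ONE device on stdin and print the fields
# discussed above: current power state, NoSoftRst flag and FLReset flag.
pci_reset_summary() {
    awk '
        BEGIN { state = "unknown"; soft = "unknown"; flr = "unknown" }
        {
            for (i = 1; i <= NF; i++) {
                # Power state appears on the Power Management "Status:" line.
                if ($1 == "Status:" && $i ~ /^D[0-3]/) state = $i
                if ($i ~ /^NoSoftRst[+-]$/) soft = $i
                if ($i ~ /^FLReset[+-]$/)   flr  = $i
            }
        }
        END { printf "power=%s %s %s\n", state, soft, flr }
    '
}
```

In dom0 this could be used as `sudo lspci -vvs 12:00.0 | pci_reset_summary`, once per controller, to see whether the controllers that trigger the reset differ from the ones that do not.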

neowutran avatar Oct 19 '23 21:10 neowutran

On mine, 17:00.0 is the motherboard LED controller. But since there is no problem when not using sys-usb, it should be a passthrough problem (i.e. IOMMU groups), right?

Tehvan avatar Oct 21 '23 05:10 Tehvan

iommu groups or soft reset

0spinboson avatar Oct 21 '23 07:10 0spinboson

4.2-rc4 with kernel 6.5.6: still there. The behavior is different: the install ran normally, I left the machine during the 2nd pass, and when I returned much later it was shut down. Bringing it up with qubes.skip_autostart, there were 3 USB controllers in sys-usb that were unknown, and all had to be removed for it to start. I am guessing not everything made it to disk before the reset. On the 2nd try, with qubes.skip_autostart passed to the 2nd pass, it completed the Anaconda progress bar, dropped back to the console, finished systemd-tmpfiles-clean.service, then stuck at "Job initial-setup.service/start running" for a couple of hours before I reset the machine. I took the last USB controller out of sys-usb and all seemed OK. All USB ports appear to be working (13 exposed on the outside of the motherboard, including mouse and keyboard, plus 1 that I am using internally on the motherboard). There is definitely a problem with writing to USB storage devices, which I will post separately.

[ed] While writing up that issue I had a different event: an instant power off while typing here. Had been doing various testing on USB ports and had left a storage device plugged into one of the controllers on the 670 chipset. On trying to boot I got the same power off after entering the disk password, suspecting sys-usb, I took a couple more devices out and could then get up and running and then noticed the USB drive on the back panel, removed it and could put those devices back in sys-usb and boot OK. So it looks like all it takes is for a device to be plugged into a port that is mapped to sys-usb to cause a reset or power off on start. I did plug the mouse and keyboard into the only 2 ports that are USB 2.0/1.1 that are on a USB 2.0 hub direct on the CPU, hence my original suspicion.

Eric678 avatar Oct 21 '23 21:10 Eric678

rc5-latest test did not get very far: debian-12-xfce: qubes.PostInstall service failed. See attached. No other reports? Media OK. Installing encrypted on SATA SSD while another copy (current stable) encrypted on different drive. This worked above for rc4. 20231203

Eric678 avatar Dec 04 '23 05:12 Eric678

4.2.0 with kernel 6.6.2 did not have the above installation problem. I still get a power off when starting sys-usb if the last USB device is mapped. I am not getting the power off/reset if a storage device is plugged into another controller when sys-usb is started; however, sys-usb does go into a loop of "device available" and "device removed" notifications every second, which is cleared by removing the storage device. Note that sys-net and sys-firewall are autostarted even if qubes.skip_autostart is passed to the kernel.

Eric678 avatar Dec 30 '23 04:12 Eric678

I can see the same on Supermicro M11SDV-4C-LN4F, here's log from serial from attempted boot that resulted in hard restart: xen.log

No panic, nothing unexpected in the last lines. I'm not sure why first lines (5th and 6th) look as they do. I had issue with another Supermicro board (X11-something) where the output was heavily modified by BMC (lines printed out of order with heavy jumping with ANSI escape codes, \n without \r or \n after each character depending on BIOS settings etc.), but here everything seems to work reliably, except those two lines.

I can start the OS with qubes.skip_autostart and sys-usb starts only with USB controller disabled. Unfortunately, this platform has just one controller and most likely I'll need it at some point.

krystian-hebel avatar Jan 22 '24 16:01 krystian-hebel

The same issue applies to Legion 5 Pro: The USB controller has its own IOMMU group without any other device. You can check the IOMMU grouping in the Legion 5 Pro HCL.

mahakal avatar Oct 10 '24 15:10 mahakal

Just confirming with a quick test of a 4.2.3 installation with the latest kernel: the problem is still present. Passing qubes.skip_autostart allows the 2nd pass of the install to complete (4.2.2 hung on the console after Anaconda). sys-net and sys-firewall still start when they should not. Once up and running, moving the last USB controller back into sys-usb caused a "device not found" error: two of the other controllers had somehow changed their device numbers while sys-usb was restarted. Moving the updated devices in then caused the hardware reset on starting sys-usb.

Eric678 avatar Oct 12 '24 05:10 Eric678

There are multiple problems here:

  • Qubes OS needs to know that passing certain USB controllers into a guest won’t work.
  • Device assignment MUST be done using something (such as PCI location path) that is persistent across reboots, NOT bus/slot/function.
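To illustrate the second point, a path that does not depend on bus numbering can be derived from sysfs by walking from the root bus down through each bridge and keeping only the device.function of every hop. The sketch below is an assumption-laden illustration, not the format Qubes OS uses: `pci_location_path` is a hypothetical name, and the output layout (root bus, then one `DD.F` per hop) was chosen only for readability.

```shell
# Print a bus-number-independent path for a PCI device: the root bus
# followed by the device.function of every hop down to the device.
# $1 = address in DDDD:BB:DD.F form, $2 = sysfs root (default /sys).
pci_location_path() {
    sysfs=${2:-/sys}
    # Resolve the symlink to the physical device directory.
    real=$(cd -P "$sysfs/bus/pci/devices/$1" 2>/dev/null && pwd -P) || return 1
    # Keep the part below .../devices/, e.g.
    # pci0000:00/0000:00:08.1/0000:37:00.0
    chain=${real##*/devices/}
    printf '%s\n' "$chain" | awk -F/ '{
        out = substr($1, 4)            # drop the "pci" prefix of the root
        for (i = 2; i <= NF; i++) {
            n = split($i, part, ":")   # DDDD:BB:DD.F -> keep DD.F only
            out = out "/" part[n]
        }
        print out
    }'
}
```

Two controllers that swap bus numbers after a firmware change or a device reset would still print the same path here, which is the property the assignment needs; the bus/slot/function shown by `lspci` does not have it.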

DemiMarie avatar Oct 15 '24 22:10 DemiMarie

Can confirm this issue is present on the latest Qubes OS release, 4.2.3, with the latest kernel on this system: MSI Tomahawk X670E WIFI, AMD Ryzen 9 7950X

Disabling autostart was required for a successful installation; otherwise the system would hang during sys-usb start.

inao-cz avatar Nov 15 '24 09:11 inao-cz

I could install on the MSI Tomahawk X670E with BT+WiFi disabled in the firmware, without having to disable autostart.

renehoj avatar Nov 15 '24 15:11 renehoj

I started having this issue after moving from the 6.6.x kernel to the 6.12.x kernel.

Motherboard: ASRock B650M-HDV/M.2 CPU: AMD Ryzen 7800X3D

I tried updating the BIOS to the latest from 2024-12-16. This resulted in the behavior changing from a hardware reset to a hard freeze.

Fix:

I found a BIOS setting called "XHCI Hand-off", which its description calls a "workaround" and which was enabled. After disabling it, sys-usb now claims all USB controllers successfully. There is an interesting note in dmesg:

[ 0.577659] pci 0000:00:09.0: quirk_usb_early_handoff+0x0/0x170 took 16629 usecs

This was the one USB controller (out of 4) that couldn't be claimed before disabling the XHCI hand-off "workaround"; there is no such message for the others. But now they all work.
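For anyone who wants to check their own log for the same message, the filter below extracts the PCI address and duration from lines of this form. It is a sketch only: `slow_handoff_devices` is a hypothetical name, and it assumes the message format matches the dmesg line quoted above.

```shell
# Read kernel log text on stdin and print "ADDRESS USECS" for every device
# for which the kernel logged the duration of quirk_usb_early_handoff.
slow_handoff_devices() {
    sed -n 's/.*pci \([0-9a-fA-F:.]*\): quirk_usb_early_handoff.* took \([0-9]*\) usecs.*/\1 \2/p'
}
```

In dom0, `sudo dmesg | slow_handoff_devices` lists the controllers in question; an empty result just means no such line was logged on that boot.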

tcosprojects avatar Feb 26 '25 03:02 tcosprojects

Hi, I have exactly the same hardware as you. How do I download the 6.6.x kernel, and which BIOS version do you have?

VDSFXGVBSDFGBDSF avatar Apr 23 '25 19:04 VDSFXGVBSDFGBDSF

For me, Qubes has been stable for the past 2 months with:

ASRock B650M-HDV/M.2/B650M-HDV/M.2 BIOS 3.15 12/10/2024

Linux dom0 6.12.11-1.qubes.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Jan 28 01:03:48 GMT 2025 x86_64 x86_64 x86_64 GNU/Linux

tcosprojects avatar Apr 24 '25 06:04 tcosprojects

I have the same (or at least a very similar) issue using the latest .iso (4.2.4) on an ASRock B850 PRO-A + Ryzen 7 9700X. See my HCL report for more details: https://forum.qubes-os.org/t/hcl-desktop-pc-asrock-b850-pro-a-ryzen-7-9700x/35358/6

My suspicion is the USB hub that 'holds' the USB-C alt-DP port, as it must be connected to the iGPU somehow, right? (Testing is still in progress on my hardware.)

Zrubi avatar Aug 08 '25 11:08 Zrubi

See earlier comment about "XHCI Hand-off" bios option - does it help? There were also some observations made during debugging of #8794 - for example try detaching all usb devices from that affected USB controller. It should be okay to leave alt-DP connected, but I guess you can try that too (just for the sys-usb startup time, it's okay to connect it back later). It may involve timely unplugging/plugging cables (just after sleep 10; qvm-start sys-usb command for example), but would be useful to know if that changes anything.

marmarek avatar Aug 08 '25 11:08 marmarek

I'm just about to debug those issues... but frankly, this process is really annoying if you only have a USB keyboard:

  • change something (with a working USB keyboard),
  • reboot, and very likely lose the keyboard,
  • reboot with autostart disabled,
  • dig into the logs and hope to find something useful,
  • repeat.

Zrubi avatar Aug 08 '25 12:08 Zrubi