illumos-joyent

panic on boot with "XHCI runtime reset required"

Open ingenthr opened this issue 8 years ago • 14 comments

I recently upgraded to 20170303 and then to 20170315. On boot, apparently before the zvol is up (judging by my attempt to take a system dump), I get a panic.

WARNING: xhci1: abort command timed out: resetting device
panic[cpu2]/thread=ffffff001ea81c40: XHCI runtime reset required

Warning - stack not written to the dump buffer
ffffff001ea81b60 xhci:xhci_soft_state+37c103b2 ()
ffffff001ea81c20 genunix:taskq_thread+2d0 ()
ffffff001ea81c30 unix:thread_start+8 ()

(The above is typed by hand, as I unfortunately have no serial on this system. Typos are all my own.)

I received some help from bahamat in #smartos who suggested filing here. I'll see if I can get some more info out of kmdb on what's in scope, etc.

ingenthr Mar 22 '17 23:03

Update: poking around with my limited mdb knowledge, I no longer see that thread on any of the CPUs. Let me know if there is any info I can gather from the system to help with diagnosis. Thanks in advance.

ingenthr Mar 23 '17 00:03

The most useful starting point is to run the following kinds of commands:

::stacks -m xhci
::stacks -m usba
::prtusb

Can you also provide any info about what kind of system you're using?

rmustacc Mar 23 '17 00:03

The system is an Intel X79 chipset that I assembled specifically to run SmartOS years ago. It's a Gigabyte X79-UD5.

stacks xhci stacks usba prtusb

I believe the only thing attached to a USB 3 port is a USB 2 drive: the boot media.

I also grabbed ::vars from the thread address. Let me know if there's anything you need there.

ingenthr Mar 23 '17 05:03

The BIOS had a setting for XHCI handoff, which was enabled. I disabled it, but that made no difference. I was, however, able to disable XHCI, which lets me work around the panic for now. I'd still like to help you get to the bottom of it, if that's useful.

ingenthr Mar 23 '17 05:03

We definitely want to get to the bottom of this. Two things will be useful here. First, could you run ::prtusb when the system has xhci disabled, just so we can compare? Second, the next time this happens, run the same ::stacks -m xhci, then take the thread address that it displays and run ::findstack -v on it. In this case we'd run ffffff001e928c40::findstack -v. Note that the thread address will almost certainly change on the next boot, so you can't use the exact address I have there verbatim. This will help tell us what it's hanging on trying to enable, and then we can start figuring out what's going on.
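
As a rough sketch, the sequence at the debugger prompt would look something like the following. The name <thread-addr> is a placeholder (not a literal value) for whatever address ::stacks prints on that particular boot:

```
> ::stacks -m xhci
> <thread-addr>::findstack -v
```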

rmustacc Mar 23 '17 15:03

Great, will get you that info when I am able. It'll probably be later this evening.

ingenthr Mar 23 '17 17:03

So, interesting observation just now… there's no problem if USB3/XHCI is enabled in the BIOS but the boot device is on the USB 2.0 controller.

Booting with it on the USB3 just to get info… findstack

And then booting with xhci enabled and the boot media plugged into USB 2…

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba xhci mm stmf_sbd stmf zfs sd lofs idm sata random cpc logindmux ptm sppp nfs ]
> ::prtusb
INDEX   DRIVER      INST  NODE          GEN  VID.PID     PRODUCT             
1       xhci        0     pci1458,5007  3.0  0000.0000   No Product String
2       xhci        1     pci1458,5007  3.0  0000.0000   No Product String
3       ehci        0     pci1458,5006  2.0  0000.0000   No Product String
4       ehci        1     pci1458,5006  2.0  0000.0000   No Product String
5       hubd        0     hub           2.0  8087.0024   No Product String
6       hubd        1     hub           2.0  8087.0024   No Product String
7       hid         0     keyboard      1.0  0430.0005   No Product String
8       usb_mid     0     device        1.1  0cf3.3008   Bluetooth USB Host Controller
9       scsa2usb    0     storage       2.0  058f.6387   Mass Storage

Note I booted under mdb, but since I didn't crash I just grabbed it from multiuser.

I'd booted from this same USB drive on older SmartOS releases in a USB3 port, so something has changed.

And… of course… there's no irony at all in the USB drive I selected: drive-side1 drive-side2 drive-opensolaris

ingenthr Mar 24 '17 05:03

@rmustacc is there any additional info I can provide? No rush from my perspective; I just want to be sure you have what you need.

ingenthr Mar 31 '17 23:03

@rmustacc is there anything I can do to help diagnose this, and/or are there any fixes? Thanks in advance.

ingenthr Jul 10 '17 03:07

Updated to 20211216T012707Z. Booting with OS media in USB 2 and a USB 3 attached disk on one of the USB 3.0 ports, I continue to see…

WARNING: xhci1: abort command timed out: resetting device
panic[cpu0]/thread=fffffe005c287c20: XHCI runtime reset required.


OCR'd:

[8]> ::stacks -m xhci
fffffe885c14cc28 SLEEP CV
    swtch+0x133
    cv_wait+0x68
    xhci`xhci_command_submit+0x12b
    xhci`xhci_command_enable_slot+0x4e
    xhci`xhci_hcdi_device_init+0x1b3
    usba`hubd_create_child+0x243
    usba`hubd_handle_port_connect+0x482
    usba`hubd_hotplug_thread+0x3d3
    taskq_d_thread+0xbc
    thread_start+0xb

[8]> fffffe885c14cc28::findstack -v
stack pointer for thread fffffe885c14cc28 (tq:system_taskq): fffffe805c14c630
[ fffffe805c14c630 _resume_from_idle+0x12b()
  fffffe885c14c668 swtch+0x133() ]
  fffffe805c14c6a8 cv_wait+0x68(fffffe805c14c728, fffffe430b9672f8)
  fffffe805c14c6f8 xhci`xhci_command_submit+0x12b(fffffe438b966080, fffffe805c14c710)
  fffffe805c14c778 xhci`xhci_command_enable_slot+0x4e(fffffe438b966800, fffffe43197bd012)
  fffffe805c14c878 xhci`xhci_hcdi_device_init+0x1b3(fffffe4319638a88, 3, fffffe805c14c948)
  fffffe885c14ca18 usba`hubd_create_child+0x243(fffffe43153326a8, fffffe4318b895c8, fffffe4318e68b88, 4, 3, 8)
  fffffe805c14cab0 usba`hubd_handle_port_connect+0x482(fffffe4318b895c8, 3)
  fffffe805c14cb60 usba`hubd_hotplug_thread+0x3d3(fffffe4318de9ac8)
  fffffe805c14cc00 taskq_d_thread+0xbc(fffffe4315f53720)
  fffffe805c14cc18 thread_start+0xb()

ingenthr Dec 28 '21 19:12

@rmustacc any details I can get to understand this issue better? I'm glad to go in and get some additional information or pair up and do so if it'd help. Feel free to give me some pointers to what source and what kind of poking around with mdb would be of help.

ingenthr Jan 25 '22 05:01

Sorry to make you reproduce this again, but seeing the function stack arguments via $C from kmdb on the actual panicking thread will help correlate which threads are doing what.

danmcd Jan 31 '22 18:01

@danmcd no worries at all. I'm glad to repro it as many times as needed to try to fix it. I'll get some more info here soon and report back.

ingenthr Jan 31 '22 22:01

NOTE: I'm upstreaming https://www.illumos.org/issues/14464 to make whatever we find here a completely generic illumos fix (not that I think 14464's code from SmartOS is causing this problem... it just eliminates all doubt if we upstream it).

danmcd Feb 02 '22 19:02