rpi-6.12.y: LAN7800 doesn't enumerate on Rpi 3B+ in specific scenario
Describe the bug
Hi, first of all I would like to apologize for cross-posting upstream issues, but I am currently stuck with this DWC2 issue and would like to make some progress with the suspend to idle support.
Currently rpi-6.12.y is affected by this problem, and I can reproduce it: if you start a Raspberry Pi 3 B Plus without USB peripherals, with only the debug UART connected, the LAN7800 chip does not enumerate. I was hoping that one of you could give some hints for analyzing the root cause. Of course it is easy to revert the offending commit, but I consider the no_clock_gating setting to be valid.
Steps to reproduce the behaviour
- build an arm64 kernel for Raspberry Pi 3B+ with bcm2711_defconfig and install it on an SD card
- enable debug UART and DWC2 host in config.txt
- disconnect all USB peripherals from the Raspberry Pi 3B+
- power on Raspberry Pi 3B+
- run a tool like lsusb to verify that LAN7800 doesn't get enumerated
Device(s)
Raspberry Pi 3 Mod. B+
System
vcgencmd version Sep 13 2024 16:00:01 Copyright (c) 2012 Broadcom version ddfba3e3c234500025b545512b4b214f28e453e9 (clean) (release) (start)
uname -a Linux raspberrypi 6.12.0-rc2-v8+ #8 SMP PREEMPT Fri Oct 11 10:41:15 CEST 2024 aarch64 GNU/Linux
Logs
lsusb
Bus 001 Device 003: ID 0424:2514 Microchip Technology, Inc. (formerly SMSC) USB 2.0 Hub
Bus 001 Device 002: ID 0424:2514 Microchip Technology, Inc. (formerly SMSC) USB 2.0 Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
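The lsusb check from the reproduction steps can be automated. The following is a minimal sketch, not part of the original report: it assumes the LAN7800 shows up with the USB ID 0424:7800 (the ID matched by the Linux lan78xx driver), and the function names are made up for illustration.

```python
import re
from typing import List

# Assumption: LAN7800 USB ID as matched by the Linux lan78xx driver
# (vendor 0x0424, product 0x7800).
LAN7800_ID = "0424:7800"


def enumerated_ids(lsusb_output: str) -> List[str]:
    """Return every vendor:product ID found in lsusb-style output."""
    ids = re.findall(r"ID ([0-9a-fA-F]{4}:[0-9a-fA-F]{4})", lsusb_output)
    return [usb_id.lower() for usb_id in ids]


def lan7800_present(lsusb_output: str) -> bool:
    """True if the LAN7800 appears in the given lsusb output."""
    return LAN7800_ID in enumerated_ids(lsusb_output)
```

Feeding it the output of `lsusb` (for example captured with `subprocess.run(["lsusb"], capture_output=True, text=True).stdout`) returns False in the failing case shown above, where only the two hubs and the root hub are listed.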
Additional context
No response
@P33M You know USB and DWC2 best. Any thoughts?
A side-effect of disabling clock gating is that the host port is no longer forcibly suspended/unsuspended as part of the enter/exit clock gating routines. This sounds suspiciously like the LAN7800 can't handle being suspended.
I wonder if setting the RESET_RESUME quirk for the Microchip hubs will kick it out of being uncommunicative?
I see that later in the mailing list thread HUB_QUIRK_DISABLE_AUTOSUSPEND is used - that's not ideal, as it will match hubs on Raspberry Pi 1, 2 and 3 devices, increasing power consumption if Ethernet is not in use.
Thanks for the hint; the quirk avoided the issue.
But I would like to come back to the original, underlying problem: the complete USB bus goes into autosuspend (port still powered, but no USB IRQs) and can only be woken up via sysfs, not by connecting a USB device. This problem can still be reproduced with a Raspberry Pi 3 A+ and a USB hub. This was the reason for choosing HUB_QUIRK_DISABLE_AUTOSUSPEND, which I'm aware is not a good solution.
Is this caused by the lack of a real USB PHY driver in Linux? Or is it the lack of runtime power management in the DWC2 driver?
Sorry, I'm a little bit lost in the complexity.
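For reference, waking the bus through sysfs can be scripted. The thread does not say which attribute was used, so this sketch assumes the standard USB runtime PM attribute `power/control` under `/sys/bus/usb/devices/<device>/`; the helper name and its `sysfs_root` parameter are hypothetical.

```python
from pathlib import Path


def set_usb_power_control(device: str, mode: str,
                          sysfs_root: str = "/sys/bus/usb/devices") -> Path:
    """Write 'on' (keep the device awake) or 'auto' (allow autosuspend)
    to <sysfs_root>/<device>/power/control and return the path written."""
    if mode not in ("on", "auto"):
        raise ValueError("mode must be 'on' or 'auto'")
    control = Path(sysfs_root) / device / "power" / "control"
    control.write_text(mode)
    return control
```

Run as root, `set_usb_power_control("usb1", "on")` would target the root hub of bus 1; writing "auto" again re-enables autosuspend.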
Ah, if connect events on the hub don't cause a remote wake, dwc2 doesn't appear to handle resume properly.
With the root port in suspend, what's the state of the debugfs regdump with the clock gating commit removed/applied? e.g. /sys/kernel/debug/usb/1000480000.usb# cat regdump on a Pi 5
I dumped the DWC2 registers on a Raspberry Pi 3A+ with a USB 2.0 hub connected after boot. I hope it's okay that I made a diff between the good case (clock gating) and the bad case (no clock gating): https://gist.github.com/lategoodbye/f22f97b379de8777176cf90113fb10e2
@P33M Is there anything useful in this dump?
I scribbled some notes down then forgot about them:
HCFG:
- No 32 kHz suspend clock
- FS/LS PHY clock is 30/60 MHz
HPRT0:
- no_cg: enable, connected, powered
- cg: enable, connected, powered, suspended, pls=D+ high (correct, bus returns to FS termination)
PCGCTL:
- no_cg: nothing set
- cg: 0x11 = PHY suspended, stop_pclk set
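The register notes above can be reproduced with a small decoder. This is a sketch under assumptions: the HPRT0 and low PCGCTL bit positions follow the field layout used by the Linux dwc2 driver headers, and PCGCTL bit 4 is labelled "phy_suspended" only because the notes read 0x11 that way. The short flag names are invented for this example.

```python
# Bit positions assumed from the Linux dwc2 register definitions.
HPRT0_FLAGS = {
    0: "connected",
    2: "enable",
    6: "resume",
    7: "suspended",
    8: "reset",
    12: "powered",
}

PCGCTL_FLAGS = {
    0: "stop_pclk",
    1: "gate_hclk",
    2: "power_clamp",
    3: "reset_power_down_module",
    4: "phy_suspended",  # assumption taken from the notes (0x11)
}


def decode_flags(value, flags):
    """Return the names of all flags set in value, lowest bit first."""
    return [name for bit, name in sorted(flags.items()) if value & (1 << bit)]


def decode_hprt0(value):
    """Decode HPRT0 flags plus the port line status field (bits 11:10)."""
    names = decode_flags(value, HPRT0_FLAGS)
    line_status = (value >> 10) & 0x3  # bit 10 = D+ level, bit 11 = D- level
    if line_status & 0x1:
        names.append("pls=D+ high")
    if line_status & 0x2:
        names.append("pls=D- high")
    return names
```

With this, `decode_flags(0x11, PCGCTL_FLAGS)` gives `['stop_pclk', 'phy_suspended']`, matching the clock gating case, and a PCGCTL value of 0 decodes to an empty list, matching the no_cg case.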
Programming model for powerdown:
- Set port suspend bit in hprt0
- set power clamps (there aren't any on bcm2835)
- stop PHY clock in PCGCTL
- Some blurb about associated platform power management
It may be the case that remote resume won't work because I don't think bcm2835 has a slow alternate PHY clock
Programming model for powerup:
- Clear stop phy clock bit
- Clear power clamps (not applicable)
- Application sets Port Resume in HPRT0
- waits 20ms
- Clears Port Resume in HPRT0
- Port should be available again?
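The two programming models above can be written out as register operations. This is only a sketch against a dict-backed fake register file, not driver code: the offsets and bit positions are assumed from the Linux dwc2 headers, `FakeRegs` models no hardware behaviour, and masking the write-1-to-clear bits before a read-modify-write mirrors what the driver's dwc2_read_hprt0() helper does. The power clamp steps are skipped because there are none on bcm2835.

```python
import time

# Register offsets and bits assumed from the Linux dwc2 headers.
HPRT0 = 0x0440
PCGCTL = 0x0E00

HPRT0_CONNDET = 1 << 1
HPRT0_ENA = 1 << 2
HPRT0_ENACHG = 1 << 3
HPRT0_OVRCURRCHG = 1 << 5
HPRT0_RES = 1 << 6
HPRT0_SUSP = 1 << 7
PCGCTL_STOPPCLK = 1 << 0

# Writing 1 to these HPRT0 bits clears them, so never write them back as read.
HPRT0_W1C = HPRT0_CONNDET | HPRT0_ENA | HPRT0_ENACHG | HPRT0_OVRCURRCHG


class FakeRegs:
    """Dict-backed stand-in for MMIO access; it only records writes."""

    def __init__(self, initial=None):
        self.values = dict(initial or {})
        self.writes = []

    def read(self, offset):
        return self.values.get(offset, 0)

    def write(self, offset, value):
        self.values[offset] = value
        self.writes.append((offset, value))


def read_hprt0(regs):
    """Read HPRT0 with the write-1-to-clear bits masked out."""
    return regs.read(HPRT0) & ~HPRT0_W1C


def power_down(regs):
    """Powerdown: set port suspend, then stop the PHY clock."""
    regs.write(HPRT0, read_hprt0(regs) | HPRT0_SUSP)
    regs.write(PCGCTL, regs.read(PCGCTL) | PCGCTL_STOPPCLK)


def power_up(regs, sleep=time.sleep):
    """Powerup: restart the PHY clock, drive resume for 20 ms, release it."""
    regs.write(PCGCTL, regs.read(PCGCTL) & ~PCGCTL_STOPPCLK)
    regs.write(HPRT0, read_hprt0(regs) | HPRT0_RES)
    sleep(0.020)
    regs.write(HPRT0, read_hprt0(regs) & ~HPRT0_RES)
```

The `sleep` parameter exists only so the 20 ms wait can be replaced in a test. Whether the port is actually usable again after `power_up` is exactly the open question in the notes.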
So the question is, does dwc2 do the powerup sequence including forcing downstream resume?
Thanks. From my understanding DWC2 is completely interrupt driven. There are two interrupt handlers in host mode (HCD of USB core + DWC2 driver), which share the same interrupt. In the bad case this interrupt doesn't fire anymore on USB connect/disconnect (no changes under /proc/interrupts), so I don't see how DWC2 could wake itself up.
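The /proc/interrupts check can be made repeatable by comparing two snapshots taken before and after plugging in a device. A minimal sketch, assuming the usual layout with a CPU header line followed by one row per IRQ; the function names are hypothetical, and the IRQ number to pass in would be the one the driver reports at probe time.

```python
def irq_count(snapshot: str, irq: int) -> int:
    """Sum the per-CPU counters of one IRQ row in a /proc/interrupts dump."""
    lines = snapshot.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == f"{irq}:":
            return sum(int(count) for count in fields[1:1 + ncpus])
    raise KeyError(f"IRQ {irq} not found")


def irq_fired(before: str, after: str, irq: int) -> bool:
    """True if the IRQ counter increased between the two snapshots."""
    return irq_count(after, irq) > irq_count(before, irq)
```

Reading `/proc/interrupts` once before and once after connecting a device, then calling `irq_fired(before, after, irq)`, returns False in the stuck state described above.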
Can you please tell me which interrupt cause is relevant in this case (port interrupt)?
I will try to translate your last question into DWC2 code. Is _dwc2_hcd_resume called?
Here are the bad case logs (including debug for function call) from the linux-usb list:
[ 2.334366] dwc2 3f980000.usb: supply vusb_d not found, using dummy regulator
[ 2.341892] dwc2 3f980000.usb: supply vusb_a not found, using dummy regulator
[ 2.400027] dwc2 3f980000.usb: DWC OTG Controller
[ 2.404868] dwc2 3f980000.usb: new USB bus registered, assigned bus number 1
[ 2.412087] dwc2 3f980000.usb: irq 51, io mem 0x3f980000
[ 2.711826] usb 1-1: new high-speed USB device number 2 using dwc2
[ 3.195838] usb 1-1.1: new high-speed USB device number 3 using dwc2
[ 3.435829] dwc2 3f980000.usb: dwc2_port_suspend
[ 3.459914] dwc2 3f980000.usb: _dwc2_hcd_suspend
[ 9.009743] dwc2 3f980000.usb: _dwc2_hcd_resume
[ 9.030667] dwc2 3f980000.usb: dwc2_port_suspend
[ 9.044137] dwc2 3f980000.usb: _dwc2_hcd_suspend
[ 9.044222] dwc2 3f980000.usb: _dwc2_hcd_resume # this suspend & resume cycle is just triggered by USB_ONBOARD_DEV and not related
[ 9.354370] usb 1-1.1: new high-speed USB device number 4 using dwc2
[ 9.584095] dwc2 3f980000.usb: dwc2_port_suspend
[ 9.599997] dwc2 3f980000.usb: _dwc2_hcd_suspend # this is the last log from DWC2 after being stuck
So in order to answer your question: I would say no
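The same conclusion can be drawn mechanically from a kernel log that has the function-call debug output enabled. A small sketch, assuming log lines in the format shown above; the helper names are hypothetical.

```python
import re

# Matches the HCD-level suspend/resume debug lines only, not dwc2_port_suspend.
EVENT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] dwc2 \S+: (_dwc2_hcd_suspend|_dwc2_hcd_resume)\b"
)


def hcd_events(log: str):
    """Return (timestamp, function) tuples in the order they were logged."""
    return [(float(stamp), name) for stamp, name in EVENT_RE.findall(log)]


def stuck_in_suspend(log: str) -> bool:
    """True if the last HCD event is a suspend with no resume after it."""
    events = hcd_events(log)
    return bool(events) and events[-1][1] == "_dwc2_hcd_suspend"
```

For the log above, the last event is `_dwc2_hcd_suspend` at 9.599997, so `stuck_in_suspend` returns True.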
The two logs look odd - the bus is addressed twice in the first case and just device 1-1.1 is addressed twice in the second. But even so, I would expect a resume to be signalled at the root port after 1-1.1 is suspended for the second time, and picked up by dwc2, but that doesn't happen.
My guess is the lack of slow PHY clock breaks remote wakeup. What happens if you never set the PCGCTL.STOPPCLK bit?
I'm not sure I can follow your suggestion, because in the bad case (no_clockgating) PCGCTL.STOPPCLK is already never set.
Is it possible you got confused by the color highlighting (bad = green, good = red in this case)?
So I think something from dwc2_host_enter_clock_gating() should also be done in the no_clockgating case?
@P33M Gentle ping ...
I'm at a loss to explain why this doesn't work. I do have access to the databook but without the implementation detail for the associated PHY (which is a Broadcom one) then I don't have any more ideas.
Thanks for the feedback so far. So at least this isn't an obvious issue.