Intel SoundWire: Glitch on SDW_CLK causes peripherals to drop off bus during runtime resume
There is a spurious pulse on the clock line when exiting clock-stop without a bus reset. The peripherals will interpret this as the first bits of a frame and so they are now out-of-sync, and as per the SoundWire spec this will cause them to revert to device 0 and require re-enumeration. However, the SoundWire driver code is expecting that all the devices are still present and enumerated. Many failures then occur.
Critically, unattach_request is not set, so the codec driver runtime_resume will not wait for initialization_complete or enumeration_complete and any attempt to read/write registers will fail until the peripheral is re-enumerated. So the codec driver fails to runtime_resume.
[ 5.898434] SDW: Invalid device for paging :0
[ 5.898507] cs35l56 sdw:0:0:01fa:3556:01:0: Read failed @0x280a1e0:-22
When the SoundWire controller runtime_resumes there is a check in intel_start_bus_after_reset() whether the Cadence controller ever powered-down:
clock_stop0 = sdw_cdns_is_clock_stop(&sdw->cdns);
If this returns TRUE, it passes bus_reset=false to sdw_cdns_clock_restart(), which then calls sdw_bus_exit_clk_stop() expecting all the peripherals to still be enumerated and halted in clock stop 0. Up to this point they are, but the wakeup of the SoundWire host emits a spurious pulse on CLOCK:
Close-up:
It should look like this:
These two spurious edges will cause the peripherals to clock in data bits for the next frame, but this isn't the start of the next frame. They are now out-of-sync with what the host is sending. When they get to where they expect the sync bits to be, they won't have the correct value and so the peripheral will Soft-Reset (as defined in the SoundWire spec). They will stop responding and will revert to not-enumerated. It takes a further 16 frames for the peripherals to acquire sync, and then they will start responding to PINGs as device 0.
The problem here is that the core SoundWire code and the codec drivers have been told that they are doing a normal exit from clock-stop 0, with all peripherals still attached to the bus. The core SoundWire code will report that:
clock stop deprepare failed:-61
but more seriously the codec driver runtime_resumes will fail to read/write registers and so the state of the peripheral doesn't match what the driver expects.
I've attached a patch that demonstrates this problem with UpX MTL + Cirrus CDB35L56-EIGHT-C board (otherwise known as AIC board).
0001-EXPERIMENTAL-SoundWire-runtime_resume-failure-testin.patch.txt
This adds a periodic delayed_work to the cs35l56 driver. At approximately 1 second intervals it will runtime_resume then runtime_put the driver. It also adds some debug to the Intel and Cadence SoundWire drivers. This quickly gets to a point where the controller for one link is in clock-stop while the controller for the other link is still active, which triggers the bug.
@rfvirgil What error will I see after applying 0001-EXPERIMENTAL-SoundWire-runtime_resume-failure-testin.patch.txt? I test it on my laptop and my topology is sof-arl-cs42l43-l0-cs35l56-l23.tplg which is quite similar to the CDB35L56-EIGHT-C board. I can see the cs35l56 keep runtime suspending and resuming. However, it seems no error occurs. It seems to be repeating the attached dmesg. dmesg_bus_clock.txt
@bardliao I suggest that you try reproducing it with less debug enabled. There are at least two races here, so turning on a lot of debug will make it more likely that the amps are re-enumerated before other code gets around to talking to them.
Also the patch I sent you was designed to get to the error case (bus_reset=0) and then stop with the minimum of extra activity, to make it easier to debug around the point of failure.
Here is a version of that patch that doesn't stop at bus_reset=0.
0001-EXPERIMENTAL-SoundWire-runtime_resume-failure-testin.patch.txt
Leave it running and you should eventually see something like this:
[ 457.372929] soundwire_intel soundwire_intel.link.0: intel_link_power_up
[ 457.373293] soundwire_intel soundwire_intel.link.0: intel_start_bus_after_reset 122 (bus_reset = 0)
[ 457.373299] sdw_cdns_clock_restart 1773
[ 457.373608] cs35l56 sdw:0:0:01fa:3556:01:3: Slave 1 state check1: UNATTACHED, status was 1
[ 457.373622] cs35l56 sdw:0:0:01fa:3556:01:2: Slave 2 state check1: UNATTACHED, status was 1
[ 457.373630] cs35l56 sdw:0:0:01fa:3556:01:1: Slave 3 state check1: UNATTACHED, status was 1
[ 457.373634] cs35l56 sdw:0:0:01fa:3556:01:0: Slave 4 state check1: UNATTACHED, status was 1
[ 457.373662] soundwire_intel soundwire_intel.link.0: intel_start_bus_after_reset 132 (bus_reset = 0)
[ 457.374828] soundwire_intel soundwire_intel.link.1: intel_link_power_up
[ 457.375134] soundwire_intel soundwire_intel.link.1: intel_start_bus_after_reset 122 (bus_reset = 0)
[ 457.375138] sdw_cdns_clock_restart 1773
[ 457.375349] cs35l56 sdw:0:1:01fa:3556:01:7: Slave 1 state check1: UNATTACHED, status was 1
[ 457.375357] cs35l56 sdw:0:1:01fa:3556:01:6: Slave 2 state check1: UNATTACHED, status was 1
[ 457.375391] cs35l56 sdw:0:1:01fa:3556:01:5: Slave 3 state check1: UNATTACHED, status was 1
[ 457.375394] cs35l56 sdw:0:1:01fa:3556:01:4: Slave 4 state check1: UNATTACHED, status was 1
[ 457.375401] soundwire_intel soundwire_intel.link.0: Program SCP Ack not received
[ 457.375632] soundwire_intel soundwire_intel.link.0: Program SCP Ack not received
[ 457.375670] soundwire sdw-master-0-0: trf on Slave 4 failed:-5 read addr a1e0 count 0
[ 457.375715] cs35l56 sdw:0:0:01fa:3556:01:0: Read failed @0x280a1e0:-5
I don't think enabling a lot of debug is relevant. The problem seems to be how the code in sdw_cdns_clock_restart() wakes the Cadence controller from clock-stop.
Actually the most important failure here is all those UNATTACHED.
[ 457.373293] soundwire_intel soundwire_intel.link.0: intel_start_bus_after_reset 122 (bus_reset = 0)
[ 457.373299] sdw_cdns_clock_restart 1773
[ 457.373608] cs35l56 sdw:0:0:01fa:3556:01:3: Slave 1 state check1: UNATTACHED, status was 1
[ 457.373622] cs35l56 sdw:0:0:01fa:3556:01:2: Slave 2 state check1: UNATTACHED, status was 1
[ 457.373630] cs35l56 sdw:0:0:01fa:3556:01:1: Slave 3 state check1: UNATTACHED, status was 1
[ 457.373634] cs35l56 sdw:0:0:01fa:3556:01:0: Slave 4 state check1: UNATTACHED, status was 1
The link did not power-down so we are not doing a bus_reset. This should be a normal exit from clock stop 0. So nothing should be detaching from the bus.
@rfvirgil Sorry for the late reply. So, there are 2 issues, 1 is the codec is unattached unexpectedly, and the other is that the driver doesn't recover from the unattached status. Am I correct?
@bardliao Unexpected unattach -> yes
Which driver are you asking about "doesn't recover" ?
@rfvirgil I think I got your issue now. However, I can't reproduce the issue on my ARL-S device. I do see bus_reset = 0, but not Slave 1 state check1: UNATTACHED, status was 1. I also added cdns_read_ping_status() to get the Peripherals status at the beginning of intel_start_bus_after_reset() and I didn't meet any case that the Peripheral is unattached when ‵bus_reset = 0
Please see my diff and dmesg below.
bard.diff.txt
dmesg_arl-s.txt
For the issue, maybe you can try forcing bus_reset = 1 when any Peripheral is unattached?
I was trying to setup the LNL RVP to see if I could reproduce it with the Realtek codecs. But something is very wrong with the RVP. Any ideas?
This is 6.16-rc7.
It's loading the topology ok:
[ 9.764346] acpi device:28: find_sdca_function: SDCA function UAJ (type 6) at 0x1
[ 9.764353] acpi device:29: find_sdca_function: SDCA function HID (type 10) at 0x3
[ 9.764915] acpi device:2f: find_sdca_function: SDCA function SmartMic (type 3) at 0x2
[ 9.765428] acpi device:2d: find_sdca_function: SDCA function SmartAmp (type 1) at 0x4
[ 9.765917] acpi device:2b: find_sdca_function: SDCA function SmartAmp (type 1) at 0x4
[ 9.778217] sof-audio-pci-intel-lnl 0000:00:1f.3: no hda codecs found!
[ 9.778222] sof-audio-pci-intel-lnl 0000:00:1f.3: NHLT device BT(0) detected, ssp_mask 0x4
[ 9.778224] sof-audio-pci-intel-lnl 0000:00:1f.3: BT link detected in NHLT tables: 0x4
[ 9.778226] sof-audio-pci-intel-lnl 0000:00:1f.3: DMICs detected in NHLT tables: 0
[ 9.785514] sof-audio-pci-intel-lnl 0000:00:1f.3: Firmware paths/files for ipc type 1:
[ 9.785520] sof-audio-pci-intel-lnl 0000:00:1f.3: Firmware file: intel/sof-ipc4/lnl/sof-lnl.ri
[ 9.785522] sof-audio-pci-intel-lnl 0000:00:1f.3: Firmware lib path: intel/sof-ipc4-lib/lnl
[ 9.785524] sof-audio-pci-intel-lnl 0000:00:1f.3: Topology file: intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg
[ 9.787232] sof-audio-pci-intel-lnl 0000:00:1f.3: Loaded firmware library: ADSPFW, version: 2.12.0.1
[ 10.062310] sof-audio-pci-intel-lnl 0000:00:1f.3: Booted firmware version: 2.12.0.1
[ 10.089490] sof-audio-pci-intel-lnl 0000:00:1f.3: Topology: ABI 3:29:1 Kernel ABI 3:23:1
[ 10.095006] sof_sdw sof_sdw: ASoC: Parent card not yet available, widget card binding deferred
[ 10.097118] sof-audio-pci-intel-lnl 0000:00:1f.3: Loaded firmware library: ADSPFW, version: 2.12.0.1
but after that it is very unhappy:
[ 23.299445] rt1316-sdca sdw:0:2:025d:1316:01: DPN_PortCtrl register write failed for port 1
[ 23.299473] soundwire sdw-master-0-2: Program transport params failed: -61
[ 23.299476] soundwire sdw-master-0-2: Program params failed: -61
[ 23.802298] rt715-sdca sdw:0:1:025d:0714:01: DPN_PortCtrl register write failed for port 4
[ 23.802344] soundwire sdw-master-0-1: Program transport params failed: -61
[ 23.802350] soundwire sdw-master-0-1: Program params failed: -61
Any suggestions?
Ah, I think the BIOS ACPI in this RVP doesn't match the actual hardware. It seems to only have one RT711.
@rfvirgil You can disable the SDW links other than link 0 in the BIOS. Then you suppose to have only rt711 in the ACPI table.
For my education @rfvirgil why would you want to restart the clock without a bus reset?
On Intel devices, when you stop the clock power savings will only be reached if the SoundWire IP is power gated, and that means all context is lost. It's really not advised to use the 'regular' clock stop which is known to be problematic.
@plbossart Well, it's the Intel driver that makes the decision to restart without a bus reset. This behavior was introduced by this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/soundwire/intel.c?id=08abad9f45f105e947d91822bdbc47650fbd7773
Draft fix: https://github.com/thesofproject/linux/pull/5591
Unfortunately the draft fix doesn't work. It's apparently not as simple as reverting the patch that bypassed bus_reset. If the link did not power-down it now seems to be in a limbo state where the normal power-up-and-bus-reset code won't work, but we still need to send a bus-reset..
kern :err : [ 51.597221] soundwire_intel soundwire_intel.link.0: Couldn't exit from clock stop
kern :err : [ 51.597365] soundwire_intel soundwire_intel.link.0: unable to restart clock during resume
kern :err : [ 51.597385] soundwire_intel soundwire_intel.link.0: intel_resume_runtime: cannot start bus after reset: -110
@plbossart Well, it's the Intel driver that makes the decision to restart without a bus reset. This behavior was introduced by this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/soundwire/intel.c?id=08abad9f45f105e947d91822bdbc47650fbd7773
Sorry I missed the reply. Note that this should only apply to devices before LNL, and possibly before MTL. in LNL+ devices this behavior should not exist with the entire power structure revisited.