linux [LNL]Suspend resume with playback or capture will Fail or TIMEOUT

Suspend resume with playback or capture will Fail or TIMEOUT

Steps to reproduce

Use NOCODEC/HDA/SDW issue sporadically observed on all devices. below cmd are for nocodec
TPLG=/lib/firmware/intel/development/sof-lnl-nocodec.tplg MODEL=LNLM_RVP_NOCODEC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m playback or TPLG=/lib/firmware/intel/development/sof-lnl-nocodec.tplg MODEL=LNLM_RVP_NOCODEC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture

The test will either fail or hang

Logs console_logs.txt dmesg_logs.txt

Configs Linux Commit: d25788054d59 KConfig Branch: master KConfig Commit: c3171afedc63 SOF Branch: stable-v2.10 SOF Commit: b15f1f1a3238

cc:

#5109

Jun 24 '24 09:06 ssavati

Failure logs are a bit unclear w.r.t. where things start to go south. The last messages to FW seem sane, cores are powered down, finally SET_DX is sent without errors. But then there's a generic kernel pm -EBUSY error for late_suspend (one example, Intel test run 43010):

[  408.720709] kernel: snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-lnl 0000:00:1f.3: Current DSP power state: D3
[  408.720720] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 7 -> 0
[  412.467352] kernel: PM: late suspend of devices failed

Running more tests to get insights on the failures.

Jun 24 '24 16:06 kv2019i

Digging further into scenario #5085 addresses, it seems following combination of events happens:

deep buffer PCM enters D0i3x state
xrun happens just at suspend, so arecord is doing xrun handling (PCM is stopped, prepared and started again after resume)
as the PCM is in stopped (due to the xrun), SNDRV_PCM_TRIGGER_SUSPEND is not sent to DAI, and the NULL deref can be hit

Example log

[  882.643397] DEBUG: cmd=1 dai SSP0 Pin direction 0
# KV: deep buffer PCM31 started
[  882.648298] snd_sof:sof_pcm_trigger: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: trigger stream 31 dir 0 cmd 1
[  882.648307] snd_sof:sof_ipc4_trigger_pipelines: sof-audio-pci-intel-lnl 0000:00:1f.3: trigger cmd: 1 state: 4
[  882.648315] snd_sof:sof_ipc4_set_pipeline_state: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc4 set pipeline instance 0 state 3
[  882.648322] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x13000003|0x0: GLB_SET_PIPELINE_STATE
[  882.648759] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x33000000|0x0: GLB_SET_PIPELINE_STATE
[  882.648788] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x13000003|0x0: GLB_SET_PIPELINE_STATE
[  882.648795] snd_sof:sof_ipc4_set_pipeline_state: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc4 set pipeline instance 0 state 4
[  882.648800] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x13000004|0x0: GLB_SET_PIPELINE_STATE
[  882.649963] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x33000000|0x0: GLB_SET_PIPELINE_STATE
[  882.649985] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x13000004|0x0: GLB_SET_PIPELINE_STATE
[  882.649997] snd_sof_intel_hda_common:hda_dsp_stream_trigger: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1e0]=0x2014001e successful
[  884.689120] asix 3-3:1.0 enx34298f712491: Link is Up - 100Mbps/Full - flow control rx/tx
# KV: opportunistic D0iX is attempt at 887sec
[  887.690497] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x48000000|0x12: MOD_SET_D0IX
[  887.691066] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x68000000|0x12: MOD_SET_D0IX
[  887.691109] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x48000000|0x12: MOD_SET_D0IX
[  887.691117] snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-lnl 0000:00:1f.3: Current DSP power state: D0I3
# KV: suspend process is started soon after
[  887.983323] PM: suspend entry (s2idle)
[  887.985028] Filesystems sync: 0.001 seconds
# KV: user-space see xruns and stops the PCM as part of xrun handlng, user-space not yet frozen
[  888.121816] snd_sof:sof_pcm_trigger: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: trigger stream 31 dir 0 cmd 0
[  888.121825] snd_sof:sof_ipc4_trigger_pipelines: sof-audio-pci-intel-lnl 0000:00:1f.3: trigger cmd: 0 state: 3
[  888.121830] snd_sof:sof_ipc4_set_pipeline_state: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc4 set pipeline instance 0 state 3
[  888.121915] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x48000000|0x4: MOD_SET_D0IX
[  888.122275] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x68000000|0x4: MOD_SET_D0IX
[  888.122288] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x48000000|0x4: MOD_SET_D0IX
[  888.122290] snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-lnl 0000:00:1f.3: Current DSP power state: D0I0
[...]
[  889.222280] snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.2 freed
[  889.222285] snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget dai-copier.SSP.NoCodec-0.playback freed
# KV: much later hda_dai_suspend() is called with hext_stream non-NULL and this will kill the kernel
[  889.262695] DEBUG: dai suspend 13a0f000

PR #5085 will avoid the kernel crash, but the xrun handling concurrently with system suspend is not successful and the streaming later terminates to:

[  893.639609] snd_sof_intel_hda_common:hda_dai_trigger: sof-audio-pci-intel-lnl 0000:00:1f.3: cmd=1 dai SSP0 Pin direction 0
[  893.639621] sof-audio-pci-intel-lnl 0000:00:1f.3: ASoC: error at soc_dai_trigger on SSP0 Pin: -22
[  893.648704]  Deepbuffer Port0: ASoC: trigger FE cmd: 1 failed: -22

Jun 26 '24 20:06 kv2019i

wow, that's quite a corner case: an xrun while suspend, no one saw that coming.

the part that I am not following is that the xrun seems to be reproducible, which suggests some sort of D0i3 problem in the first place.

Also if the PCM device is stopped before a system suspend, it's not clear to me what userspace will do on resume? Would it go on with a prepare/trigger? or would there be an additional stage of hw_free/hw_params?

Jun 27 '24 06:06 plbossart

@plbossart I'll try to see if this is completely reproducible, but indeed the xrun seems to start the chain of failures. The user-space will continue xrun handling as soon as resume is complete:

# PCM31 is prepared just before user-space is suspended
[  889.160611] snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.2 freed
[  889.160621] snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget dai-copier.SSP.NoCodec-0.playback freed
[  889.160638] snd_sof:sof_pcm_prepare: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: prepare stream 31 dir 0
[  889.160641] snd_sof:sof_pcm_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: hw params stream 31 dir 0
# then we enter suspend
[  889.311599] ACPI: EC: interrupt blocked
[  893.102613] ACPI: EC: interrupt unblocked
# resume starts, DSP is booted
[  893.207454] snd_sof:snd_sof_run_firmware: sof-audio-pci-intel-lnl 0000:00:1f.3: firmware boot complete
[  893.207465] snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 6 -> 7
# system resume completes in kernel and resumes user-space
[  893.616716] OOM killer enabled.
[  893.616721] Restarting tasks ... 
# upon resume prepare is called AGAIN on the PCM31 (this is normal after system suspend)
[  893.618603] snd_sof:sof_pcm_prepare: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: prepare stream 31 dir 0
[  893.618613] snd_sof:sof_pcm_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: hw params stream 31 dir 0
# to my eyes, the DSP pipeline setup goes just fine, no errors from the IPCs
# then FE PCM START trigger fals and I see no errors from SOF
[  893.639609] snd_sof_intel_hda_common:hda_dai_trigger: sof-audio-pci-intel-lnl 0000:00:1f.3: cmd=1 dai SSP0 Pin direction 0
[  893.639621] sof-audio-pci-intel-lnl 0000:00:1f.3: ASoC: error at soc_dai_trigger on SSP0 Pin: -22
[  893.648704]  Deepbuffer Port0: ASoC: trigger FE cmd: 1 failed: -22
# then we just see user-space tearing down the PCM

In user-space console, this shows as:

rtcwake: wakeup from "mem" using /dev/rtc0 at Wed Jun 26 19:20:20 2024
underrun!!! (at least 1028.938 ms long)
                                       2024-06-26 19:20:20 UTC Sub-Test: [COMMAND] sleep for 5
aplay: pcm_write:2127: write error: Input/output error

So opens are at least why the xrun occurs in the first place (deep buffer, D0i3 mode entry, ...), and second is the why the prepare is called twice and is the flow correct here (I have a hunch the double-prepare confuses the driver state somewhere and thus we get the -22 from FE trigger on the resume).

Jun 27 '24 07:06 kv2019i

Some refinements:

hit the condition (=suspend with substream present) with D0ix disabled (streams marked not compatible with D0ix)
hit the condition once without a preceding xrun

For the latter, I can't really explain why this happens yet. There's a trigger to stop the PCM right after suspend starts but nothing printed out by aplay so this is coming from ALSA core:

[47625.798898] PM: suspend entry (s2idle)
[47625.800453] Filesystems sync: 0.001 seconds
[47626.294808] snd_sof:sof_pcm_trigger: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: trigger stream 0 dir 0 cmd 0

Jun 27 '24 11:06 kv2019i

It seems the failure sequence is related to a system hickup in system process. Whenever SOF see unexpected sequence (no SUSPEND action for the DAI), there is unexpected 2sec gap in the system suspend process. This often generates an xrun, but exact sequence depends on the length of the gap:

# success case, completes within 10ms
[69489.712206] Filesystems sync: 0.001 seconds
[69489.717798] Freezing user space processes
[69489.719699] Freezing user space processes completed (elapsed 0.001 seconds)
# success case:
[69500.740018] Filesystems sync: 0.001 seconds
[69500.745193] Freezing user space processes
[69500.747031] Freezing user space processes completed (elapsed 0.001 seconds)
# success case:
[69511.757591] Filesystems sync: 0.001 seconds
[69511.761816] Freezing user space processes
[69511.763580] Freezing user space processes completed (elapsed 0.001 seconds)
# FAILURE case (without #5085 , kernel would crash here)
[69522.814773] Filesystems sync: 0.001 seconds
# note the 2second gap! user-space will stop the PCM after xrun and this is done before user-space is frozen
[69524.304214] Freezing user space processes
[69524.305612] Freezing user space processes completed (elapsed 0.001 seconds)

Jun 27 '24 14:06 kv2019i

With the kernel fix in place, we now see similar occurences in CI with suspend taking longer than expected and following error in log:

kernel:

[  334.703012] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x13000003|0x0: GLB_SET_PIPELINE_STATE
[  335.206800] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc timed out for 0x13000003|0x0
[... kernel error messages]
[  335.207407] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x46000002|0x3: MOD_UNBIND

FW:

[  334.702861] <inf> ipc: ipc_cmd: rx	: 0x13000003|0x0
[  334.702875] <inf> pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 2
[  334.702953] <inf> ll_schedule: zephyr_ll_task_done: task complete 0xa0119e00 0x20210U
[  334.702956] <inf> ll_schedule: zephyr_ll_task_done: num_tasks 2 total_num_tasks 2
[  335.207238] <inf> ipc: ipc_cmd: rx	: 0x46000002|0x3

So the SET_PIPELNE response is not seen by host. This only happens when the system suspend process has the unusual delay, so something unusual going on on the host system side. I'll keep the bug on driver side until we get more insight on what happens with the suspend.

Jun 28 '24 10:06 kv2019i

So the SET_PIPELNE response is not seen by host.

Is it possible that FW dead before it sent the response? Like, it dead when it process the MOD_UNBIND command.

Jun 28 '24 11:06 bardliao

@ssavati are you able to rerun test now this PR above has merged, if all good pls close. Thanks !

Jul 10 '24 10:07 lgirdwood

@ssavati are you able to rerun test now this PR above has merged, if all good pls close. Thanks !

Sure will run some functional and stress testcase related to Suspend resume

Jul 10 '24 10:07 ssavati

@lgirdwood still observed failures. for more visit CI #/result/planresultdetail/43738

Jul 11 '24 11:07 ssavati

Ack @ssavati @lgirdwood I think this is expected. The kernel PR fixes the kernel crash, but there are still failure. The whole system doesn't hit panic anymore, but we hit a failure nevertheless (see comment https://github.com/thesofproject/linux/issues/5080#issuecomment-2194909579 ).

Jul 11 '24 12:07 kv2019i

@kv2019i @ssavati what's the repro rate of the suspend delay failure ? The 2 second delay looks like we are blocked on pending IO completing. PCM core will copy to DMA buffers, but this may atomically block on audio HW ptr being updated too ?

Jul 12 '24 15:07 lgirdwood

Passing to you for check @ranj063 . One case seen in daily run (43842), not seen in the previous daily.

Jul 16 '24 06:07 kv2019i

@kv2019i I see this in the latest failure and another one a few days ago. after the previous successful D3 entry

[ 7868.464052] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: Debug PCIW: 00000010 at  00000044
[ 7868.465195] kernel: snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-lnl 0000:00:1f.3: Current DSP power state: D3
[ 7868.465202] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 7 -> 0
[ 7872.220211] kernel: PM: late suspend of devices failed

I wonder whats causing the late suspend to fail

Jul 16 '24 17:07 ranj063

Spotted this test failure in today's run, not sure whether it's related

43923?model=LNLM_SDW_AIOC&testcase=check-suspend-resume-with-capture-5

[ 4823.610270] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x33000000|0x0: GLB_SET_PIPELINE_STATE
[ 4823.610535] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x13010004|0x0: GLB_SET_PIPELINE_STATE
[ 4823.611330] kernel: snd_sof_intel_hda_common:hda_dsp_stream_trigger: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x14001e successful
[ 4823.614095] kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:2:025d:1316:01: rt1316_io_init hw_init complete
[ 4823.614185] kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling initialization completion for Slave 1
[ 4823.614899] kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:3:025d:1316:01: rt1316_io_init hw_init complete
[ 4823.614913] kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling initialization completion for Slave 1
[ 4823.615022] kernel: snd_sof_intel_hda_common:hda_dai_trigger: soundwire_intel soundwire_intel.link.0: cmd=1 dai SDW0 Pin3 direction 1
[ 4823.615050] kernel: soundwire_intel soundwire_intel.link.0: ASoC: error at soc_dai_trigger on SDW0 Pin3: -22
[ 4823.617591] kernel:  Jack In: ASoC: trigger FE cmd: 1 failed: -22
[ 4823.617767] kernel: snd_sof:sof_pcm_trigger: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: trigger stream 1 dir 1 cmd 0
[ 4823.617787] kernel: snd_sof:sof_ipc4_trigger_pipelines: sof-audio-pci-intel-lnl 0000:00:1f.3: trigger cmd: 0 state: 3
[ 4823.617804] kernel: snd_sof:sof_ipc4_set_pipeline_state: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc4 set pipeline instance 1 state 3
[ 4823.617821] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x13010003|0x0: GLB_SET_PIPELINE_STATE
[ 4823.619171] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x33000000|0x0: GLB_SET_PIPELINE_STATE
[ 4823.619189] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x13010003|0x0:

Jul 16 '24 20:07 marc-hb

Spotted this test failure in today's run, not sure whether it's related

I think we've got 2 different types of failures. But the trigger failure seems to be with suspend-resume-with-capture. I'm looking into this

Jul 16 '24 21:07 ranj063

Spotted this test failure in today's run, not sure whether it's related

43923?model=LNLM_SDW_AIOC&testcase=check-suspend-resume-with-capture-5

[ 4823.610270] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x33000000|0x0: GLB_SET_PIPELINE_STATE
[ 4823.610535] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x13010004|0x0: GLB_SET_PIPELINE_STATE
[ 4823.611330] kernel: snd_sof_intel_hda_common:hda_dsp_stream_trigger: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x14001e successful
[ 4823.614095] kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:2:025d:1316:01: rt1316_io_init hw_init complete
[ 4823.614185] kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling initialization completion for Slave 1
[ 4823.614899] kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:3:025d:1316:01: rt1316_io_init hw_init complete
[ 4823.614913] kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling initialization completion for Slave 1
[ 4823.615022] kernel: snd_sof_intel_hda_common:hda_dai_trigger: soundwire_intel soundwire_intel.link.0: cmd=1 dai SDW0 Pin3 direction 1
[ 4823.615050] kernel: soundwire_intel soundwire_intel.link.0: ASoC: error at soc_dai_trigger on SDW0 Pin3: -22
[ 4823.617591] kernel:  Jack In: ASoC: trigger FE cmd: 1 failed: -22
[ 4823.617767] kernel: snd_sof:sof_pcm_trigger: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: trigger stream 1 dir 1 cmd 0
[ 4823.617787] kernel: snd_sof:sof_ipc4_trigger_pipelines: sof-audio-pci-intel-lnl 0000:00:1f.3: trigger cmd: 0 state: 3
[ 4823.617804] kernel: snd_sof:sof_ipc4_set_pipeline_state: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc4 set pipeline instance 1 state 3
[ 4823.617821] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x13010003|0x0: GLB_SET_PIPELINE_STATE
[ 4823.619171] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x33000000|0x0: GLB_SET_PIPELINE_STATE
[ 4823.619189] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x13010003|0x0:

ok I now know what's causing the trigger failure after resume. So in the passing case, all pipelines (FE & BE) are started before the system suspend is initiated. In this case, we receive the suspend trigger for both the FE and BE DAI and all pipelines are suspended. And during resume, we receive the prepare callback for both the FE and BE DAI's and things are normal.

But in the failing case, what happens in the nocodec case is that the system suspend is initiated as soon the FE DAI hw_params is done but before the FE/BE DAI trigger is called and pipelines are started. Since the trigger start never happened, we never get the suspend trigger either. And during system resume, somehow we receive the prepare callback only for the FE DAI and not for the BE DAI and this is what causes the failure when the BE pipelines are triggered during BE DAI trigger. So the question is if the BE DAI suspend trigger was never called, why is the hext_stream NULL during resume. What causes it to be freed? And why dont we receive the BE DAI prepare callback during resume in this case? Still trying to figure out the answers for these questions

Jul 17 '24 02:07 ranj063

So here's my theory on whats happening. So when the system suspend happens after the bE DAI prepare but before BE DAI trigger, the BE DAI state is PREPARE and the suspend trigger does nothing. When the system resumes, the BE DAI prepare also does nothing also because the state is already PREPARE.

One way to fix this is set the dpcm state of the BE DAI to suspended even if it were is never started but in prepare state. I'm testing this right now.

Jul 17 '24 04:07 ranj063

@ranj063 this looks like a gap in the DPCM state machines

		case SNDRV_PCM_TRIGGER_SUSPEND:
			if (be->dpcm[stream].state != SND_SOC_DPCM_STATE_START)
				goto next;

			be->dpcm[stream].be_start--;
			if (be->dpcm[stream].be_start != 0)
				goto next;

			ret = soc_pcm_trigger(be_substream, cmd);
			if (ret) {
				be->dpcm[stream].be_start++;
				goto next;
			}

			be->dpcm[stream].state = SND_SOC_DPCM_STATE_SUSPEND;
			break;

the if() test will not handle any case where the stream is not started.

Jul 17 '24 08:07 plbossart

@plbossart @ranj063 It seems we call assign_hext_stream() and release_hext_stream() unbalanced. hda_link_dma_hw_params() -> assign_hext_stream() is called in hda_dai_hw_params_data(). However, hda_link_dma_cleanup() -> release_hext_stream() is called in hda_dai_hw_free(), hda_dai_suspend(), and SNDRV_PCM_TRIGGER_SUSPEND. Will it cause any issue?

Jul 17 '24 09:07 bardliao

I am not sure if it's 'unbalanced'.

The hda_link_dma_hw_params() is called in multiple cases, for a regular hw_params but also for a prepare which can be a follow-up to the hw_params, or a way to deal with xruns or a resume.

Likewise we have multiple exit paths, this can be a regular stop/hw_free but we can also have the suspend-while-paused and the suspend-while-running case.

Jul 17 '24 10:07 plbossart

@ranj063 I also wonder if the call to snd_sof_dsp_hw_params_upon_resume() will not cleanup everything, even if the TRIGGER_SUSPEND does nothing.

Jul 17 '24 11:07 plbossart

@ranj063 I also wonder if the call to snd_sof_dsp_hw_params_upon_resume() will not cleanup everything, even if the TRIGGER_SUSPEND does nothing.

@plbossart i am still trying to confirm if this cleans up the dma data or not. My suspicion is that it ends up releasing the hext_stream because we do not check for the dpcm state here. Maybe one possibility is to restrict it for paused state only. But then I'm not sure if its a good thing to leave a stream in the PREPARE state during suspend.

Jul 17 '24 15:07 ranj063

Today's failure in case that's useful: 4012?model=LNLM_SDW_AIOC&testcase=check-suspend-resume-with-playback-5

Jul 18 '24 16:07 marc-hb

@ssavati a fix for this issue has been merged to sof-dev. Please let us know if the issue can be closed now.

Jul 19 '24 16:07 ranj063

Daily 44091?model=LNLM_RVP_NOCODEC&testcase=check-suspend-resume-with-playback-5 still failed with some aplay: pcm_write:2127: write error: Input/output error error. It ran with daily commit 0dff6ac871fa which includes what I guess is the tentative fix d8347dd9bd42

BTW:

https://github.com/thesofproject/sof/issues/8761

Jul 19 '24 18:07 marc-hb

Daily 44091?model=LNLM_RVP_NOCODEC&testcase=check-suspend-resume-with-playback-5 still failed with some aplay: pcm_write:2127: write error: Input/output error error. It ran with daily commit 0dff6ac which includes what I guess is the tentative fix d8347dd

ok so the kernel fix is for handling the trigger failure error but now we've got a different problem with an underrun occurring in the DAI in the firmware after resuming

[ 7846.830920] mixin_mixout: mixin_prepare: comp:0 0x2 mixin_prepare() [ 7846.830933] pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 7 [ 7846.830950] ll_schedule: zephyr_ll_task_schedule_common: task add 0xa0118740 0xa00ebf54U priority 0 flags 0x0 [ 7846.831800] dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy [ 7846.831865] host_comp: host_get_copy_bytes_normal: comp:0 0x4 no bytes to copy, available samples: 0, free_samples: 192 [ 7846.832793] dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy [ 7846.833793] dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy [ 7846.834795] dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy [ 7846.835795] dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy [ 7846.836793] dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy [ 7846.837795] dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy [ 7846.837805] host_comp: host_get_copy_bytes_normal: comp:0 0x4 no bytes to copy, available samples: 384, free_samples: 0 [ 7846.838795] dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy [ 7846.838815] host_comp: host_get_copy_bytes_normal: comp:0 0x4 no bytes to copy, available samples: 384, free_samples: 0 [ 7846.839793] dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy

Jul 19 '24 19:07 ranj063

This sounds like... progress?

Do you want to file a new bug or keep using this one?

Jul 19 '24 22:07 marc-hb

This sounds like... progress?

Do you want to file a new bug or keep using this one?

I'm thinking this might be a firmware bug at this point. Let me analyze it a bit more before we decide to create a SOF bug

Jul 19 '24 22:07 ranj063