sof [BUG] pause resume PID **** had non-zero exit status when checking multiple-pause/resume

Describe the bug This issue happens when checking the multiple-pause/resume case, it shows "pause resume PID **** had non-zero exit status" in console log, but there's no error in kernel log or mtrace. The reproduction rate is low about 20%. This issue can be reproduced on both MTL and LNL NOCODEC platforms.

Inner test ID: LNLM_RVP_NOCODEC: 37146 MTLP_RVP_NOCODEC: 37016

To Reproduce ~/sof-test/test-case/multiple-pause-resume.sh -r 50

Reproduction Rate 20%

Environment

Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
- Kernel: {0ba5acc6d6e4}
- SOF: {16b53b423c75}
Name of the topology file
- Topology: {sof-lnl-nocodec.tplg and sof-mtl-nocodec.tplg}
Name of the platform(s) on which the bug is observed.
- Platform: {LNL-NOCODEC and MTL-NOCODEC}

Screenshots or console output

=== PAUSE ===                                                            
(50/50) pcm'Port1' cmd'aplay' id'9': Wait for 32 ms before resume

2024-01-21 23:46:26 UTC [REMOTE_INFO] pipeline: Deepbuffer Port0 with aplay
2024-01-21 23:46:26 UTC [REMOTE_INFO] pipeline: Port1 with aplay
2024-01-21 23:46:26 UTC [REMOTE_INFO] Check expect exit status
declare -- cmd="journalctl_cmd --since=@1705880315"
2024-01-21 23:46:26 UTC [REMOTE_ERROR] pause resume PID 65117 had non-zero exit status
2024-01-21 23:46:26 UTC [REMOTE_INFO] Starting func_exit_handler(1)

dmesg.txt

mtrace.txt

Jan 23 '24 02:01 keqiaozhang

https://sof-ci.01.org/sofpr/PR8759/build2309/devicetest/index.html?model=TGLU_RVP_NOCODEC-ipc4&testcase=multiple-pause-resume-50

https://sof-ci.01.org/linuxpr/PR4793/build1164/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50

https://sof-ci.01.org/sofpr/PR8803/build2374/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50

https://sof-ci.01.org/sofpr/PR8850/build2646/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50

https://sof-ci.01.org/sofpr/PR8853/build2655/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50

Jan 25 '24 20:01 marc-hb

@serhiy-katsyuba-intel @abonislawski @lyakh @teburd I think I found the reason why the DMA stops moving. The GEN bit is not set for some reason. When the problem occurs, I can see that bit26 of DGCS is not set (added debug to call the existing intel_adsp_hda_dbg() macro on error:

[16060.576496] <inf> ipc: ipc_cmd: rx   : 0x13020004|0x0
[16060.576601] <inf> copier: copier_prepare: comp:2 0x20004 copier_prepare()
[16060.576613] <inf> pipe: pipeline_trigger: pipe:2 0x0 pipe trigger cmd 7
[16060.576630] <inf> ll_schedule: zephyr_ll_task_schedule_common: task add 0xa0109a80 0x20210U priority 0 flags 0x0
WARNING: GEN bit not set after enable!
ZEPHYR_BASE/drivers/dma/dma_intel_adsp_hda.c:232 hda not move(1:0x0x72840), dgcs: 0x80810000, dgbba 0x109c80, dgbs 768, dgbrp 0, dgbwp 0, dgbsp 0, dgmbs 768, dgbllpi 0x0, dglpibi 0x0
[16060.578693] <inf> host_comp: host_get_copy_bytes_normal: comp:2 0x20004 no bytes to copy, available samples: 0, free_samples: 192

So "dgcs: 0x80810000" is wrong and explains why data is not moving and we get flood of "no bytes to copy" messages. Without GEN set, the DMA cannot write to DSP memory.

It seems with multiple cores using the host DMA engine, at least writes to DGCS are not immediately reflected in the read back values. Based on this, I made a few fix attempt pushed at https://github.com/kv2019i/zephyr/commits/202402-hdadma-gen-bit-fix-attempts but these don't completely solve the problem.

Even if I added a 10ms busyloop to poll for GEN bit to be set, it is read back as 0, so something is blocking the setting.

Feb 02 '24 23:02 kv2019i

This will be fixed by https://github.com/zephyrproject-rtos/zephyr/pull/69480 . Assigning to @serhiy-katsyuba-intel for follow-up as you are the fix author. Let's not close this SOF issue until we have the fix integrated to SOF main.

Mar 06 '24 09:03 kv2019i

Closing as fix merged. Please reopen if seen again.

Apr 02 '24 11:04 kv2019i