sof
sof copied to clipboard
[BUG] pause resume PID **** had non-zero exit status when checking multiple-pause/resume
Describe the bug This issue happens when checking the multiple-pause/resume case, it shows "pause resume PID **** had non-zero exit status" in console log, but there's no error in kernel log or mtrace. The reproduction rate is low about 20%. This issue can be reproduced on both MTL and LNL NOCODEC platforms.
Inner test ID: LNLM_RVP_NOCODEC: 37146 MTLP_RVP_NOCODEC: 37016
To Reproduce ~/sof-test/test-case/multiple-pause-resume.sh -r 50
Reproduction Rate 20%
Environment
- Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
- Kernel: {0ba5acc6d6e4}
- SOF: {16b53b423c75}
- Name of the topology file
- Topology: {sof-lnl-nocodec.tplg and sof-mtl-nocodec.tplg}
- Name of the platform(s) on which the bug is observed.
- Platform: {LNL-NOCODEC and MTL-NOCODEC}
Screenshots or console output
=== PAUSE ===
(50/50) pcm'Port1' cmd'aplay' id'9': Wait for 32 ms before resume
2024-01-21 23:46:26 UTC [REMOTE_INFO] pipeline: Deepbuffer Port0 with aplay
2024-01-21 23:46:26 UTC [REMOTE_INFO] pipeline: Port1 with aplay
2024-01-21 23:46:26 UTC [REMOTE_INFO] Check expect exit status
declare -- cmd="journalctl_cmd --since=@1705880315"
2024-01-21 23:46:26 UTC [REMOTE_ERROR] pause resume PID 65117 had non-zero exit status
2024-01-21 23:46:26 UTC [REMOTE_INFO] Starting func_exit_handler(1)
https://sof-ci.01.org/sofpr/PR8759/build2309/devicetest/index.html?model=TGLU_RVP_NOCODEC-ipc4&testcase=multiple-pause-resume-50
https://sof-ci.01.org/linuxpr/PR4793/build1164/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50
https://sof-ci.01.org/sofpr/PR8803/build2374/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50
https://sof-ci.01.org/sofpr/PR8850/build2646/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50
https://sof-ci.01.org/sofpr/PR8853/build2655/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50
@serhiy-katsyuba-intel @abonislawski @lyakh @teburd I think I found the reason why the DMA stops moving. The GEN bit is not set for some reason. When the problem occurs, I can see that bit26 of DGCS is not set (added debug to call the existing intel_adsp_hda_dbg() macro on error:
[16060.576496] <inf> ipc: ipc_cmd: rx : 0x13020004|0x0
[16060.576601] <inf> copier: copier_prepare: comp:2 0x20004 copier_prepare()
[16060.576613] <inf> pipe: pipeline_trigger: pipe:2 0x0 pipe trigger cmd 7
[16060.576630] <inf> ll_schedule: zephyr_ll_task_schedule_common: task add 0xa0109a80 0x20210U priority 0 flags 0x0
WARNING: GEN bit not set after enable!
ZEPHYR_BASE/drivers/dma/dma_intel_adsp_hda.c:232 hda not move(1:0x0x72840), dgcs: 0x80810000, dgbba 0x109c80, dgbs 768, dgbrp 0, dgbwp 0, dgbsp 0, dgmbs 768, dgbllpi 0x0, dglpibi 0x0
[16060.578693] <inf> host_comp: host_get_copy_bytes_normal: comp:2 0x20004 no bytes to copy, available samples: 0, free_samples: 192
So "dgcs: 0x80810000" is wrong and explains why data is not moving and we get flood of "no bytes to copy" messages. Without GEN set, the DMA cannot write to DSP memory.
It seems with multiple cores using the host DMA engine, at least writes to DGCS are not immediately reflected in the read back values. Based on this, I made a few fix attempt pushed at https://github.com/kv2019i/zephyr/commits/202402-hdadma-gen-bit-fix-attempts but these don't completely solve the problem.
Even if I added a 10ms busyloop to poll for GEN bit to be set, it is read back as 0, so something is blocking the setting.
This will be fixed by https://github.com/zephyrproject-rtos/zephyr/pull/69480 . Assigning to @serhiy-katsyuba-intel for follow-up as you are the fix author. Let's not close this SOF issue until we have the fix integrated to SOF main.
Closing as fix merged. Please reopen if seen again.