[BUG] multiple-pipeline-capture-50 fails on LNL
Describe the bug
This test fails in the Intel daily plan, observed on 16th Oct (UPDATE: first fail seen in the daily test of 11th Oct, Intel plan 46649; plan 46596 from 10th Oct was still good).
To Reproduce
~/sof-test/test-case/multiple-pipeline.sh -f c -l 50
Reproduction Rate
high (every daily run)
Expected behavior
test passes
Impact
Environment
Linux Branch: topic/sof-dev
Linux Commit: a8cc01c750c1
KConfig Branch: master
KConfig Commit: 097c69ae4b81
SOF Branch: main
SOF Commit: 63f18c0f12db
Zephyr Commit: 155f3f3ba688
- Topology: sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg
- Platform: lnl
Screenshots or console output
2024-10-16 14:08:54 UTC [REMOTE_INFO] ===== Testing: (Loop: 10/50) =====
2024-10-16 14:08:54 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg to run the test case
2024-10-16 14:08:54 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-10-16 14:08:54 UTC [REMOTE_INFO] Run command to get pipeline parameters
2024-10-16 14:08:54 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg -f 'type:capture & ~pcm:Amplifier Reference & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-10-16 14:08:54 UTC [REMOTE_INFO] Testing: Amp feedback [hw:0,3]
2024-10-16 14:08:54 UTC [REMOTE_COMMAND] arecord -D hw:0,3 -c 4 -r 48000 -f S16_LE /dev/null -q
2024-10-16 14:08:54 UTC [REMOTE_INFO] Testing: Microphone [hw:0,4]
2024-10-16 14:08:54 UTC [REMOTE_COMMAND] arecord -D hw:0,4 -c 2 -r 48000 -f S16_LE /dev/null -q
2024-10-16 14:08:54 UTC [REMOTE_INFO] Testing: Jack In [hw:0,1]
2024-10-16 14:08:54 UTC [REMOTE_INFO] wait 0.5s for aplay_opts()
2024-10-16 14:08:54 UTC [REMOTE_COMMAND] arecord -D hw:0,1 -c 2 -r 48000 -f S16_LE /dev/null -q
2024-10-16 14:08:54 UTC [REMOTE_INFO] checking pipeline status
2024-10-16 14:08:55 UTC [REMOTE_INFO] Letting playback/capture run for 5s
overrun!!! (at least 61.447 ms long)
arecord: pcm_read:2221: read error: Invalid argument
2024-10-16 14:09:00 UTC [REMOTE_INFO] checking pipeline status again
2024-10-16 14:09:00 UTC [REMOTE_ERROR] Running process count is 2, but 3 is expected
29988 arecord -D hw:0,3 -c 4 -r 48000 -f S16_LE /dev/null -q
30005 arecord -D hw:0,1 -c 2 -r 48000 -f S16_LE /dev/null -q
2024-10-16 14:09:00 UTC [REMOTE_INFO] Starting func_exit_handler(1)
2024-10-16 14:09:00 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2024-10-16 14:09:00 UTC [REMOTE_ERROR] func_error_exit() @ /home/ubuntu/sof-test/test-case/multiple-pipeline.sh
2024-10-16 14:09:00 UTC [REMOTE_ERROR] ps_checks() @ /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:148
2024-10-16 14:09:00 UTC [REMOTE_ERROR] main() @ /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:209
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 762: 30005 Killed arecord $SOF_ALSA_OPTS $SOF_ARECORD_OPTS "$@"
2024-10-16 14:09:00 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py
Changed the bug label. This is not seen in pull-request CI, but rather only in the Intel daily test.
> This is not seen in pull-request CI, but rather only in the Intel daily test.
It's both, PR example: https://sof-ci.01.org/softestpr/PR1236/build782/devicetest/index.html
This is clearly a recent regression that happened some time in early October. More info in (unrelated) https://github.com/thesofproject/sof/issues/9587#issuecomment-2417952744
(1) The problem is not a recent regression. I tried to bisect to find the bad commit and went all the way down to a commit merged on Aug 5 and could still reproduce the problem. It is just harder to reproduce: I run 100 test iterations instead of 50; sometimes the test succeeds even with 100 iterations, but running it again eventually fails. It seems the problem is easier to reproduce on recent builds, but it exists on builds much older than the October ones.
(2) When the problem happens, the log shows something like this:
[ 1198.015848] <dbg> dai_comp: dai_common_config_prepare: comp:4 0x40004 channel = 0
...
[ 1198.015883] <err> dai_comp: dai_common_config_prepare: comp:4 0x40004 dma_request_channel() failed
Gateway 0x40004 requests HDA link DMA channel 0. However, that same channel is still in use by gateway 0x4:
[ 1193.396756] <dbg> dai_comp: dai_common_config_prepare: comp:0 0x4 channel = 0
In the same log, on all earlier successful iterations of the test, gateway 0x40004 was always requesting DMA channel 3:
[ 1193.418940] <dbg> dai_comp: dai_common_config_prepare: comp:4 0x40004 channel = 3
On LNL the DMA channel is specified in the blob that comes from the driver. I added more logging to copier_alh_assign_dai_index(). It shows that on successful iterations of the test, gateway 0x40004 was assigned stream_id == 4, dai_index == 4 and DMA channel 3. On the failed iteration: stream_id == 1, dai_index == 1, DMA channel == 0.
That data comes in the blob from the driver. There is a small possibility the blob is somehow corrupted on the DSP side; however, I would suggest looking on the driver side to check whether a wrong stream_id can be assigned to the blob. Maybe there is a bug in the logic selecting the stream_id on the driver side?
Just in case: I see gateway 0x4 is actually a "multi-gateway", i.e. a single copier using two gateways internally. Those gateways use HDA link channels 0 and 1. The multi-gateway case is somewhat unusual; maybe the driver has a bug in properly handling the stream_id when multi-gateways are in use.
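For readers not familiar with the term, here is a minimal, self-contained sketch of what such a multi-gateway ALH blob roughly looks like. The struct layout, types and device id values are assumptions for illustration only (based on the alh_cfg.device_count and alh_cfg.mapping[i].device fields mentioned in this thread), not the real SOF or kernel definitions:

```c
/*
 * Hypothetical, simplified layout of the ALH blob for the "multi-gateway"
 * copier 0x4 described above: device_count > 1 with one mapping entry per
 * internal gateway. Per the logs, the two gateways end up on HDA link DMA
 * channels 0 and 1. Field names follow the thread; everything else is a
 * placeholder.
 */
#include <stdint.h>
#include <stdio.h>

struct alh_mapping { uint32_t device; };

struct alh_cfg {
	uint32_t device_count;          /* 1 = single gateway, >1 = multi-gateway */
	struct alh_mapping mapping[8];  /* per-gateway entries */
};

int main(void)
{
	struct alh_cfg cfg = {
		.device_count = 2,              /* two gateways behind one copier */
		.mapping = { { .device = 0x4 }, /* placeholder ids, not from the logs */
			     { .device = 0x5 } },
	};

	for (uint32_t i = 0; i < cfg.device_count; i++)
		printf("mapping[%u].device = 0x%x\n",
		       (unsigned)i, (unsigned)cfg.mapping[i].device);
	return 0;
}
```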
@ranj063 @bardliao @ujfalusi Can you take a look at the above from the driver perspective? Is multi-gateway usage with 0x4 expected?
> [ 1198.015848] <dbg> dai_comp: dai_common_config_prepare: comp:4 0x40004 channel = 0
> ...
> [ 1198.015883] <err> dai_comp: dai_common_config_prepare: comp:4 0x40004 dma_request_channel() failed
Based on this, it could be related to https://github.com/thesofproject/sof/issues/9584
I think the issue might be caused by / related to (kernel commit) 0247199fa8eb ("ASoC: SOF: ipc4-topology: Do not set ALH node_id for aggregated DAIs")
In sof_ipc4_dai_config(), if blob->alh_cfg.device_count > 1, we don't set copier_data->gtw_cfg.node_id.
But in sof_ipc4_prepare_copier_module() the list_for_each_entry(w, &sdev->widget_list, list) loop for ALH will try to use it:
node_type = SOF_IPC4_GET_NODE_TYPE(alh_data->gtw_cfg.node_id);
blob->alh_cfg.mapping[i].device = SOF_IPC4_NODE_TYPE(node_type);
and only after that loop do we set the node_id in the blob->alh_cfg.device_count > 1 case.
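To illustrate the suspected ordering, here is a minimal, self-contained C sketch; the macros, structs and values are placeholders standing in for the kernel code quoted above, not the actual driver implementation:

```c
/*
 * Simplified sketch of "read node_id inside the loop, set it only after the
 * loop when device_count > 1". Names are paraphrased from the discussion.
 */
#include <stdint.h>
#include <stdio.h>

#define NODE_TYPE_SHIFT 8
/* Stand-ins for SOF_IPC4_GET_NODE_TYPE()/SOF_IPC4_NODE_TYPE(); the real
 * encoding is a placeholder here. */
#define GET_NODE_TYPE(node_id)  ((node_id) >> NODE_TYPE_SHIFT)
#define NODE_TYPE(type)         ((type) << NODE_TYPE_SHIFT)

struct gtw_cfg  { uint32_t node_id; };
struct alh_map  { uint32_t device; };
struct alh_blob { uint32_t device_count; struct alh_map mapping[2]; };

int main(void)
{
	struct gtw_cfg gtw = { .node_id = 0 };        /* not set for aggregated DAIs */
	struct alh_blob blob = { .device_count = 2 }; /* aggregated: > 1 */

	/* The loop derives mapping[].device from node_id, which has not been
	 * written yet in the device_count > 1 case, so node_type is 0 here. */
	for (uint32_t i = 0; i < blob.device_count; i++) {
		uint32_t node_type = GET_NODE_TYPE(gtw.node_id);
		blob.mapping[i].device = NODE_TYPE(node_type);
	}

	/* Only after the loop is node_id filled in for device_count > 1
	 * (placeholder value; the real code uses ALH_MULTI_GTW_BASE). */
	gtw.node_id = NODE_TYPE(0x8) | 0x50;

	for (uint32_t i = 0; i < blob.device_count; i++)
		printf("mapping[%u].device = 0x%x\n",
		       (unsigned)i, (unsigned)blob.mapping[i].device);
	return 0;
}
```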
@ranj063
I'm not able to reproduce the problem on the Bangalore machines: ba-lnlm-rvp-sdw-01 and ba-lnlm-rvp-sdw-03 survived 300 iterations without a problem. On jf-lnlm-rvp-sdw-4, however, I am able to reproduce the problem.
@serhiy-katsyuba-intel, can it be a climate-related bug? ;) Seriously, the BA machines have different BIOS settings compared to the ones in JF, but how that could affect ID allocation and DMA channel selection is a mystery to me.
> The problem is not a recent regression. I tried to bisect to find the bad commit and went all the way down to a commit merged on Aug 5 and could still reproduce the problem. It is just harder to reproduce: I run 100 test iterations instead of 50; sometimes the test succeeds even with 100 iterations, but running it again eventually fails.
Thanks!
> It seems the problem is easier to reproduce on recent builds
Yes: I observed a huge jump in the reproduction rate, see data above. No idea why.
> I think the issue might be caused by / related to (kernel commit) 0247199fa8eb ("ASoC: SOF: ipc4-topology: Do not set ALH node_id for aggregated DAIs")
> In sof_ipc4_dai_config(), if blob->alh_cfg.device_count > 1, we don't set copier_data->gtw_cfg.node_id. But in sof_ipc4_prepare_copier_module() the list_for_each_entry(w, &sdev->widget_list, list) loop for ALH will try to use it: node_type = SOF_IPC4_GET_NODE_TYPE(alh_data->gtw_cfg.node_id); blob->alh_cfg.mapping[i].device = SOF_IPC4_NODE_TYPE(node_type); and only after that loop do we set the node_id in the blob->alh_cfg.device_count > 1 case.
If device_count == 1, gtw_cfg.node_id will be set in sof_ipc4_dai_config(), where it will be ipc4_copier->dai_index. And if device_count > 1, gtw_cfg.node_id will be set to an id based on ALH_MULTI_GTW_BASE in sof_ipc4_prepare_copier_module(). So we set gtw_cfg.node_id in all cases; it looks correct to me.
Not seen in PR/daily tests for a week, closing.