
[BUG] multiple-pipeline-capture-50 fails on LNL

kv2019i opened this issue

Describe the bug

This test fails in the Intel daily test plan, observed on 16th Oct (UPDATE: the first failure was seen in the daily test of 11th Oct, Intel plan 46649; plan 46596 from 10th Oct was still good).

To Reproduce

~/sof-test/test-case/multiple-pipeline.sh -f c -l 50

Reproduction Rate

High (reproduces in every daily run)

Expected behavior

Test passes.

Impact

Environment

  • Linux Branch: topic/sof-dev
  • Linux Commit: a8cc01c750c1
  • KConfig Branch: master
  • KConfig Commit: 097c69ae4b81
  • SOF Branch: main
  • SOF Commit: 63f18c0f12db
  • Zephyr Commit: 155f3f3ba688

  1. Name of the topology file
    • Topology: sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg
  2. Name of the platform(s) on which the bug is observed.
    • Platform: lnl

Screenshots or console output

2024-10-16 14:08:54 UTC [REMOTE_INFO] ===== Testing: (Loop: 10/50) =====
2024-10-16 14:08:54 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg to run the test case
2024-10-16 14:08:54 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-10-16 14:08:54 UTC [REMOTE_INFO] Run command to get pipeline parameters
2024-10-16 14:08:54 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg -f 'type:capture & ~pcm:Amplifier Reference & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-10-16 14:08:54 UTC [REMOTE_INFO] Testing: Amp feedback [hw:0,3]
2024-10-16 14:08:54 UTC [REMOTE_COMMAND] arecord   -D hw:0,3 -c 4 -r 48000 -f S16_LE /dev/null -q
2024-10-16 14:08:54 UTC [REMOTE_INFO] Testing: Microphone [hw:0,4]
2024-10-16 14:08:54 UTC [REMOTE_COMMAND] arecord   -D hw:0,4 -c 2 -r 48000 -f S16_LE /dev/null -q
2024-10-16 14:08:54 UTC [REMOTE_INFO] Testing: Jack In [hw:0,1]
2024-10-16 14:08:54 UTC [REMOTE_INFO] wait 0.5s for aplay_opts()
2024-10-16 14:08:54 UTC [REMOTE_COMMAND] arecord   -D hw:0,1 -c 2 -r 48000 -f S16_LE /dev/null -q
2024-10-16 14:08:54 UTC [REMOTE_INFO] checking pipeline status
2024-10-16 14:08:55 UTC [REMOTE_INFO] Letting playback/capture run for 5s
overrun!!! (at least 61.447 ms long)
arecord: pcm_read:2221: read error: Invalid argument
2024-10-16 14:09:00 UTC [REMOTE_INFO] checking pipeline status again
2024-10-16 14:09:00 UTC [REMOTE_ERROR] Running process count is 2, but 3 is expected
29988 arecord -D hw:0,3 -c 4 -r 48000 -f S16_LE /dev/null -q
30005 arecord -D hw:0,1 -c 2 -r 48000 -f S16_LE /dev/null -q
2024-10-16 14:09:00 UTC [REMOTE_INFO] Starting func_exit_handler(1)
2024-10-16 14:09:00 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2024-10-16 14:09:00 UTC [REMOTE_ERROR]  func_error_exit()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh
2024-10-16 14:09:00 UTC [REMOTE_ERROR]  ps_checks()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:148
2024-10-16 14:09:00 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:209
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 762: 30005 Killed                  arecord $SOF_ALSA_OPTS $SOF_ARECORD_OPTS "$@"
2024-10-16 14:09:00 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py

kv2019i (Oct 18 '24)

Changed the bug label. This is not seen in pull-request CI, but rather only in the Intel daily test.

kv2019i (Oct 21 '24)

This is not seen in pull-request CI, but rather only in the Intel daily test.

It's both. PR example: https://sof-ci.01.org/softestpr/PR1236/build782/devicetest/index.html

This is clearly a recent regression; it happened some time in early October. More info in the (unrelated) https://github.com/thesofproject/sof/issues/9587#issuecomment-2417952744

marc-hb (Oct 22 '24)

(1) The problem is not a recent regression. I tried to bisect to find the bad commit and went down to a commit merged on Aug 5 and could still reproduce the problem. It is just harder to reproduce: I run 100 test iterations instead of 50, and sometimes it succeeds even with 100 iterations, but when I run the test again it eventually fails. It seems that on recent builds it is easier to reproduce the problem, but the problem exists with builds much older than the October builds.

(2) When the problem happens, the log shows something like this:

[ 1198.015848] <dbg> dai_comp: dai_common_config_prepare: comp:4 0x40004 channel = 0
...
[ 1198.015883] <err> dai_comp: dai_common_config_prepare: comp:4 0x40004 dma_request_channel() failed

Gateway 0x40004 requests HDA link DMA channel 0. However, that same channel is still in use by gateway 0x4:

[ 1193.396756] <dbg> dai_comp: dai_common_config_prepare: comp:0 0x4 channel = 0

From the same log, on all earlier successful iterations of the test, gateway 0x40004 was always requesting DMA channel 3:

[ 1193.418940] <dbg> dai_comp: dai_common_config_prepare: comp:4 0x40004 channel = 3

On LNL the DMA channel is specified in the blob that comes from the driver. I added more logging to copier_alh_assign_dai_index(). It shows that on successful iterations of the test, gateway 0x40004 was assigned stream_id == 4, dai_index == 4 and DMA channel 3. On the failed iteration: stream_id == 1, dai_index == 1, DMA channel == 0.
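
As a rough, standalone illustration of the lookup described above (this is NOT the SOF copier_alh_assign_dai_index() implementation; the struct layout, field names and the channel-mask handling below are assumptions made purely for illustration):

/*
 * Standalone sketch, not the SOF source: illustrates a gateway's DMA
 * channel being taken from the mapping table in the driver-provided blob.
 */
#include <stdio.h>
#include <stdint.h>

#define ALH_MAP_SIZE 16

struct alh_mapping {             /* one entry per aggregated DAI */
	uint32_t alh_id;         /* ALH gateway id from the topology */
	uint32_t channel_mask;   /* HDA link DMA channel(s) picked by the driver */
};

struct alh_blob {                /* mapping table delivered in the IPC blob */
	uint32_t device_count;
	struct alh_mapping mapping[ALH_MAP_SIZE];
};

/* Return the DMA channel the blob assigns to this gateway, or -1 if the
 * blob has no usable entry. */
static int alh_assign_dma_channel(const struct alh_blob *blob, uint32_t alh_id)
{
	for (uint32_t i = 0; i < blob->device_count; i++) {
		if (blob->mapping[i].alh_id == alh_id) {
			uint32_t mask = blob->mapping[i].channel_mask;
			return mask ? __builtin_ctz(mask) : -1; /* lowest set bit */
		}
	}
	return -1;
}

int main(void)
{
	/* Good iteration from the log: gateway 0x40004 ends up on channel 3. */
	struct alh_blob good = {
		.device_count = 1,
		.mapping = { { .alh_id = 0x40004, .channel_mask = 1u << 3 } },
	};
	/* Failing iteration per the log: the blob hands out channel 0, which
	 * gateway 0x4 is still holding, so dma_request_channel() later fails. */
	struct alh_blob bad = {
		.device_count = 1,
		.mapping = { { .alh_id = 0x40004, .channel_mask = 1u << 0 } },
	};

	printf("good blob -> channel %d\n", alh_assign_dma_channel(&good, 0x40004));
	printf("bad blob  -> channel %d\n", alh_assign_dma_channel(&bad, 0x40004));
	return 0;
}

The point of the sketch: whichever channel ends up in the blob is what the firmware requests, so a wrong stream_id/dai_index in the blob directly produces the channel conflict seen in the log.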

That data comes in the blob from the driver. There is a small possibility the blob is somehow corrupted on the DSP side; however, I would suggest looking at the driver side to see whether a wrong stream_id could be assigned to the blob. Maybe there is a bug in the logic selecting the stream_id on the driver side?

Just in case: I see gateway 0x4 is actually a "multi-gateway", i.e. a single copier using two gateways internally. Those gateways use HDA link channels 0 and 1. The multi-gateway case is somewhat unusual; maybe the driver has a bug in how it handles the stream_id when multi-gateways are in use.

serhiy-katsyuba-intel (Oct 29 '24)

@ranj063 @bardliao @ujfalusi Can you take a look at the above from the driver perspective? Is multi-gateway usage with 0x4 expected?

kv2019i (Oct 29 '24)

[ 1198.015848] dai_comp: dai_common_config_prepare: comp:4 0x40004 channel = 0
...
[ 1198.015883] dai_comp: dai_common_config_prepare: comp:4 0x40004 dma_request_channel() failed

... based on this, it could be related to https://github.com/thesofproject/sof/issues/9584

kv2019i (Oct 29 '24)

I think the issue might be caused by / related to (kernel commit) 0247199fa8eb ("ASoC: SOF: ipc4-topology: Do not set ALH node_id for aggregated DAIs")

In sof_ipc4_dai_config(), if blob->alh_cfg.device_count > 1, we don't set copier_data->gtw_cfg.node_id. But in sof_ipc4_prepare_copier_module(), the list_for_each_entry(w, &sdev->widget_list, list) loop for ALH will try to use it:

        node_type = SOF_IPC4_GET_NODE_TYPE(alh_data->gtw_cfg.node_id);
        blob->alh_cfg.mapping[i].device = SOF_IPC4_NODE_TYPE(node_type);

and only after that loop do we set the node_id in the blob->alh_cfg.device_count > 1 case.
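
A minimal, self-contained sketch of that ordering concern (this is not the kernel source; the macro value, struct layout and the example node_id are placeholders):

#include <stdio.h>

#define NODE_TYPE_SHIFT 8                        /* placeholder, not the real layout */
#define GET_NODE_TYPE(node_id) ((node_id) >> NODE_TYPE_SHIFT)

struct gtw_cfg { unsigned int node_id; };

int main(void)
{
	struct gtw_cfg gtw = { .node_id = 0 };   /* not yet set: device_count > 1 */
	unsigned int device_count = 2;
	unsigned int device[2];

	/* Per-DAI loop for the aggregated (ALH) case: it consumes node_id... */
	for (unsigned int i = 0; i < device_count; i++)
		device[i] = GET_NODE_TYPE(gtw.node_id);   /* observes 0 here */

	/* ...but node_id is only filled in after the loop. */
	if (device_count > 1)
		gtw.node_id = 0xC04;                      /* arbitrary example value */

	for (unsigned int i = 0; i < device_count; i++)
		printf("mapping[%u].device = 0x%x (derived before node_id was set)\n",
		       i, device[i]);
	return 0;
}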

@ranj063

ujfalusi (Oct 29 '24)

I'm not able to reproduce the problem on the Bangalore machines: ba-lnlm-rvp-sdw-01 and ba-lnlm-rvp-sdw-03 survived 300 iterations without a problem. On jf-lnlm-rvp-sdw-4, however, I'm able to reproduce the problem.

serhiy-katsyuba-intel (Oct 29 '24)

@serhiy-katsyuba-intel, could it be a climate-related bug? ;) Seriously, the BA machines have different BIOS settings compared to the ones in JF, but how that could affect ID allocation and DMA channel selection is a mystery to me.

ujfalusi (Oct 29 '24)

The problem is not a recent regression. I tried to bisect to find the bad commit and went down to a commit merged on Aug 5 and could still reproduce the problem. It is just harder to reproduce: I run 100 test iterations instead of 50, and sometimes it succeeds even with 100 iterations, but when I run the test again it eventually fails.

Thanks!

It seems that on recent builds it is easier to reproduce the problem

Yes: I observed a huge jump in the reproduction rate, see data above. No idea why.

marc-hb (Oct 29 '24)

I think the issue might be caused by / related to (kernel commit) 0247199fa8eb ("ASoC: SOF: ipc4-topology: Do not set ALH node_id for aggregated DAIs")

In sof_ipc4_dai_config(), if blob->alh_cfg.device_count > 1, we don't set copier_data->gtw_cfg.node_id. But in sof_ipc4_prepare_copier_module(), the list_for_each_entry(w, &sdev->widget_list, list) loop for ALH will try to use it:

        node_type = SOF_IPC4_GET_NODE_TYPE(alh_data->gtw_cfg.node_id);
        blob->alh_cfg.mapping[i].device = SOF_IPC4_NODE_TYPE(node_type);

and only after that loop do we set the node_id in the blob->alh_cfg.device_count > 1 case.

If device_count == 1, gtw_cfg.node_id will be set in sof_ipc4_dai_config(), and it will be ipc4_copier->dai_index. In the device_count > 1 case, gtw_cfg.node_id will be set to an ID based on ALH_MULTI_GTW_BASE in sof_ipc4_prepare_copier_module(). So we set gtw_cfg.node_id in all cases; it looks correct to me.
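
For illustration, a compact sketch of the two assignment paths described above (not the kernel code; the ALH_MULTI_GTW_BASE value, the helper name and the formula are placeholders):

#include <stdio.h>

#define ALH_MULTI_GTW_BASE 0x50   /* placeholder value, not taken from the kernel */

/* Returns the node_id each path would end up with; the exact formula in the
 * kernel differs, this only shows that both branches assign an id. */
static unsigned int resolve_node_id(unsigned int device_count, unsigned int dai_index)
{
	if (device_count == 1)
		return dai_index;                /* set in sof_ipc4_dai_config() */
	return ALH_MULTI_GTW_BASE + dai_index;   /* set in sof_ipc4_prepare_copier_module() */
}

int main(void)
{
	printf("single DAI:      node_id = 0x%x\n", resolve_node_id(1, 0x4));
	printf("aggregated DAIs: node_id = 0x%x\n", resolve_node_id(2, 0x4));
	return 0;
}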

bardliao (Oct 30 '24)

Not seen in PR/daily tests for a week, closing.

kv2019i (Dec 13 '24)