sof
sof copied to clipboard
[BUG] xrun on TGLH RVP NOCODEC ZEPHYR when testing multiple pause resume
Describe the bug When multiple pause resume on TGLH RVP NOCODEC ZEPHYR, xrun happened in inner test result 14447
To Reproduce TPLG=/lib/firmware/intel/sof-tplg/sof-tgl-h-nocodec.tplg MODEL=TGLH_RVP_NOCODEC_ZEPHYR ~/sof-test/test-case/multiple-pause-resume.sh -r 50
Reproduction Rate Observe only once in CI so far
Environment Kernel Branch: topic/sof-dev Kernel Commit: 655b48ffac48 SOF Branch: main SOF Commit: 8881a406af67 Zephyr Commit: zephyr-v3.1.0-2058-g63a497280f22 Topology: sof-tgl-h-nocodec.tplg Platform: TGLH_RVP_NOCODEC_ZEPHYR
Screenshots or console output sof-logger
[ 49988287.805311] ( 45.104164) c0 pipe 1.3 ....../pipeline-stream.c:270 INFO pipe trigger cmd 6
[ 49988417.805305] ( 42.239582) c0 zll-schedule src/schedule/zephyr_ll.c:69 INFO task complete 0xbe081200 pipe-task
[ 49988131.294900] ( 57.291664) c0 dw-dma src/drivers/dw/dma.c:1128 ERROR dw_dma_get_data_size(): xrun detected
[ 49988191.034481] ( 59.739582) c0 dai 1.2 src/audio/dai-legacy.c:905 ERROR dai_report_xrun(): underrun due to no data available[ 49988242.701146] ( 51.666664) c0 dai 1.2 src/audio/dai-legacy.c:787 ERROR comp_underrun(): dev->comp.id = 2, source->avail = 192, copy_bytes = 0
[ 49988330.826142] ( 43.020832) c0 pipe 1.3 ....../pipeline-stream.c:143 ERROR pipeline_copy(): ret = -61, start->comp.id = 2, dir = 1
[ 49988375.565724] ( 44.739582) c0 pipe 1.3 ..../pipeline-schedule.c:208 ERROR pipeline_task(): xrun recovery failed! pipeline is stopped.
[ 49988437.701138] ( 19.895832) c0 zll-schedule src/schedule/zephyr_ll.c:71 INFO num_tasks 3 total_num_tasks 3
[ 49990044.836491] ( 1607.135376) c0 host 7.36 src/audio/host-legacy.c:475 INFO no bytes to copy, 0 free in buffer, 384 available in DMA
[ 49992040.721828] ( 1995.885376) c0 host 7.36 src/audio/host-legacy.c:475 INFO no bytes to copy, 0 free in buffer, 384 available in DMA
[ 49994040.930082] ( 2000.208252) c0 host 7.36 src/audio/host-legacy.c:475 INFO no bytes to copy, 0 free in buffer, 384 available in DMA
[ 49996040.982086] ( 2000.052002) c0 host 7.36 src/audio/host-legacy.c:475 INFO no bytes to copy, 0 free in buffer, 384 available in DMA
@XiaoyunWu6666 is a brand new regression? Did it start with
- #5976 (It was merged before 14447 test was run)
just like this other new regression?
- #6075
TGL: enable CONFIG_INCOHERENT correctly
was reverted in #6094
possible similar sighting in https://sof-ci.01.org/linuxpr/PR3796/build485/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pause-resume-5 xrun leading to IPC error.
It's a SoundWire v. nocodec configuration though, it could be a separate issue.
Pause is always a P2.
I disagree @lgirdwood, If we end-up with a fatal IPC error it's a P1. Xruns are tolerable since they are not fatal.
I disagree @lgirdwood, If we end-up with a fatal IPC error it's a P1. Xruns are tolerable since they are not fatal.
It's P2 as it's real world usage is negligible and I'd rather focus on issues that will hit the majority of users (and we have some). Sure, become a P1 is it causes IPC timeout or crash.
https://sof-ci.01.org/linuxpr/PR3796/build485/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pause-resume-5 is an IPC error and crash @lgirdwood it's like a kernel oops, it doesn't matter how we got there, it shouldn't happen. Rather, it shall not happen.
https://sof-ci.01.org/linuxpr/PR3796/build485/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pause-resume-5 is an IPC error and crash @lgirdwood it's like a kernel oops, it doesn't matter how we got there, it shouldn't happen. Rather, it shall not happen.
This looks like the same issue/use case and it now qualifies as a P1 since it's a FW crash/fatal. i.e no longer a survivable IPC error reply as reported initially but now a fatal IPC timeout.
I ran a stress multiple pause resume test on ADLP TGLH TGLU zephyr platforms (NOCODEC and SDW) to see if there are any clues
Found that there was underrun detected on ADLP RVP SDW ZEPHYR too,inner result 14494 on model ADLP_RVP_SDW_ZEPHYR when multiple-pause-resume [sof-logger]
[ 14816335070.157991] ( 0.000000) c0 zll-schedule src/schedule/zephyr_ll.c:280 INFO task add 0xbe08e4c0 pipe-task priority 0 flags 0x0
[ 3901.979012] ( 3901.979004) c0 hda-dma ..../intel/hda/hda-dma.c:948 ERROR hda_dma_link_check_xrun(): underrun detected
[ 5899.114349] ( 1997.135376) c0 hda-dma ..../intel/hda/hda-dma.c:948 ERROR hda_dma_link_check_xrun(): underrun detected
[ 8726.145487] ( 2827.031250) c0 ipc src/ipc/ipc3/handler.c:1605 INFO ipc: new cmd 0x60060000
[ 8752.916319] ( 26.770832) c0 pipe 9.51 ....../pipeline-stream.c:270 INFO pipe trigger cmd 2
[ 9810.207944] ( 1057.291626) c0 zll-schedule src/schedule/zephyr_ll.c:69 INFO task complete 0xbe0904c0 pipe-task
[ 9829.062109] ( 18.854166) c0 zll-schedule src/schedule/zephyr_ll.c:71 INFO num_tasks 3 total_num_tasks 3
[ 11918.072443] ( 2089.010254) c0 ipc src/ipc/ipc3/handler.c:1605 INFO ipc: new cmd 0x80010000
[ 11947.291192] ( 29.218748) c0 ipc src/ipc/ipc3/handler.c:595 INFO ipc: dai 3.3 -> config
[ 11968.437024] ( 21.145832) c0 dai src/lib/dai.c:256 INFO dai_get type 3 index 3 new sref 2
[ 11989.009940] ( 20.572916) c0 hda-dai 3.3 /drivers/intel/hda/hda.c:42 INFO hda_set_config(): channels 2 rate 48000
[ 12009.114106] ( 20.104166) c0 dai src/lib/dai.c:280 INFO dai_put type 3 index 3 new sref 1
[ 12029.999522] ( 20.885416) c0 ipc src/ipc/ipc3/dai.c:192 INFO ipc_comp_dai_config() dai type = 3 index = 3
[ 12057.759938] ( 27.760416) c0 dai 9.50 src/ipc/ipc3/dai.c:290 INFO dai_config() dai type = 3 index = 3 dd 0x9e08f900
[ 13130.103645] ( 1072.343750) c0 ipc src/ipc/ipc3/handler.c:1605 INFO ipc: new cmd 0x60070000
[ 13156.770311] ( 26.666666) c0 pipe 9.51 ....../pipeline-stream.c:270 INFO pipe trigger cmd 8
[ 13186.822393] ( 30.052082) c0 zll-schedule src/schedule/zephyr_ll.c:280 INFO task add 0xbe0904c0 pipe-task priority 0 flags 0x0
[ 15900.572285] ( 2713.750000) c0 hda-dma ..../intel/hda/hda-dma.c:948 ERROR hda_dma_link_check_xrun(): underrun detected
[ 17898.645122] ( 1998.072876) c0 hda-dma ..../intel/hda/hda-dma.c:948 ERROR hda_dma_link_check_xrun(): underrun detected
no xrun on other platforms so far
@marcinszkudlinski fyi - could be related to issues you have been looking at. @lyakh can you take a look at this for TGL ?
Doesn't this look like a #5850 duplicate?
I found that daily test result Nr. 14447 and I see the xrun messages as quoted in this report, but I don't see IPC timeouts there?
ah, I see that @lgirdwood changed the original report title and added an "IPC timeout" there. Interestingly, I don't see an IPC timeout in that daily result, but I found a similar xrun in daily result Nr. 14634 on ADLP RVP nocodec XTOS and there I do see an IPC timeout, so that one is more similar to #5850 although the latter one is on a Chrome device and it's the HDA DMA that's hitting an xrun there, unlike the former, where it's DW DMA.
possible similar sighting in https://sof-ci.01.org/linuxpr/PR3796/build485/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pause-resume-5 xrun leading to IPC error.
It's a SoundWire v. nocodec configuration though, it could be a separate issue.
indeed, I'd say this one is more like #5850
Having studied 5 daily and 1 PR test failures with IPC timeouts I came to the conclusion, that we are looking at 2 failure types:
topology | DW DMA | HDA DMA | IPC timeout | RTOS
-----------------------------------------------------------------------------------
nocodec | 2 xruns in the end | Ok | on SSP | both
HDA / SDW | Ok | many xruns | on DMIC | Zephyr
I tried to reproduce the problem with the multiple-pause-resume test on an UP-Xtreme TGL board with both topologies (nocodec was modified to only include SSP0) - no success so far.
@lyakh I find the table above confusing, if you have a timeout on DMIC, isn't that using the DW DMA? how can we have xruns on the HDA DMA here?
@lyakh I find the table above confusing, if you have a timeout on DMIC, isn't that using the DW DMA? how can we have xruns on the HDA DMA here?
@plbossart I meant, that there are no xruns on DW DMA in those cases. HDA DMA xruns - no idea if they're related, e.g.
[ 8358500.761613] ( 3635.572754) c0 hda-dma ..../intel/hda/hda-dma.c:948 ERROR hda_dma_link_check_xrun(): underrun detected
[ 8360501.699034] ( 2000.937378) c0 hda-dma ..../intel/hda/hda-dma.c:948 ERROR hda_dma_link_check_xrun(): underrun detected
see https://sof-ci.01.org/linuxpr/PR3796/build485/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pause-resume-5
@lyakh The HDAudio host DMA would be used, there's really no notion of xrun at that level. There's a very large DDR buffer as destination.The xruns only make sense at the link level, no?
The most recent occurrence of this bug, known to me was on ADLP_RVP_NOCODEC in daily test result 14634 of August 14 2022. Let's wait for another week until a full month with no reproduction, then we can close this.
No longer reproducing