sof icon indicating copy to clipboard operation
sof copied to clipboard

[BUG] xrun on TGLH RVP NOCODEC ZEPHYR when testing multiple pause resume

Open XiaoyunWu6666 opened this issue 2 years ago • 14 comments

Describe the bug When multiple pause resume on TGLH RVP NOCODEC ZEPHYR, xrun happened in inner test result 14447

To Reproduce TPLG=/lib/firmware/intel/sof-tplg/sof-tgl-h-nocodec.tplg MODEL=TGLH_RVP_NOCODEC_ZEPHYR ~/sof-test/test-case/multiple-pause-resume.sh -r 50

Reproduction Rate Observe only once in CI so far

Environment Kernel Branch: topic/sof-dev Kernel Commit: 655b48ffac48 SOF Branch: main SOF Commit: 8881a406af67 Zephyr Commit: zephyr-v3.1.0-2058-g63a497280f22 Topology: sof-tgl-h-nocodec.tplg Platform: TGLH_RVP_NOCODEC_ZEPHYR

Screenshots or console output sof-logger

[    49988287.805311] (          45.104164) c0 pipe         1.3   ....../pipeline-stream.c:270  INFO pipe trigger cmd 6


[    49988417.805305] (          42.239582) c0 zll-schedule       src/schedule/zephyr_ll.c:69   INFO task complete 0xbe081200 pipe-task 
[    49988131.294900] (          57.291664) c0 dw-dma                 src/drivers/dw/dma.c:1128 ERROR dw_dma_get_data_size(): xrun detected
[    49988191.034481] (          59.739582) c0 dai          1.2     src/audio/dai-legacy.c:905  ERROR dai_report_xrun(): underrun due to no data available[    49988242.701146] (          51.666664) c0 dai          1.2     src/audio/dai-legacy.c:787  ERROR comp_underrun(): dev->comp.id = 2, source->avail = 192, copy_bytes = 0
[    49988330.826142] (          43.020832) c0 pipe         1.3   ....../pipeline-stream.c:143  ERROR pipeline_copy(): ret = -61, start->comp.id = 2, dir = 1
[    49988375.565724] (          44.739582) c0 pipe         1.3   ..../pipeline-schedule.c:208  ERROR pipeline_task(): xrun recovery failed! pipeline is stopped.
[    49988437.701138] (          19.895832) c0 zll-schedule       src/schedule/zephyr_ll.c:71   INFO num_tasks 3 total_num_tasks 3
[    49990044.836491] (        1607.135376) c0 host         7.36   src/audio/host-legacy.c:475  INFO no bytes to copy, 0 free in buffer, 384 available in DMA
[    49992040.721828] (        1995.885376) c0 host         7.36   src/audio/host-legacy.c:475  INFO no bytes to copy, 0 free in buffer, 384 available in DMA
[    49994040.930082] (        2000.208252) c0 host         7.36   src/audio/host-legacy.c:475  INFO no bytes to copy, 0 free in buffer, 384 available in DMA
[    49996040.982086] (        2000.052002) c0 host         7.36   src/audio/host-legacy.c:475  INFO no bytes to copy, 0 free in buffer, 384 available in DMA

XiaoyunWu6666 avatar Aug 05 '22 03:08 XiaoyunWu6666

@XiaoyunWu6666 is a brand new regression? Did it start with

  • #5976 (It was merged before 14447 test was run)

just like this other new regression?

  • #6075

marc-hb avatar Aug 05 '22 09:08 marc-hb

TGL: enable CONFIG_INCOHERENT correctly was reverted in #6094

marc-hb avatar Aug 05 '22 11:08 marc-hb

possible similar sighting in https://sof-ci.01.org/linuxpr/PR3796/build485/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pause-resume-5 xrun leading to IPC error.

It's a SoundWire v. nocodec configuration though, it could be a separate issue.

plbossart avatar Aug 05 '22 14:08 plbossart

Pause is always a P2.

lgirdwood avatar Aug 05 '22 14:08 lgirdwood

I disagree @lgirdwood, If we end-up with a fatal IPC error it's a P1. Xruns are tolerable since they are not fatal.

plbossart avatar Aug 05 '22 15:08 plbossart

I disagree @lgirdwood, If we end-up with a fatal IPC error it's a P1. Xruns are tolerable since they are not fatal.

It's P2 as it's real world usage is negligible and I'd rather focus on issues that will hit the majority of users (and we have some). Sure, become a P1 is it causes IPC timeout or crash.

lgirdwood avatar Aug 05 '22 15:08 lgirdwood

https://sof-ci.01.org/linuxpr/PR3796/build485/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pause-resume-5 is an IPC error and crash @lgirdwood it's like a kernel oops, it doesn't matter how we got there, it shouldn't happen. Rather, it shall not happen.

plbossart avatar Aug 05 '22 15:08 plbossart

https://sof-ci.01.org/linuxpr/PR3796/build485/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pause-resume-5 is an IPC error and crash @lgirdwood it's like a kernel oops, it doesn't matter how we got there, it shouldn't happen. Rather, it shall not happen.

This looks like the same issue/use case and it now qualifies as a P1 since it's a FW crash/fatal. i.e no longer a survivable IPC error reply as reported initially but now a fatal IPC timeout.

lgirdwood avatar Aug 05 '22 16:08 lgirdwood

I ran a stress multiple pause resume test on ADLP TGLH TGLU zephyr platforms (NOCODEC and SDW) to see if there are any clues

Found that there was underrun detected on ADLP RVP SDW ZEPHYR too,inner result 14494 on model ADLP_RVP_SDW_ZEPHYR when multiple-pause-resume [sof-logger]

[ 14816335070.157991] (           0.000000) c0 zll-schedule       src/schedule/zephyr_ll.c:280  INFO task add 0xbe08e4c0 pipe-task  priority 0 flags 0x0
[        3901.979012] (        3901.979004) c0 hda-dma            ..../intel/hda/hda-dma.c:948  ERROR hda_dma_link_check_xrun(): underrun detected
[        5899.114349] (        1997.135376) c0 hda-dma            ..../intel/hda/hda-dma.c:948  ERROR hda_dma_link_check_xrun(): underrun detected
[        8726.145487] (        2827.031250) c0 ipc                  src/ipc/ipc3/handler.c:1605 INFO ipc: new cmd 0x60060000
[        8752.916319] (          26.770832) c0 pipe         9.51  ....../pipeline-stream.c:270  INFO pipe trigger cmd 2
[        9810.207944] (        1057.291626) c0 zll-schedule       src/schedule/zephyr_ll.c:69   INFO task complete 0xbe0904c0 pipe-task 
[        9829.062109] (          18.854166) c0 zll-schedule       src/schedule/zephyr_ll.c:71   INFO num_tasks 3 total_num_tasks 3
[       11918.072443] (        2089.010254) c0 ipc                  src/ipc/ipc3/handler.c:1605 INFO ipc: new cmd 0x80010000
[       11947.291192] (          29.218748) c0 ipc                  src/ipc/ipc3/handler.c:595  INFO ipc: dai 3.3 -> config 
[       11968.437024] (          21.145832) c0 dai                           src/lib/dai.c:256  INFO dai_get type 3 index 3 new sref 2
[       11989.009940] (          20.572916) c0 hda-dai      3.3   /drivers/intel/hda/hda.c:42   INFO hda_set_config(): channels 2 rate 48000
[       12009.114106] (          20.104166) c0 dai                           src/lib/dai.c:280  INFO dai_put type 3 index 3 new sref 1
[       12029.999522] (          20.885416) c0 ipc                      src/ipc/ipc3/dai.c:192  INFO ipc_comp_dai_config() dai type = 3 index = 3
[       12057.759938] (          27.760416) c0 dai          9.50        src/ipc/ipc3/dai.c:290  INFO dai_config() dai type = 3 index = 3 dd 0x9e08f900
[       13130.103645] (        1072.343750) c0 ipc                  src/ipc/ipc3/handler.c:1605 INFO ipc: new cmd 0x60070000
[       13156.770311] (          26.666666) c0 pipe         9.51  ....../pipeline-stream.c:270  INFO pipe trigger cmd 8
[       13186.822393] (          30.052082) c0 zll-schedule       src/schedule/zephyr_ll.c:280  INFO task add 0xbe0904c0 pipe-task  priority 0 flags 0x0
[       15900.572285] (        2713.750000) c0 hda-dma            ..../intel/hda/hda-dma.c:948  ERROR hda_dma_link_check_xrun(): underrun detected
[       17898.645122] (        1998.072876) c0 hda-dma            ..../intel/hda/hda-dma.c:948  ERROR hda_dma_link_check_xrun(): underrun detected

no xrun on other platforms so far

XiaoyunWu6666 avatar Aug 08 '22 03:08 XiaoyunWu6666

@marcinszkudlinski fyi - could be related to issues you have been looking at. @lyakh can you take a look at this for TGL ?

lgirdwood avatar Aug 10 '22 14:08 lgirdwood

Doesn't this look like a #5850 duplicate?

lyakh avatar Aug 16 '22 08:08 lyakh

I found that daily test result Nr. 14447 and I see the xrun messages as quoted in this report, but I don't see IPC timeouts there?

lyakh avatar Aug 18 '22 15:08 lyakh

ah, I see that @lgirdwood changed the original report title and added an "IPC timeout" there. Interestingly, I don't see an IPC timeout in that daily result, but I found a similar xrun in daily result Nr. 14634 on ADLP RVP nocodec XTOS and there I do see an IPC timeout, so that one is more similar to #5850 although the latter one is on a Chrome device and it's the HDA DMA that's hitting an xrun there, unlike the former, where it's DW DMA.

lyakh avatar Aug 18 '22 15:08 lyakh

possible similar sighting in https://sof-ci.01.org/linuxpr/PR3796/build485/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pause-resume-5 xrun leading to IPC error.

It's a SoundWire v. nocodec configuration though, it could be a separate issue.

indeed, I'd say this one is more like #5850

lyakh avatar Aug 19 '22 11:08 lyakh

Having studied 5 daily and 1 PR test failures with IPC timeouts I came to the conclusion, that we are looking at 2 failure types:

topology	| DW DMA		| HDA DMA	| IPC timeout	| RTOS
-----------------------------------------------------------------------------------
nocodec		| 2 xruns in the end	| Ok		| on SSP	| both
HDA / SDW	| Ok			| many xruns	| on DMIC	| Zephyr

I tried to reproduce the problem with the multiple-pause-resume test on an UP-Xtreme TGL board with both topologies (nocodec was modified to only include SSP0) - no success so far.

lyakh avatar Aug 25 '22 12:08 lyakh

@lyakh I find the table above confusing, if you have a timeout on DMIC, isn't that using the DW DMA? how can we have xruns on the HDA DMA here?

plbossart avatar Aug 26 '22 08:08 plbossart

@lyakh I find the table above confusing, if you have a timeout on DMIC, isn't that using the DW DMA? how can we have xruns on the HDA DMA here?

@plbossart I meant, that there are no xruns on DW DMA in those cases. HDA DMA xruns - no idea if they're related, e.g.

[     8358500.761613] (        3635.572754) c0 hda-dma            ..../intel/hda/hda-dma.c:948  ERROR hda_dma_link_check_xrun(): underrun detected
[     8360501.699034] (        2000.937378) c0 hda-dma            ..../intel/hda/hda-dma.c:948  ERROR hda_dma_link_check_xrun(): underrun detected

see https://sof-ci.01.org/linuxpr/PR3796/build485/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pause-resume-5

lyakh avatar Aug 26 '22 10:08 lyakh

@lyakh The HDAudio host DMA would be used, there's really no notion of xrun at that level. There's a very large DDR buffer as destination.The xruns only make sense at the link level, no?

plbossart avatar Aug 26 '22 14:08 plbossart

The most recent occurrence of this bug, known to me was on ADLP_RVP_NOCODEC in daily test result 14634 of August 14 2022. Let's wait for another week until a full month with no reproduction, then we can close this.

lyakh avatar Sep 06 '22 10:09 lyakh

No longer reproducing

kfrydryx avatar Sep 13 '22 07:09 kfrydryx