
[LNL] Multiple Pipeline Playback/capture stress test fails.

ssavati opened this issue 1 year ago • 7 comments

The Multiple Pipeline Playback/capture stress test fails for HDA/NOCODEC/SDW. The reproduction rate is low; the failure only happens sporadically during CI stress testing.

Steps to reproduce:

HDA test command:
TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg MODEL=LNLM_RVP_HDA SOF_TEST_INTERVAL=5 ~/sof-test/test-case/multiple-pipeline.sh -f p -l 200

NOCODEC test command:
TPLG=/lib/firmware/intel/development/sof-lnl-nocodec.tplg MODEL=LNLM_RVP_NOCODEC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/multiple-pipeline.sh -f p -l 200
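
Since the failure is sporadic, a wrapper along these lines (a rough sketch, not part of sof-test itself) can rerun the stress test back to back until it reproduces, leaving the failing run's logs as the newest directory under ~/sof-test/logs/multiple-pipeline/:

#!/bin/bash
# Sketch: rerun the HDA variant of the stress test until it fails.
export TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg
export MODEL=LNLM_RVP_HDA
export SOF_TEST_INTERVAL=5

run=0
while ~/sof-test/test-case/multiple-pipeline.sh -f p -l 200; do
    run=$((run + 1))
    echo "run $run passed, retrying"
done
echo "failure reproduced on run $((run + 1))"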

Config
Linux Branch: topic/sof-dev, Linux Commit: c75a95d34885
SOF Branch: stable-v2.10, SOF Commit: b15f1f1a3238

Issue reproduction: sporadic and low

2024-06-25 22:24:06 UTC [REMOTE_INFO] checking pipeline status again
2024-06-25 22:24:06 UTC [REMOTE_COMMAND] pkill -9 aplay arecord
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156308 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156301 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
declare -- cmd="journalctl_cmd --since=@1719354239"
2024-06-25 22:24:07 UTC [REMOTE_INFO] ===== Testing: (Loop: 31/200) =====
2024-06-25 22:24:07 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg to run the test case
2024-06-25 22:24:07 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-06-25 22:24:07 UTC [REMOTE_INFO] Run command to get pipeline parameters
2024-06-25 22:24:07 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg -f 'type:playback  & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-06-25 22:24:07 UTC [REMOTE_INFO] Testing: HDA Analog [hw:0,0]
2024-06-25 22:24:07 UTC [REMOTE_COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-25 22:24:07 UTC [REMOTE_INFO] Testing: Deepbuffer HDA Analog [hw:0,31]
2024-06-25 22:24:07 UTC [REMOTE_INFO] wait 0.5s for aplay_opts()
2024-06-25 22:24:07 UTC [REMOTE_COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-25 22:24:08 UTC [REMOTE_INFO] checking pipeline status
2024-06-25 22:24:08 UTC [REMOTE_INFO] Letting playback/capture run for 5s
2024-06-25 22:24:13 UTC [REMOTE_INFO] checking pipeline status again
2024-06-25 22:24:13 UTC [REMOTE_COMMAND] pkill -9 aplay arecord
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156408 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156400 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
declare -- cmd="journalctl_cmd --since=@1719354246"
2024-06-25 22:24:14 UTC [REMOTE_INFO] ===== Testing: (Loop: 32/200) =====
2024-06-25 22:24:14 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg to run the test case
2024-06-25 22:24:14 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-06-25 22:24:14 UTC [REMOTE_INFO] Run command to get pipeline parameters
2024-06-25 22:24:14 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg -f 'type:playback  & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-06-25 22:24:14 UTC [REMOTE_INFO] Testing: HDA Analog [hw:0,0]
2024-06-25 22:24:14 UTC [REMOTE_COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-25 22:24:14 UTC [REMOTE_INFO] Testing: Deepbuffer HDA Analog [hw:0,31]
2024-06-25 22:24:14 UTC [REMOTE_COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-25 22:24:14 UTC [REMOTE_INFO] wait 0.5s for aplay_opts()
2024-06-25 22:24:14 UTC [REMOTE_INFO] checking pipeline status
2024-06-25 22:24:14 UTC [REMOTE_INFO] Letting playback/capture run for 5s
2024-06-25 22:24:19 UTC [REMOTE_INFO] checking pipeline status again
2024-06-25 22:24:19 UTC [REMOTE_COMMAND] pkill -9 aplay arecord
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156499 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
declare -- cmd="journalctl_cmd --since=@1719354253"
2024-06-25 22:24:21 UTC [REMOTE_INFO] ===== Testing: (Loop: 33/200) =====
2024-06-25 22:24:21 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg to run the test case
2024-06-25 22:24:21 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-06-25 22:24:21 UTC [REMOTE_INFO] Run command to get pipeline parameters
2024-06-25 22:24:21 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg -f 'type:playback  & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-06-25 22:24:21 UTC [REMOTE_INFO] Testing: HDA Analog [hw:0,0]
2024-06-25 22:24:21 UTC [REMOTE_COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-25 22:24:21 UTC [REMOTE_INFO] Testing: Deepbuffer HDA Analog [hw:0,31]
2024-06-25 22:24:21 UTC [REMOTE_INFO] wait 0.5s for aplay_opts()
2024-06-25 22:24:21 UTC [REMOTE_COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
aplay: main:831: audio open error: Device or resource busy
2024-06-25 22:24:21 UTC [REMOTE_INFO] checking pipeline status
2024-06-25 22:24:21 UTC [REMOTE_INFO] Letting playback/capture run for 5s
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156506 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
2024-06-25 22:24:26 UTC [REMOTE_INFO] checking pipeline status again
2024-06-25 22:24:26 UTC [REMOTE_ERROR] Running process count is 1, but 2 is expected
156597 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156597 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
2024-06-25 22:24:26 UTC [REMOTE_INFO] Starting func_exit_handler(1)
2024-06-25 22:24:26 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2024-06-25 22:24:26 UTC [REMOTE_ERROR]  func_error_exit()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh
2024-06-25 22:24:26 UTC [REMOTE_ERROR]  ps_checks()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:148
2024-06-25 22:24:26 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:202
2024-06-25 22:24:31 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py
2024-06-25 22:24:31 UTC [REMOTE_INFO] nlines=4914 /home/ubuntu/sof-test/logs/multiple-pipeline/2024-06-25-22:20:33-2957/mtrace.txt
+ grep -B 2 -A 1 -i --word-regexp -e ERR -e ERROR -e '' -e OSError /home/ubuntu/sof-test/logs/multiple-pipeline/2024-06-25-22:20:33-2957/mtrace.txt
2024-06-25 22:24:37 UTC [REMOTE_INFO] ktime=14535 sof-test PID=153125: ending
2024-06-25 22:24:37 UTC [REMOTE_INFO] Test Result: FAIL!

ssavati avatar Jun 26 '24 11:06 ssavati

"pkill -9 aplay arecord" cmd not able to terminate as one of aplay/arecord instance is stuck/hang/blocked. So next iteration fails to start new instances playback/capture

ssavati avatar Jun 26 '24 11:06 ssavati

If https://github.com/thesofproject/sof-test/pull/1213 is merged, testing with it should tell us whether any aplay / arecord processes are blocked in the kernel. As suggested by @lgirdwood, enabling some kernel soft-lockup debugging options could help too.
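
For reference, a sketch of the runtime knobs that go with those debug options (illustrative values only; they require CONFIG_DETECT_HUNG_TASK / CONFIG_SOFTLOCKUP_DETECTOR to be built in):

# Warn in the kernel log when a task stays in uninterruptible sleep too long.
sudo sysctl -w kernel.hung_task_timeout_secs=30
sudo sysctl -w kernel.hung_task_panic=0      # log only, do not panic the DUT
# Keep the soft-lockup watchdog in log-only mode as well.
sudo sysctl -w kernel.softlockup_panic=0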

lyakh avatar Jun 28 '24 09:06 lyakh

it's already enabled @lyakh, see https://github.com/thesofproject/kconfig/blob/8189104a4f38167aaaa4e70eece20ba4b6c19ef1/kconfig-sof-default.sh#L21 and https://github.com/thesofproject/kconfig/blob/master/lock-stall-defconfig

plbossart avatar Jul 02 '24 09:07 plbossart

I've now tried to reproduce this on both the HDA and nocodec LNL setups. On HDA I only have 5 successful test runs so far, but on nocodec I just had 17 successful runs (about 7 hours) in a row, so the reproduction rate is very low. I still intend to catch the nocodec numbers up on HDA, but it's starting to look like this issue does not show itself in a reasonable time.

jsarha avatar Jul 04 '24 14:07 jsarha

Now I have 19 successful runs (about 8 hours) on the LNL HDA RVP too, so that case does not appear to reproduce either. @lgirdwood, @kv2019i, should we close this one?

BTW, both tests were run with a relatively recent daily build.

jsarha avatar Jul 08 '24 17:07 jsarha

Ack @jsarha. @ssavati, is it ok to close, or is this still seen in larger test plans? Either close or keep open as P3.

kv2019i avatar Jul 08 '24 18:07 kv2019i

if https://github.com/thesofproject/sof-test/pull/1213 is merged, testing with it should tell us whether any aplay / arecord processes are blocked in the kernel.

This just happened in daily test run /44499?model=LNLM_RVP_NOCODEC&testcase=multiple-pipeline-all-50

[ 5141.981465] kernel: task:aplay           state:R stack:10456 pid:60957 tgid:60957 ppid:60953  flags:0x00000006
[ 5141.981468] kernel: Call Trace:
[ 5141.981469] kernel:  <TASK>
[ 5141.981470] kernel:  __schedule+0x311/0xa80
[ 5141.981473] kernel:  schedule+0x3e/0x130
[ 5141.981474] kernel:  schedule_timeout+0xb1/0x130
[ 5141.981476] kernel:  ? __pfx_process_timeout+0x10/0x10
[ 5141.981480] kernel:  __snd_pcm_lib_xfer+0x6ab/0x900 [snd_pcm]
[ 5141.981488] kernel:  ? __pfx_interleaved_copy+0x10/0x10 [snd_pcm]
[ 5141.981492] kernel:  ? __snd_pcm_lib_xfer+0x1a2/0x900 [snd_pcm]
[ 5141.981496] kernel:  ? __pfx_default_write_copy+0x10/0x10 [snd_pcm]
[ 5141.981500] kernel:  ? __pfx_default_wake_function+0x10/0x10
[ 5141.981503] kernel:  snd_pcm_common_ioctl+0xf7a/0x1600 [snd_pcm]
[ 5141.981507] kernel:  ? ioctl_has_perm.constprop.0.isra.0+0xba/0x110
[ 5141.981511] kernel:  snd_pcm_ioctl+0x27/0x40 [snd_pcm]
[ 5141.981515] kernel:  __x64_sys_ioctl+0x8e/0xd0
[ 5141.981517] kernel:  do_syscall_64+0x9e/0x1a0
[ 5141.981519] kernel:  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 5141.981520] kernel: RIP: 0033:0x7f42f861794f
[ 5141.981521] kernel: RSP: 002b:00007ffe8f159ed0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 5141.981522] kernel: RAX: ffffffffffffffda RBX: 000055b4ddb26010 RCX: 00007f42f861794f
[ 5141.981523] kernel: RDX: 00007ffe8f159f30 RSI: 0000000040184150 RDI: 0000000000000004
[ 5141.981524] kernel: RBP: 000055b4ddb25f90 R08: 0000000000000000 R09: 000055b4ddb288c0
[ 5141.981524] kernel: R10: 000055b4ddb2e000 R11: 0000000000000246 R12: 0000000000000000
[ 5141.981525] kernel: R13: 000055b4ddb288c0 R14: 7fffffffffffffff R15: 0000000000000010
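
For the record, a trace like the one above can also be pulled on demand from a wedged DUT (generic sysrq usage, assuming kernel.sysrq allows it; this is not something the test case triggers automatically):

# Dump the stacks of all tasks into the kernel log ('w' instead of 't' dumps
# only tasks blocked in uninterruptible sleep).
echo t | sudo tee /proc/sysrq-trigger
# Then pull out the aplay/arecord entries:
sudo journalctl -k --no-pager | grep -A 25 'task:aplay'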

marc-hb avatar Aug 05 '24 16:08 marc-hb

Not seen in PR/daily tests for a week, closing.

kv2019i avatar Dec 13 '24 11:12 kv2019i