[BUG][LNL] call trace happens when testing multiple pipeline
Describe the bug
Observed on the LNLM-HDA platform when running the multiple-pipeline test; the reproduction rate is about 100%. From the console log, the test failed to kill the audio (playback/capture) processes.
dmesg
[ 3795.793458] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x12010000|0x0: GLB_DELETE_PIPELINE
[ 3795.793907] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
[ 3795.794001] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12010000|0x0: GLB_DELETE_PIPELINE
[ 3795.794011] snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.2 freed
[ 3795.794014] snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget dai-copier.HDA.Analog.playback freed
[ 3795.794166] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
[ 3796.701668] rfkill: input handler enabled
[ 3800.280963] rfkill: input handler disabled
[ 3933.120665] INFO: task aplay:5995 blocked for more than 122 seconds.
[ 3933.120676] Not tainted 6.6.0-rc1-g31685a688828 #dev
[ 3933.120679] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3933.120681] task:aplay state:D stack:13336 pid:5995 ppid:5991 flags:0x00000006
[ 3933.120688] Call Trace:
[ 3933.120691] <TASK>
[ 3933.120697] __schedule+0x303/0xaa0
[ 3933.120706] ? local_clock_noinstr+0xd/0xb0
[ 3933.120713] schedule+0x5b/0xa0
[ 3933.120716] rpm_resume+0x16d/0x7b0
[ 3933.120722] ? __pfx_autoremove_wake_function+0x10/0x10
[ 3933.120730] __pm_runtime_resume+0x4e/0x90
[ 3933.120737] snd_soc_pcm_component_pm_runtime_get+0x33/0xd0 [snd_soc_core]
[ 3933.120768] __soc_pcm_open+0x68/0x5a0 [snd_soc_core]
[ 3933.120790] ? lock_release+0x25c/0x3b0
[ 3933.120795] dpcm_be_dai_startup+0x146/0x330 [snd_soc_core]
[ 3933.120820] dpcm_fe_dai_open+0xdf/0x900 [snd_soc_core]
[ 3933.120840] ? debug_mutex_init+0x37/0x50
[ 3933.120847] snd_pcm_open_substream+0x54e/0x8b0 [snd_pcm]
[ 3933.120865] snd_pcm_open.part.0+0xd3/0x210 [snd_pcm]
[ 3933.120877] ? __pfx_default_wake_function+0x10/0x10
[ 3933.120885] snd_pcm_playback_open+0x59/0x90 [snd_pcm]
[ 3933.120896] ? __pfx_chrdev_open+0x10/0x10
[ 3933.120900] chrdev_open+0xc9/0x250
[ 3933.120905] do_dentry_open+0x177/0x550
[ 3933.120910] path_openat+0x765/0xa40
[ 3933.120914] ? local_clock_noinstr+0xd/0xb0
[ 3933.120919] do_filp_open+0xb6/0x160
[ 3933.120926] ? local_clock+0x15/0x30
[ 3933.120929] ? lock_release+0x25c/0x3b0
[ 3933.120933] ? _raw_spin_unlock+0x23/0x40
[ 3933.120940] do_sys_openat2+0x96/0xd0
[ 3933.120946] __x64_sys_openat+0x57/0xa0
[ 3933.120950] do_syscall_64+0x3c/0x90
[ 3933.120954] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 3933.120958] RIP: 0033:0x7f5f32f9d72b
[ 3933.120961] RSP: 002b:00007ffe7a735ca0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 3933.120965] RAX: ffffffffffffffda RBX: 0000000000080802 RCX: 00007f5f32f9d72b
[ 3933.120967] RDX: 0000000000080802 RSI: 00007ffe7a735d40 RDI: 00000000ffffff9c
[ 3933.120969] RBP: 00007ffe7a735d40 R08: 0000000000000000 R09: 00007ffe7a735ab7
[ 3933.120970] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000080802
[ 3933.120972] R13: 00007f5f33173eb2 R14: 00007ffe7a735d40 R15: 0000000081204101
[ 3933.120980] </TASK>
**console output**
2023-10-31 02:14:32 UTC [REMOTE_INFO] ===== Testing: (Loop: 32/50) =====
2023-10-31 02:14:32 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ace-tplg/sof-hda-generic-4ch.tplg to run the test case
2023-10-31 02:14:32 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2023-10-31 02:14:32 UTC [REMOTE_INFO] Run command to get pipeline parameters
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ace-tplg/sof-hda-generic-4ch.tplg -f 'type:playback & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2023-10-31 02:14:32 UTC [REMOTE_INFO] Testing: HDA Analog [hw:0,0]
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2023-10-31 02:14:32 UTC [REMOTE_INFO] Testing: Deepbuffer HDA Analog [hw:0,31]
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2023-10-31 02:14:32 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ace-tplg/sof-hda-generic-4ch.tplg to run the test case
2023-10-31 02:14:32 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2023-10-31 02:14:32 UTC [REMOTE_INFO] Run command to get pipeline parameters
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ace-tplg/sof-hda-generic-4ch.tplg -f 'type:capture & ~pcm:Amplifier Reference & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2023-10-31 02:14:32 UTC [REMOTE_INFO] Testing: HDA Analog [hw:0,0]
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] arecord -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/null -q
2023-10-31 02:14:32 UTC [REMOTE_INFO] Testing: DMIC Raw [hw:0,6]
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] arecord -D hw:0,6 -c 4 -r 48000 -f S32_LE /dev/null -q
2023-10-31 02:14:32 UTC [REMOTE_INFO] wait 0.5s for aplay_opts()
2023-10-31 02:14:33 UTC [REMOTE_INFO] checking pipeline status
2023-10-31 02:14:33 UTC [REMOTE_ERROR] Running process count is 8, but 4 is expected
8349 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
8357 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
8485 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
8492 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
8376 arecord -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/null -q
8385 arecord -D hw:0,6 -c 4 -r 48000 -f S32_LE /dev/null -q
8512 arecord -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/null -q
8520 arecord -D hw:0,6 -c 4 -r 48000 -f S32_LE /dev/null -q
2023-10-31 02:14:33 UTC [REMOTE_INFO] Starting func_exit_handler(1)
2023-10-31 02:14:33 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2023-10-31 02:14:33 UTC [REMOTE_ERROR] func_error_exit() @ /home/ubuntu/sof-test/test-case/multiple-pipeline.sh
2023-10-31 02:14:33 UTC [REMOTE_ERROR] ps_checks() @ /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:146
2023-10-31 02:14:33 UTC [REMOTE_ERROR] main() @ /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:193
2023-10-31 02:14:33 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py
2023-10-31 02:14:33 UTC [REMOTE_INFO] nlines=10751 /home/ubuntu/sof-test/logs/multiple-pipeline/2023-10-31-02:10:45-16322/mtrace.txt
2023-10-31 02:14:33 UTC [REMOTE_INFO] ktime=408 sof-test PID=3606: ending
2023-10-31 02:14:33 UTC [REMOTE_INFO] Test Result: FAIL!
To Reproduce
~/sof-test/test-case/multiple-pipeline.sh -f a -c 20 -l 50
Reproduction Rate: 100%
Environment
- Kernel: 31685a688828
- SOF: 165e68fdaff8
- Topology: sof-ace-tplg/sof-hda-generic-4ch.tplg
- Platform: LNL-HDA with Gen3.1 AIOC connected
@keqiaozhang the description above refers to LNL and MTL, which of the two is impacted? both?
Sorry for the typo, this is an LNL-specific issue. I have updated the description.
Can you try to disable this PCM device "Deepbuffer HDA Analog" to see if we still have the problem?
I see this sequence in the trace:
[ 3933.120795] dpcm_be_dai_startup+0x146/0x330 [snd_soc_core]
[ 3933.120820] dpcm_fe_dai_open+0xdf/0x900 [snd_soc_core]
And I wonder if this is a race in the BE startup/shutdown.
but that would be visible on other platforms also, like on upx-i11.
Too bad that the full dmesg is not available in CI (34110, 34168), but it appears that neither aplay nor arecord terminated with the pkill -9 in the previous iteration; a task in uninterruptible (D) sleep does not act on signals, so the SIGKILL stays pending until the task leaves the kernel. How the second set of aplay/arecord can start while the first set is still using the PCMs is puzzling.
My guess would be some sort of race / lockup in pm_runtime, but why this only affects LNL is a mystery.
I have managed to reproduce the issue on the CI device. The test case fails, and the kernel log shows that both PCMs (:0,0 and :0,31) are closed:
[ 1641.239184] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
[ 1641.241709] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0
at this point:
$ ps aux | grep aplay
ubuntu 7420 0.0 0.0 10376 3456 pts/1 D 13:22 0:00 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
ubuntu 7429 0.0 0.0 10376 3456 pts/1 D 13:22 0:00 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
ubuntu 7526 0.0 0.0 10376 3328 pts/1 D 13:23 0:00 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
ubuntu 7535 0.0 0.0 10376 3328 pts/1 D 13:23 0:00 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
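The `D` in the STAT column above means the aplay tasks are in uninterruptible sleep. A quick way to pull such tasks out of `ps aux` output is a small filter like the sketch below (the field positions assume the standard `ps aux` column order, and the PIDs in the sample are illustrative):

```shell
# list_d_state: print PID and command of tasks in uninterruptible
# sleep ("D" in the STAT column) from `ps aux`-style input.
list_d_state() {
    awk '$8 ~ /^D/ { printf "%s %s\n", $2, $11 }'
}

# Two sample lines in the layout shown above (PIDs are illustrative):
sample='ubuntu 7420 0.0 0.0 10376 3456 pts/1 D 13:22 0:00 aplay
ubuntu 7999 0.0 0.0 9076 2432 pts/1 R+ 13:22 0:00 grep'

printf '%s\n' "$sample" | list_d_state    # prints "7420 aplay"

# On a live system, /proc/<pid>/stack shows where a D-state task sleeps:
#   sudo cat /proc/7420/stack
```

On the live machine, reading /proc/&lt;pid&gt;/stack for one of the stuck PIDs should show the same rpm_resume() path as the hung-task dump.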
Then some time later the hung-task warnings start:
[ 1844.236958] INFO: task aplay:7420 blocked for more than 122 seconds.
repeats a few times, then on the terminal where the test was run:
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723: 7420 Killed aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723: 7526 Killed aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723: 7535 Killed aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723: 7429 Killed aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
Indeed ps aux | grep aplay returns empty and in dmesg:
[ 2115.251676] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 0 dir 0
[ 2115.251699] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 16384 bytes
[ 2115.251710] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 2115.251718] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 65536 bytes
[ 2115.252021] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 31 dir 0
[ 2115.252044] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 65535 bytes
[ 2115.252056] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 2115.252066] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 262144 bytes
[ 2115.252448] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0
[ 2115.252900] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
[ 2346.479971] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 2346.480449] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
[ 2346.480705] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 2346.480752] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 2346.654464] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[ 2346.654667] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 2346.654777] snd_sof_intel_hda_common:mtl_enable_interrupts: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1800]=0x0 successful
[ 2346.654805] snd_sof_intel_hda_common:mtl_enable_interrupts: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1140]=0x0 successful
[ 2346.655978] snd_sof_intel_hda_common:mtl_dsp_core_power_down: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x178d04]=0x2000000 successful
[ 2346.657100] snd_sof_intel_hda_common:mtl_power_down_dsp: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1000]=0x0 successful
[ 2346.657350] snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: Debug PCIR: 00000000 at 00000044
[ 2346.657372] snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: Debug PCIW: 00000010 at 00000044
[ 2346.658532] snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-lnl 0000:00:1f.3: Current DSP power state: D3
[ 2346.658546] snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 7 -> 0
Things are back to normal.
the full dmesg: dmesg-stuck.txt
Interesting... If I break out of the test to prevent the second batch of aplay from being started, with:
diff --git a/test-case/multiple-pipeline.sh b/test-case/multiple-pipeline.sh
index d075ba166f27..f7bb93e59d56 100755
--- a/test-case/multiple-pipeline.sh
+++ b/test-case/multiple-pipeline.sh
@@ -204,6 +204,15 @@ do
pkill -9 aplay || true
sleep 1 # try not to pollute the next iteration
+ if [ "$(pidof aplay arecord)" ]; then
+ dloge "Still have process not finished!!!"
+ # list aplay/arecord processes
+ pgrep -a -f aplay || true
+ pgrep -a -f arecord || true
+
+ exit 1
+ fi
+
# check kernel log for each iteration to catch issues
sof-kernel-log-check.sh "$KERNEL_CHECKPOINT" || die "Caught error in kernel log"
done
There is no lock warning in the dmesg, but the aplay processes are stuck for several minutes before they terminate.
[ 4220.026449] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
...
[ 4220.028512] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0
at this point:
$ ps aux | grep aplay
ubuntu 10930 0.0 0.0 10376 3456 pts/0 D 14:05 0:00 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
ubuntu 10938 0.0 0.0 10376 3328 pts/0 D 14:05 0:00 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
ubuntu 11066 0.0 0.0 9076 2432 pts/0 R+ 14:06 0:00 grep --color=auto aplay
and several minutes later
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723: 10938 Killed aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723: 10930 Killed aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
and dmesg:
[ 4694.101230] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 0 dir 0
[ 4694.101237] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 16384 bytes
[ 4694.101240] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 4694.101242] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 65536 bytes
[ 4694.101634] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 31 dir 0
[ 4694.101658] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 65535 bytes
[ 4694.101668] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 4694.101676] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 262144 bytes
[ 4694.101704] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0
[ 4694.102374] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
the full dmesg: dmesg-stuck-2.txt
The only thing that really changed for LNL is the addition of 'eml_lock' to protect access to shared registers in the multi-link structures. However it's not supposed to be used for HDaudio since it's considered legacy.
I think it is something PM-related, and probably not in the sound (SOF) stack itself, but I can be completely wrong. I have tried to get an ftrace, but it is hard to analyze (10.7 million lines...); let's look at this dmesg instead: dmesg-stuck-3.txt
It captures three iterations: two passing and the final one failing. Passing 1:
[ 791.545280] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 0 dir 0
[ 791.545505] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 31 dir 0
...
[ 797.129152] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
[ 797.131492] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0
passing 2:
[ 798.966107] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 0 dir 0
[ 798.966289] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 31 dir 0
...
[ 804.541873] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0
[ 804.544565] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
The third iteration gets stuck and the blocked-task dumps start at 984.105572:
[ 984.105572] INFO: task aplay:6226 blocked for more than 122 seconds.
[ 984.105590] Not tainted 6.6.0-rc1-daily-default-20231105-0-g09138cb33833 #dev
[ 984.105598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 984.105606] task:aplay state:D stack:12872 pid:6226 ppid:6222 flags:0x00000006
[ 984.105621] Call Trace:
[ 984.105627] <TASK>
[ 984.105645] __schedule+0x303/0xaa0
[ 984.105681] schedule+0x5b/0xa0
[ 984.105692] rpm_resume+0x16d/0x7b0
[ 984.105706] ? __pfx_autoremove_wake_function+0x10/0x10
[ 984.105731] __pm_runtime_resume+0x4e/0x90
[ 984.105748] snd_soc_pcm_component_pm_runtime_get+0x33/0xd0 [snd_soc_core]
[ 984.105793] __soc_pcm_open+0x68/0x5a0 [snd_soc_core]
[ 984.105833] dpcm_be_dai_startup+0x146/0x330 [snd_soc_core]
[ 984.105878] dpcm_fe_dai_open+0xdf/0x900 [snd_soc_core]
[ 984.105910] ? __pfx_dpcm_fe_dai_open+0x10/0x10 [snd_soc_core]
[ 984.105953] snd_pcm_open_substream+0x54e/0x8b0 [snd_pcm]
[ 984.105987] snd_pcm_open.part.0+0xd3/0x210 [snd_pcm]
[ 984.106011] ? __pfx_default_wake_function+0x10/0x10
[ 984.106035] snd_pcm_playback_open+0x59/0x90 [snd_pcm]
[ 984.106053] ? __pfx_chrdev_open+0x10/0x10
[ 984.106063] chrdev_open+0xc9/0x250
[ 984.106082] do_dentry_open+0x177/0x550
[ 984.106100] path_openat+0x765/0xa40
[ 984.106129] do_filp_open+0xb6/0x160
[ 984.106165] ? do_raw_spin_unlock+0x9/0xa0
[ 984.106200] do_sys_openat2+0x96/0xd0
[ 984.106220] __x64_sys_openat+0x57/0xa0
[ 984.106235] do_syscall_64+0x3c/0x90
[ 984.106247] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Since rpm_resume is stuck, this aplay is holding card->pcm_mutex and the other aplay is waiting on it. After the third dump, the PCMs suddenly got opened and closed right away (we had already sent the kill signal to them in the test script):
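The blocking chain described here (one opener parked inside rpm_resume() while holding card->pcm_mutex, the second opener queued behind it on the mutex) can be modeled in shell with flock; this is purely an illustration of the lock dependency, not the kernel code:

```shell
lockfile=$(mktemp)

# "aplay A": takes the lock (stands in for card->pcm_mutex), then blocks
# inside the critical section (stands in for the stuck rpm_resume()).
( flock 9; sleep 3 ) 9>"$lockfile" &

sleep 0.5   # give A time to take the lock

# "aplay B": queues on the same lock, gives up after 1 second.
if flock -w 1 9 9>"$lockfile"; then
    echo "got lock"
else
    echo "blocked on mutex"    # B stays stuck until A's critical section ends
fi

wait
rm -f "$lockfile"
```

Just as in the model, B makes no progress at all until A leaves its critical section, which matches both aplay instances sitting in D state together.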
[ 1278.769754] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 0 dir 0
[ 1278.769793] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 16384 bytes
[ 1278.769820] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 1278.769849] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 65536 bytes
[ 1278.770227] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 31 dir 0
[ 1278.770244] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 65535 bytes
[ 1278.770251] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 1278.770257] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 262144 bytes
[ 1278.771320] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
[ 1278.772885] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0
But the card is not powered down for some reason. If I start a single playback at this point, it also gets stuck for a while before it can continue. After about 3-4 minutes, pm_runtime finally runs the runtime suspend on the card:
[ 1509.816261] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 1509.816986] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
[ 1509.817390] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 1509.817463] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 1509.990616] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[ 1509.990875] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 1509.991017] snd_sof_intel_hda_common:mtl_enable_interrupts: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1800]=0x0 successful
[ 1509.991070] snd_sof_intel_hda_common:mtl_enable_interrupts: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1140]=0x0 successful
[ 1509.992393] snd_sof_intel_hda_common:mtl_dsp_core_power_down: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x178d04]=0x2000000 successful
[ 1509.993635] snd_sof_intel_hda_common:mtl_power_down_dsp: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1000]=0x0 successful
[ 1509.993945] snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: Debug PCIR: 00000000 at 00000044
[ 1509.994027] snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: Debug PCIW: 00000010 at 00000044
[ 1509.994865] snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-lnl 0000:00:1f.3: Current DSP power state: D3
[ 1509.994884] snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 7 -> 0
From here the card is usable again, up to the point when a similar lockup happens.
Before the lockup I don't see any indication that runtime PM would run on the SOF card (we have a 3 sec autosuspend); there would be indications in the log if it had been called. This kind of rules out the synchronize_irq() which I had suspected up until now.
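The runtime-PM state being reasoned about here can also be watched from sysfs while the test runs. A small sketch (the PCI address is taken from the logs above; which attributes exist depends on the kernel configuration):

```shell
# show_rpm_state: dump the standard runtime-PM sysfs attributes of a
# device's power directory. Defaults to the audio PCI device from the
# logs (0000:00:1f.3); the function name is hypothetical.
show_rpm_state() {
    local dev="${1:-/sys/bus/pci/devices/0000:00:1f.3/power}"
    local f
    for f in control runtime_status autosuspend_delay_ms runtime_suspended_time; do
        if [ -r "$dev/$f" ]; then
            printf '%-24s %s\n' "$f" "$(cat "$dev/$f")"
        fi
    done
}

show_rpm_state
```

During the stuck window, runtime_status would be expected to report "resuming" (or "active" without a matching suspend) rather than cycling back to "suspended" after the autosuspend delay.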
I don't see why this only happens on LNL - the HDA path is identical on TGL and MTL. I will try to get a usable ftrace out, but the capture time is long and the trace is filled with a huge amount of distracting stuff.
I got an ftrace log (11GB - 123 million lines); it is not easy to work with, but it looks like pm_runtime is what is blocking:
<...>-5820 [006] ...1. 508.248142: snd_soc_pcm_component_pm_runtime_get <-__soc_pcm_open
...
<...>-5820 [006] ...1. 508.248147: schedule <-rpm_resume
...
<...>-5820 [000] ...1. 980.872899: snd_soc_component_module_get <-__soc_pcm_open
I still don't see what is blocking it; the trace is huge...
~~One of the differences between LNL and previous platforms is that we power-up only the links that are active. In previous generations, we powered-up all links when the first one was active and de-powered all links when the last one became inactive.~~
~~The net result is that the resume time should be non-existent in LNL- devices because codecs are already synchronized. For LNL we have a much longer time and it's possible that this exposes a race condition leading to a deadlock of sorts.~~
Edit: never mind the comment above was for SoundWire and this is an HDAudio bug. Monday morning huh.
this is where we lock up:
[ 465.477593] snd_soc_core:snd_soc_pcm_component_pm_runtime_get: snd_hda_codec_realtek ehdaudio0D0: ASoC: before pm_runtime_get_sync() for ehdaudio0D0
in dpcm_fe_dai_open() -> dpcm_fe_dai_startup() -> dpcm_be_dai_startup() -> __soc_pcm_open() -> snd_soc_pcm_component_pm_runtime_get()
Full dmesg and test case output: dmesg-stuck-5.txt test-run-5.txt
hda_call_codec_suspend() is locking up, apparently in alc225_shutup() and also in hda_set_power_state(codec, AC_PWRST_D3).
Test-run log and dmesg, with the kernel built from here: https://github.com/ujfalusi/sof-linux/commits/peter/sof/lnl-debug
The ftrace log is left on the CI machine for reference.
This issue cannot be reproduced after BIOS upgrade. Closing this bug.
This issue happened again in CI daily test: https://sof-ci.ostc.intel.com/#/result/planresultdetail/36608?model=LNLM_RVP_HDA&testcase=multiple-pipeline-capture-50
@ujfalusi is this fixed? I haven't seen this for weeks.