linux icon indicating copy to clipboard operation
linux copied to clipboard

[BUG][LNL] call trace happens when testing multiple pipeline

Open keqiaozhang opened this issue 2 years ago • 15 comments

Describe the bug Observed this issue on LNLM-HDA platform. It happens when testing multiple pipeline. The reproduction rate is about 100% From the console log, it failed to kill audio(playback/capture) threads.

dmesg

[ 3795.793458] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x12010000|0x0: GLB_DELETE_PIPELINE
[ 3795.793907] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
[ 3795.794001] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12010000|0x0: GLB_DELETE_PIPELINE
[ 3795.794011] snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.2 freed
[ 3795.794014] snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget dai-copier.HDA.Analog.playback freed
[ 3795.794166] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
[ 3796.701668] rfkill: input handler enabled
[ 3800.280963] rfkill: input handler disabled
[ 3933.120665] INFO: task aplay:5995 blocked for more than 122 seconds.
[ 3933.120676]       Not tainted 6.6.0-rc1-g31685a688828 #dev
[ 3933.120679] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3933.120681] task:aplay           state:D stack:13336 pid:5995  ppid:5991   flags:0x00000006
[ 3933.120688] Call Trace:
[ 3933.120691]  <TASK>
[ 3933.120697]  __schedule+0x303/0xaa0
[ 3933.120706]  ? local_clock_noinstr+0xd/0xb0
[ 3933.120713]  schedule+0x5b/0xa0
[ 3933.120716]  rpm_resume+0x16d/0x7b0
[ 3933.120722]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 3933.120730]  __pm_runtime_resume+0x4e/0x90
[ 3933.120737]  snd_soc_pcm_component_pm_runtime_get+0x33/0xd0 [snd_soc_core]
[ 3933.120768]  __soc_pcm_open+0x68/0x5a0 [snd_soc_core]
[ 3933.120790]  ? lock_release+0x25c/0x3b0
[ 3933.120795]  dpcm_be_dai_startup+0x146/0x330 [snd_soc_core]
[ 3933.120820]  dpcm_fe_dai_open+0xdf/0x900 [snd_soc_core]
[ 3933.120840]  ? debug_mutex_init+0x37/0x50
[ 3933.120847]  snd_pcm_open_substream+0x54e/0x8b0 [snd_pcm]
[ 3933.120865]  snd_pcm_open.part.0+0xd3/0x210 [snd_pcm]
[ 3933.120877]  ? __pfx_default_wake_function+0x10/0x10
[ 3933.120885]  snd_pcm_playback_open+0x59/0x90 [snd_pcm]
[ 3933.120896]  ? __pfx_chrdev_open+0x10/0x10
[ 3933.120900]  chrdev_open+0xc9/0x250
[ 3933.120905]  do_dentry_open+0x177/0x550
[ 3933.120910]  path_openat+0x765/0xa40
[ 3933.120914]  ? local_clock_noinstr+0xd/0xb0
[ 3933.120919]  do_filp_open+0xb6/0x160
[ 3933.120926]  ? local_clock+0x15/0x30
[ 3933.120929]  ? lock_release+0x25c/0x3b0
[ 3933.120933]  ? _raw_spin_unlock+0x23/0x40
[ 3933.120940]  do_sys_openat2+0x96/0xd0
[ 3933.120946]  __x64_sys_openat+0x57/0xa0
[ 3933.120950]  do_syscall_64+0x3c/0x90
[ 3933.120954]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 3933.120958] RIP: 0033:0x7f5f32f9d72b
[ 3933.120961] RSP: 002b:00007ffe7a735ca0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 3933.120965] RAX: ffffffffffffffda RBX: 0000000000080802 RCX: 00007f5f32f9d72b
[ 3933.120967] RDX: 0000000000080802 RSI: 00007ffe7a735d40 RDI: 00000000ffffff9c
[ 3933.120969] RBP: 00007ffe7a735d40 R08: 0000000000000000 R09: 00007ffe7a735ab7
[ 3933.120970] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000080802
[ 3933.120972] R13: 00007f5f33173eb2 R14: 00007ffe7a735d40 R15: 0000000081204101
[ 3933.120980]  </TASK>

**console output **

2023-10-31 02:14:32 UTC [REMOTE_INFO] ===== Testing: (Loop: 32/50) =====
2023-10-31 02:14:32 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ace-tplg/sof-hda-generic-4ch.tplg to run the test case
2023-10-31 02:14:32 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2023-10-31 02:14:32 UTC [REMOTE_INFO] Run command to get pipeline parameters
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ace-tplg/sof-hda-generic-4ch.tplg -f 'type:playback  & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2023-10-31 02:14:32 UTC [REMOTE_INFO] Testing: HDA Analog [hw:0,0]
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2023-10-31 02:14:32 UTC [REMOTE_INFO] Testing: Deepbuffer HDA Analog [hw:0,31]
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2023-10-31 02:14:32 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ace-tplg/sof-hda-generic-4ch.tplg to run the test case
2023-10-31 02:14:32 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2023-10-31 02:14:32 UTC [REMOTE_INFO] Run command to get pipeline parameters
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ace-tplg/sof-hda-generic-4ch.tplg -f 'type:capture & ~pcm:Amplifier Reference & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2023-10-31 02:14:32 UTC [REMOTE_INFO] Testing: HDA Analog [hw:0,0]
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] arecord   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/null -q
2023-10-31 02:14:32 UTC [REMOTE_INFO] Testing: DMIC Raw [hw:0,6]
2023-10-31 02:14:32 UTC [REMOTE_COMMAND] arecord   -D hw:0,6 -c 4 -r 48000 -f S32_LE /dev/null -q
2023-10-31 02:14:32 UTC [REMOTE_INFO] wait 0.5s for aplay_opts()
2023-10-31 02:14:33 UTC [REMOTE_INFO] checking pipeline status
2023-10-31 02:14:33 UTC [REMOTE_ERROR] Running process count is 8, but 4 is expected
8349 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
8357 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
8485 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
8492 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
8376 arecord -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/null -q
8385 arecord -D hw:0,6 -c 4 -r 48000 -f S32_LE /dev/null -q
8512 arecord -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/null -q
8520 arecord -D hw:0,6 -c 4 -r 48000 -f S32_LE /dev/null -q
2023-10-31 02:14:33 UTC [REMOTE_INFO] Starting func_exit_handler(1)
2023-10-31 02:14:33 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2023-10-31 02:14:33 UTC [REMOTE_ERROR]  func_error_exit()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh
2023-10-31 02:14:33 UTC [REMOTE_ERROR]  ps_checks()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:146
2023-10-31 02:14:33 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:193
2023-10-31 02:14:33 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py
2023-10-31 02:14:33 UTC [REMOTE_INFO] nlines=10751 /home/ubuntu/sof-test/logs/multiple-pipeline/2023-10-31-02:10:45-16322/mtrace.txt
2023-10-31 02:14:33 UTC [REMOTE_INFO] ktime=408 sof-test PID=3606: ending
2023-10-31 02:14:33 UTC [REMOTE_INFO] Test Result: FAIL!

To Reproduce ~/sof-test/test-case/multiple-pipeline.sh -f a -c 20 -l 50

Reproduction Rate 100%

Environment

  1. Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
  2. Name of the topology file
    • Topology: {sof-ace-tplg/sof-hda-generic-4ch.tplg}
  3. Name of the platform(s) on which the bug is observed.
    • Platform: {LNL-HDA with Gen3.1 AIOC connected}

dmesg.txt mtrace.txt

keqiaozhang avatar Oct 31 '23 09:10 keqiaozhang

@keqiaozhang the description above refers to LNL and MTL, which of the two is impacted? both?

plbossart avatar Oct 31 '23 23:10 plbossart

@keqiaozhang the description above refers to LNL and MTL, which of the two is impacted? both?

Sorry for the typo, this is LNL specific issue. I have updated it in description.

keqiaozhang avatar Nov 01 '23 01:11 keqiaozhang

Can you try to disable this PCM device "Deepbuffer HDA Analog " to see if we have the problem?

i see this sequence in the trace

[ 3933.120795]  dpcm_be_dai_startup+0x146/0x330 [snd_soc_core]
[ 3933.120820]  dpcm_fe_dai_open+0xdf/0x900 [snd_soc_core]

And I wonder if this is a race in the BE startup/shutdown.

plbossart avatar Nov 02 '23 14:11 plbossart

Can you try to disable this PCM device "Deepbuffer HDA Analog " to see if we have the problem?

i see this sequence in the trace

[ 3933.120795]  dpcm_be_dai_startup+0x146/0x330 [snd_soc_core]
[ 3933.120820]  dpcm_fe_dai_open+0xdf/0x900 [snd_soc_core]

And I wonder if this is a race in the BE startup/shutdown.

but that would be visible on other platforms also, like on upx-i11.

ujfalusi avatar Nov 03 '23 11:11 ujfalusi

Too bad that the full dmesg is not available in CI (34110, 34168), but it appears that neither aplay or arecord terminates with the pkill -9 in the previous iteration. How the second set of aplay/arecord can start when we still have the same set using the PCMs is puzzling. My guess would be some sort of race / lockup in pm_runtime, but why this only affects LNL is a mystery.

ujfalusi avatar Nov 03 '23 12:11 ujfalusi

I have managed to reproduce the issue on the CI device. The test case fails, the kernel log shows that both PCM (:0,0 and :0,31) is closed:

[ 1641.239184] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
[ 1641.241709] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0

at this point:

$ ps aux | grep aplay
ubuntu      7420  0.0  0.0  10376  3456 pts/1    D    13:22   0:00 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
ubuntu      7429  0.0  0.0  10376  3456 pts/1    D    13:22   0:00 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
ubuntu      7526  0.0  0.0  10376  3328 pts/1    D    13:23   0:00 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
ubuntu      7535  0.0  0.0  10376  3328 pts/1    D    13:23   0:00 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q

Then some time later the lockdep print starts:

[ 1844.236958] INFO: task aplay:7420 blocked for more than 122 seconds.

repeats few times then on the terminal where the test was run:

/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723:  7420 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723:  7526 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723:  7535 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723:  7429 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"

Indeed ps aux | grep aplay returns empty and in dmesg:

[ 2115.251676] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 0 dir 0
[ 2115.251699] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 16384 bytes
[ 2115.251710] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 2115.251718] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 65536 bytes
[ 2115.252021] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 31 dir 0
[ 2115.252044] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 65535 bytes
[ 2115.252056] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 2115.252066] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 262144 bytes
[ 2115.252448] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0
[ 2115.252900] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
[ 2346.479971] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 2346.480449] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
[ 2346.480705] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 2346.480752] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 2346.654464] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[ 2346.654667] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 2346.654777] snd_sof_intel_hda_common:mtl_enable_interrupts: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1800]=0x0 successful
[ 2346.654805] snd_sof_intel_hda_common:mtl_enable_interrupts: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1140]=0x0 successful
[ 2346.655978] snd_sof_intel_hda_common:mtl_dsp_core_power_down: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x178d04]=0x2000000 successful
[ 2346.657100] snd_sof_intel_hda_common:mtl_power_down_dsp: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1000]=0x0 successful
[ 2346.657350] snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: Debug PCIR: 00000000 at  00000044
[ 2346.657372] snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: Debug PCIW: 00000010 at  00000044
[ 2346.658532] snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-lnl 0000:00:1f.3: Current DSP power state: D3
[ 2346.658546] snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 7 -> 0

Things are back to normal.

the full dmesg: dmesg-stuck.txt

ujfalusi avatar Nov 03 '23 13:11 ujfalusi

interesting... If I break out from the test to avoid the second batch of aplay to be started with:

diff --git a/test-case/multiple-pipeline.sh b/test-case/multiple-pipeline.sh
index d075ba166f27..f7bb93e59d56 100755
--- a/test-case/multiple-pipeline.sh
+++ b/test-case/multiple-pipeline.sh
@@ -204,6 +204,15 @@ do
     pkill -9 aplay   || true
     sleep 1 # try not to pollute the next iteration
 
+    if [ "$(pidof aplay arecord)" ]; then
+        dloge "Still have process not finished!!!"
+        # list aplay/arecord processes
+        pgrep -a -f aplay || true
+        pgrep -a -f arecord || true
+
+        exit 1
+    fi
+
     # check kernel log for each iteration to catch issues
     sof-kernel-log-check.sh "$KERNEL_CHECKPOINT" || die "Caught error in kernel log"
 done

There is no lock warning in the dmesg but the aplays are stuck for several minutes then they terminate.

[ 4220.026449] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
...
[ 4220.028512] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0

at this point:

$ ps aux | grep aplay
ubuntu     10930  0.0  0.0  10376  3456 pts/0    D    14:05   0:00 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
ubuntu     10938  0.0  0.0  10376  3328 pts/0    D    14:05   0:00 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
ubuntu     11066  0.0  0.0   9076  2432 pts/0    R+   14:06   0:00 grep --color=auto aplay

and several minutes later

/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723: 10938 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 723: 10930 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"

and dmesg:

[ 4694.101230] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 0 dir 0
[ 4694.101237] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 16384 bytes
[ 4694.101240] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 4694.101242] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 65536 bytes
[ 4694.101634] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 31 dir 0
[ 4694.101658] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 65535 bytes
[ 4694.101668] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 4694.101676] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 262144 bytes
[ 4694.101704] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0
[ 4694.102374] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0

the full dmesg: dmesg-stuck-2.txt

ujfalusi avatar Nov 03 '23 14:11 ujfalusi

The only thing that really changed for LNL is the addition of 'eml_lock' to protect access to shared registers in the multi-link structures. However it's not supposed to be used for HDaudio since it's considered legacy.

plbossart avatar Nov 03 '23 15:11 plbossart

In think it is something PM related and I think it is not really in the sound (SOF) stack, but I can be completely wrong. I have tried to get an ftrace but that is hard to analyze (10.7 million lines...), but let's see this dmesg: dmesg-stuck-3.txt

It captures three iterations: two passing and the final failing. passing 1:

[  791.545280] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 0 dir 0
[  791.545505] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 31 dir 0
...
[  797.129152] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
[  797.131492] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0

passing 2:

[  798.966107] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 0 dir 0
[  798.966289] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 31 dir 0
...
[  804.541873] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0
[  804.544565] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0

The third iteration get's stuck and we have the blocked task dumps starting at 984.105572:

[  984.105572] INFO: task aplay:6226 blocked for more than 122 seconds.
[  984.105590]       Not tainted 6.6.0-rc1-daily-default-20231105-0-g09138cb33833 #dev
[  984.105598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  984.105606] task:aplay           state:D stack:12872 pid:6226  ppid:6222   flags:0x00000006
[  984.105621] Call Trace:
[  984.105627]  <TASK>
[  984.105645]  __schedule+0x303/0xaa0
[  984.105681]  schedule+0x5b/0xa0
[  984.105692]  rpm_resume+0x16d/0x7b0
[  984.105706]  ? __pfx_autoremove_wake_function+0x10/0x10
[  984.105731]  __pm_runtime_resume+0x4e/0x90
[  984.105748]  snd_soc_pcm_component_pm_runtime_get+0x33/0xd0 [snd_soc_core]
[  984.105793]  __soc_pcm_open+0x68/0x5a0 [snd_soc_core]
[  984.105833]  dpcm_be_dai_startup+0x146/0x330 [snd_soc_core]
[  984.105878]  dpcm_fe_dai_open+0xdf/0x900 [snd_soc_core]
[  984.105910]  ? __pfx_dpcm_fe_dai_open+0x10/0x10 [snd_soc_core]
[  984.105953]  snd_pcm_open_substream+0x54e/0x8b0 [snd_pcm]
[  984.105987]  snd_pcm_open.part.0+0xd3/0x210 [snd_pcm]
[  984.106011]  ? __pfx_default_wake_function+0x10/0x10
[  984.106035]  snd_pcm_playback_open+0x59/0x90 [snd_pcm]
[  984.106053]  ? __pfx_chrdev_open+0x10/0x10
[  984.106063]  chrdev_open+0xc9/0x250
[  984.106082]  do_dentry_open+0x177/0x550
[  984.106100]  path_openat+0x765/0xa40
[  984.106129]  do_filp_open+0xb6/0x160
[  984.106165]  ? do_raw_spin_unlock+0x9/0xa0
[  984.106200]  do_sys_openat2+0x96/0xd0
[  984.106220]  __x64_sys_openat+0x57/0xa0
[  984.106235]  do_syscall_64+0x3c/0x90
[  984.106247]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8

since rpm_resume is stuck this aplay is keeping the card->pcm_mutex, the other aplay is waiting on it. After the third dump, suddenly the PCMs got opened and closed right away (we have sent the kill signal to them in the test script):

[ 1278.769754] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 0 dir 0
[ 1278.769793] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 16384 bytes
[ 1278.769820] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 1278.769849] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 65536 bytes
[ 1278.770227] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: open stream 31 dir 0
[ 1278.770244] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period min 192 max 65535 bytes
[ 1278.770251] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: period count 2 max 16
[ 1278.770257] snd_sof:sof_pcm_open: sof-audio-pci-intel-lnl 0000:00:1f.3: buffer max 262144 bytes
[ 1278.771320] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 31 dir 0
[ 1278.772885] snd_sof:sof_pcm_close: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: close stream 0 dir 0

But the card is not powered down for some reason. If I would start a single playback at this point, it will be also 'got stuck' for some time before able to continue! Waiting about 3-4 minutes and the pm_runtime finally runs the runtime suspend on the card:

[ 1509.816261] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 1509.816986] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
[ 1509.817390] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 1509.817463] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 1509.990616] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[ 1509.990875] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 1509.991017] snd_sof_intel_hda_common:mtl_enable_interrupts: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1800]=0x0 successful
[ 1509.991070] snd_sof_intel_hda_common:mtl_enable_interrupts: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1140]=0x0 successful
[ 1509.992393] snd_sof_intel_hda_common:mtl_dsp_core_power_down: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x178d04]=0x2000000 successful
[ 1509.993635] snd_sof_intel_hda_common:mtl_power_down_dsp: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x1000]=0x0 successful
[ 1509.993945] snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: Debug PCIR: 00000000 at  00000044
[ 1509.994027] snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: Debug PCIW: 00000010 at  00000044
[ 1509.994865] snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-lnl 0000:00:1f.3: Current DSP power state: D3
[ 1509.994884] snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 7 -> 0

From here the card is usable again until the point when similar lockup happens.

Before the lockup I don't see any indication that runtime pm would run on the SOF card (we have 3 sec autosuspend), there would be indications in the log if it has been called - this kind of rules out the synchronize_irq() which I have suspected up until now.

I don't see why this only happens on LNL - HDA path is identical on TGL, MTL. I will try to get some usable ftrace out, but the time is too long and if I capture it, it is filled with huge amount of distracting stuff

ujfalusi avatar Nov 06 '23 12:11 ujfalusi

I got an ftrace log (11GB - 123 million lines), not easy to work with but it looks like that the pm_runtime is the one which is blocking:

           <...>-5820    [006] ...1.   508.248142: snd_soc_pcm_component_pm_runtime_get <-__soc_pcm_open
...
           <...>-5820    [006] ...1.   508.248147: schedule <-rpm_resume
...
           <...>-5820    [000] ...1.   980.872899: snd_soc_component_module_get <-__soc_pcm_open

I still don't see what is blocking, the trace is huge...

ujfalusi avatar Nov 06 '23 15:11 ujfalusi

~~One of the differences between LNL and previous platforms is that we power-up only the links that are active. In previous generations, we powered-up all links when the first one was active and de-powered all links when the last one became inactive.~~

~~The net result is that the resume time should be non-existent in LNL- devices because codecs are already synchronized. For LNL we have a much longer time and it's possible that this exposes a race condition leading to a deadlock of sorts.~~

Edit: never mind the comment above was for SoundWire and this is an HDAudio bug. Monday morning huh.

plbossart avatar Nov 06 '23 15:11 plbossart

this is where we lock up:

[  465.477593] snd_soc_core:snd_soc_pcm_component_pm_runtime_get: snd_hda_codec_realtek ehdaudio0D0: ASoC: before pm_runtime_get_sync() for ehdaudio0D0

in dpcm_fe_dai_open() -> dpcm_fe_dai_startup() -> dpcm_be_dai_startup() -> __soc_pcm_open() -> snd_soc_pcm_component_pm_runtime_get()

Full dmesg and test case output: dmesg-stuck-5.txt test-run-5.txt

ujfalusi avatar Nov 07 '23 08:11 ujfalusi

The hda_call_codec_suspend() is locking up, apparently in alc225_shutup() and also in hda_set_power_state(codec, AC_PWRST_D3) Test run log, dmesg with the kernel built from here: https://github.com/ujfalusi/sof-linux/commits/peter/sof/lnl-debug

dmesg-7.txt test-run-7.txt

The ftrace log is left on the CI machine for reference.

ujfalusi avatar Nov 07 '23 13:11 ujfalusi

This issue cannot be reproduced after BIOS upgrade. Closing this bug.

keqiaozhang avatar Dec 25 '23 01:12 keqiaozhang

This issue happened again in CI daily test: https://sof-ci.ostc.intel.com/#/result/planresultdetail/36608?model=LNLM_RVP_HDA&testcase=multiple-pipeline-capture-50

keqiaozhang avatar Jan 09 '24 08:01 keqiaozhang

@ujfalusi is this fixed? I haven't seen this for weeks.

marc-hb avatar May 01 '24 15:05 marc-hb