check-suspend-resume-with-playback failed with workqueue lockup error
Found this error during the weekend stress test on LNLM_RVP_HDA.
The command to reproduce: TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg MODEL=LNLM_RVP_HDA SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 100 -m playback
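For context on what the test does each round: it keeps aplay running in the background while cycling suspend/resume with rtcwake and then checking the kernel log. A rough sketch of that loop, based only on the commands visible in the test output below (the device, format, and loop details are simplified assumptions; the real logic lives in sof-test's check-suspend-resume-with-audio.sh):

aplay -D hw:0,0 -f dat /dev/zero &        # background playback; device/format are assumptions
aplay_pid=$!
for round in $(seq 1 100); do             # -l 100
    echo "===== Round(${round}/100) ====="
    sudo rtcwake -m mem -s 5              # suspend to RAM, RTC wake after 5 s
    sleep 5                               # SOF_TEST_INTERVAL=5
    # the test then scans the kernel log for errors logged during this round
done
kill "$aplay_pid"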
The error message is:
===========================>>
[ 1843.893004] kernel: BUG: workqueue lockup - pool cpus=0-7 node=0 flags=0x4 nice=0 stuck for 54s!
<<===========================
An input/output error had already occurred in the 4th loop:
2024-07-20 16:15:50 UTC Sub-Test: [REMOTE_INFO] ===== Round(4/100) =====
2024-07-20 16:15:50 UTC Sub-Test: [REMOTE_COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Sat Jul 20 16:15:56 2024
2024-07-20 16:15:56 UTC Sub-Test: [REMOTE_COMMAND] sleep for 5
aplay: pcm_write:2127: write error: Input/output error
The kernel lockup then happened in the 25th loop:
2024-07-20 16:20:07 UTC Sub-Test: [REMOTE_INFO] ===== Round(25/100) =====
2024-07-20 16:20:07 UTC Sub-Test: [REMOTE_COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Sat Jul 20 16:20:13 2024
2024-07-20 16:21:58 UTC Sub-Test: [REMOTE_COMMAND] sleep for 5
2024-07-20 16:22:06 UTC Sub-Test: [REMOTE_INFO] Check for the kernel log status
declare -- cmd="journalctl_cmd --since=@1721492406"
journalctl_cmd is a function
journalctl_cmd ()
{
sudo journalctl -k -q --no-pager --utc --output=short-monotonic --no-hostname "$@"
}
2024-07-20 16:22:06 UTC [ERROR] Caught kernel log error
===========================>>
[ 1843.893004] kernel: BUG: workqueue lockup - pool cpus=0-7 node=0 flags=0x4 nice=0 stuck for 54s!
<<===========================
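The log check above is scoped to the current round by passing --since=@<epoch> to the journalctl_cmd helper shown in the output; the following reproduces that scan by hand (the grep patterns are an assumption, sof-test's actual error patterns are not shown here):

since=1721492406    # epoch boundary taken from the failing round above
sudo journalctl -k -q --no-pager --utc --output=short-monotonic --no-hostname --since=@"${since}" \
    | grep -E 'BUG:|Oops|Call Trace'    # error patterns are an assumption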
I don't remember seeing this error recently; the reproduction rate is likely very low.
Intel Internal test result link: planresultdetail/44120?model=LNLM_RVP_HDA&testcase=check-suspend-resume-with-playback-100
dmesg is not available, @fredoh9?
Right, dmesg was not captured in this case. I'm also looking into why logs are not collected in some other failure cases.
I will use journalctl and upload the equivalent dmesg output when I have access to the device.
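For reference, a dmesg-equivalent log can be pulled from journalctl along these lines (the boot index and output file name here are assumptions):

sudo journalctl --list-boots                                                                  # find the boot that contains the failure
sudo journalctl -k -b -1 --no-pager --no-hostname -o short-precise > dmesg_equivalent.txt    # -1 = previous boot; adjust as needed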
dmesg for the failing boot from ba-lnlm-rvp-hda-02: dmesg_20240720_stress_test.txt
Jul 20 16:21:58.904948 kernel: snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-lnl 0000:00:1f.3: Current DSP power state: D3
Jul 20 16:21:58.905276 kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 7 -> 0
Jul 20 16:21:58.905612 kernel: ACPI: EC: interrupt blocked
Jul 20 16:21:58.905671 kernel: ACPI: EC: interrupt unblocked
Jul 20 16:21:58.905713 kernel: BUG: workqueue lockup - pool cpus=0-7 node=0 flags=0x4 nice=0 stuck for 54s!
Jul 20 16:21:58.905922 kernel: Showing busy workqueues and worker pools:
Jul 20 16:21:58.952292 kernel: workqueue events: flags=0x0
Jul 20 16:21:58.952375 kernel: pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
Jul 20 16:21:58.952417 kernel: pending: vmstat_shepherd
Jul 20 16:21:58.952457 kernel: workqueue events_unbound: flags=0x2
Jul 20 16:21:58.952496 kernel: pwq 33: cpus=0-7 node=0 flags=0x4 nice=0 active=1 refcnt=2
Jul 20 16:21:58.952542 kernel: pending: crng_reseed
Not seen in recent weeks of LNL daily tests, closing.