check-suspend-resume-with-playback failed with workqueue lockup error
Found this error during the weekend stress test on LNLM_RVP_HDA.
The command to reproduce: TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg MODEL=LNLM_RVP_HDA SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 100 -m playback
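For context on what the test does each round: it keeps aplay running in the background while cycling suspend/resume with rtcwake and then checking the kernel log. A rough sketch of that loop, based only on the commands visible in the test output below (the device, format, and loop details are simplified assumptions; the real logic lives in sof-test's check-suspend-resume-with-audio.sh):

aplay -D hw:0,0 -f dat /dev/zero &        # background playback; device/format are assumptions
aplay_pid=$!
for round in $(seq 1 100); do             # -l 100
    echo "===== Round(${round}/100) ====="
    sudo rtcwake -m mem -s 5              # suspend to RAM, RTC wake after 5 s
    sleep 5                               # SOF_TEST_INTERVAL=5
    # the test then scans the kernel log for errors logged during this round
done
kill "$aplay_pid"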
The error message is:
===========================>>
[ 1843.893004] kernel: BUG: workqueue lockup - pool cpus=0-7 node=0 flags=0x4 nice=0 stuck for 54s!
<<===========================
An input/output error had already occurred in the 4th loop:
2024-07-20 16:15:50 UTC Sub-Test: [REMOTE_INFO] ===== Round(4/100) =====
2024-07-20 16:15:50 UTC Sub-Test: [REMOTE_COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Sat Jul 20 16:15:56 2024
2024-07-20 16:15:56 UTC Sub-Test: [REMOTE_COMMAND] sleep for 5
aplay: pcm_write:2127: write error: Input/output error
The kernel lockup then happened in the 25th loop:
2024-07-20 16:20:07 UTC Sub-Test: [REMOTE_INFO] ===== Round(25/100) =====
2024-07-20 16:20:07 UTC Sub-Test: [REMOTE_COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Sat Jul 20 16:20:13 2024
2024-07-20 16:21:58 UTC Sub-Test: [REMOTE_COMMAND] sleep for 5
2024-07-20 16:22:06 UTC Sub-Test: [REMOTE_INFO] Check for the kernel log status
declare -- cmd="journalctl_cmd --since=@1721492406"
journalctl_cmd is a function
journalctl_cmd ()
{
sudo journalctl -k -q --no-pager --utc --output=short-monotonic --no-hostname "$@"
}
2024-07-20 16:22:06 UTC [ERROR] Caught kernel log error
===========================>>
[ 1843.893004] kernel: BUG: workqueue lockup - pool cpus=0-7 node=0 flags=0x4 nice=0 stuck for 54s!
<<===========================
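The log check above is scoped to the current round by passing --since=@<epoch> to the journalctl_cmd helper shown in the output; the following reproduces that scan by hand (the grep patterns are an assumption, sof-test's actual error patterns are not shown here):

since=1721492406    # epoch boundary taken from the failing round above
sudo journalctl -k -q --no-pager --utc --output=short-monotonic --no-hostname --since=@"${since}" \
    | grep -E 'BUG:|Oops|Call Trace'    # error patterns are an assumption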
I don't remember seeing this error recently; the reproduction rate is likely very low.
Intel Internal test result link: planresultdetail/44120?model=LNLM_RVP_HDA&testcase=check-suspend-resume-with-playback-100
dmesg is not available, @fredoh9?
Right, dmesg was not captured in this case. I'm also looking into why logs are not collected in some other failure cases.
I will use journalctl and upload the equivalent dmesg output when I have access to the device.
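For reference, a dmesg-equivalent log can be pulled from journalctl along these lines (the boot index and output file name here are assumptions):

sudo journalctl --list-boots                                                                  # find the boot that contains the failure
sudo journalctl -k -b -1 --no-pager --no-hostname -o short-precise > dmesg_equivalent.txt    # -1 = previous boot; adjust as needed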
dmesg for the failing boot from ba-lnlm-rvp-hda-02: dmesg_20240720_stress_test.txt
Jul 20 16:21:58.904948 kernel: snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-lnl 0000:00:1f.3: Current DSP power state: D3
Jul 20 16:21:58.905276 kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 7 -> 0
Jul 20 16:21:58.905612 kernel: ACPI: EC: interrupt blocked
Jul 20 16:21:58.905671 kernel: ACPI: EC: interrupt unblocked
Jul 20 16:21:58.905713 kernel: BUG: workqueue lockup - pool cpus=0-7 node=0 flags=0x4 nice=0 stuck for 54s!
Jul 20 16:21:58.905922 kernel: Showing busy workqueues and worker pools:
Jul 20 16:21:58.952292 kernel: workqueue events: flags=0x0
Jul 20 16:21:58.952375 kernel: pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
Jul 20 16:21:58.952417 kernel: pending: vmstat_shepherd
Jul 20 16:21:58.952457 kernel: workqueue events_unbound: flags=0x2
Jul 20 16:21:58.952496 kernel: pwq 33: cpus=0-7 node=0 flags=0x4 nice=0 active=1 refcnt=2
Jul 20 16:21:58.952542 kernel: pending: crng_reseed
Not seen in recent weeks of LNL daily tests, closing.