[BUG] capture process died after suspend-resume with capture on TGLU_VOLT_SDW_ZEPHYR
Describe the bug CI observed the capture process prematurely died after suspend-resume with capture on TGLU_VOLT_SDW_ZEPHYR.
To Reproduce
TPLG=/lib/firmware/intel/sof-tplg/sof-tgl-sdw-max98373-rt5682.tplg MODEL=TGLU_VOLT_SDW_ZEPHYR ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture
Reproduction Rate twice in CI, hard to reproduce manually.
Environment
- Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
- Kernel: topic/sof-dev | https://github.com/thesofproject/linux/commit/cb13f81e00b4
- SOF: main | https://github.com/thesofproject/sof/commit/4ab225b5a7c7
- Name of the topology file
- Topology: sof-tgl-sdw-max98373-rt5682.tplg
- Name of the platform(s) on which the bug is observed.
- Platform: TGLU_VOLT_SDW_ZEPHYR
Screenshots or console output
2022-11-06 23:51:20 UTC Sub-Test: [REMOTE_COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Sun Nov 6 23:51:26 2022
2022-11-06 23:51:26 UTC Sub-Test: [REMOTE_COMMAND] sleep for 5
2022-11-06 23:51:31 UTC Sub-Test: [REMOTE_INFO] Check for the kernel log status
declare -- cmd="journalctl_cmd --since=@1667778679"
2022-11-06 23:51:32 UTC Sub-Test: [REMOTE_INFO] Check for wakeup_count and suspend_stats
/sys/power/suspend_stats/fail:0
/sys/power/suspend_stats/failed_freeze:0
/sys/power/suspend_stats/failed_prepare:0
/sys/power/suspend_stats/failed_resume:0
/sys/power/suspend_stats/failed_resume_early:0
/sys/power/suspend_stats/failed_resume_noirq:0
/sys/power/suspend_stats/failed_suspend:0
/sys/power/suspend_stats/failed_suspend_late:0
/sys/power/suspend_stats/failed_suspend_noirq:0
/sys/power/suspend_stats/last_failed_dev:
/sys/power/suspend_stats/last_failed_errno:0
/sys/power/suspend_stats/last_failed_step:
/sys/power/suspend_stats/success:115
name active_count event_count wakeup_count expire_count active_since total_time max_time last_change prevent_suspend_time
--- /home/ubuntu/sof-test/logs/check-suspend-resume-with-audio/2022-11-06-23:48:23-21536/initial_wakeup_sources.txt 2022-11-06 23:50:32.113555022 +0000
+++ /sys/kernel/debug/wakeup_sources 2022-11-06 21:29:35.016000000 +0000
@@ -14,7 +14,7 @@
0000:00:14.0 0 0 0 0 0 0 0 0 0
0000:00:0d.0 0 0 0 0 0 0 0 0 0
alarmtimer.0.auto 0 0 0 0 0 0 0 0 0
-00:03 117 220 0 0 0 6 0 7866726 0
+00:03 124 230 0 0 0 7 0 7904405 0
serio0 0 0 0 0 0 0 0 0 0
PNP0C0A:00 0 0 0 0 0 0 0 0 0
AC 1 1 0 0 0 0 0 774 0
2022-11-06 23:51:32 UTC Sub-Test: [REMOTE_WARNING] /sys/power/wakeup_count is 125, expected 124
declare -- cmd="journalctl_cmd --since=@1667778679"
2022-11-06 23:51:32 UTC Sub-Test: [REMOTE_INFO] Test Result: PASS!
declare -- cmd="journalctl_cmd --since=@1667778628"
process: 126897, status: not found
2022-11-06 23:51:33 UTC [REMOTE_ERROR] process status is abnormal
2022-11-06 23:51:33 UTC [REMOTE_INFO] dump ps for aplay & arecord
ubuntu 127438 125737 0 23:51 ? 00:00:00 grep -E aplay|arecord
2022-11-06 23:51:33 UTC [REMOTE_INFO] dump ps for child process
UID PID PPID C STIME TTY TIME CMD
ubuntu 125804 125737 0 23:48 ? 00:00:00 /bin/bash /home/ubuntu/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture
ubuntu 127440 125737 0 23:51 ? 00:00:00 ps --ppid 125737 -f
2022-11-06 23:51:33 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2022-11-06 23:51:33 UTC [REMOTE_ERROR] main() @ /home/ubuntu/sof-test/test-case/check-suspend-resume-with-audio.sh
IPC3 + Zephyr issue, this is going to be difficult to debug given the focus on IPC4, even more so if it's not reproducible manually.
We seem to have similar issue with ipc4+zephyr+native_drivers in tgl. Like dw dma would not start after wake up. This can be regression in sof or zephyr. @miRoox is this something that has been working before with this setup?
Observed a similar issue on ADLP_RVP_SDW_ZEPHYR when testing suspend-resume with playback. Test ID:17914. Console output:
2022-11-20 23:47:51 UTC Sub-Test: [REMOTE_COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: wakeup from "mem" using /dev/rtc0 at Sun Nov 20 23:47:57 2022
2022-11-20 23:47:58 UTC Sub-Test: [REMOTE_COMMAND] sleep for 5
aplay: pcm_write:2127: write error: Input/output error
2022-11-20 23:48:03 UTC Sub-Test: [REMOTE_INFO] Check for the kernel log status
declare -- cmd="journalctl_cmd --since=@1668988070"
2022-11-20 23:48:04 UTC Sub-Test: [REMOTE_INFO] Check for wakeup_count and suspend_stats
declare -- cmd="journalctl_cmd --since=@1668988070"
2022-11-20 23:48:04 UTC Sub-Test: [REMOTE_INFO] Test Result: PASS!
declare -- cmd="journalctl_cmd --since=@1668988020"
process: 130430, status: not found
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/125/gvfs
Output information may be incomplete.
2022-11-20 23:48:05 UTC [REMOTE_ERROR] process status is abnormal
2022-11-20 23:48:05 UTC [REMOTE_INFO] dump ps for aplay & arecord
ubuntu 131108 129667 0 23:48 ? 00:00:00 grep -E aplay|arecord
2022-11-20 23:48:05 UTC [REMOTE_INFO] dump ps for child process
UID PID PPID C STIME TTY TIME CMD
ubuntu 129733 129667 0 23:45 ? 00:00:00 /bin/bash /home/ubuntu/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m playback
ubuntu 131110 129667 0 23:48 ? 00:00:00 ps --ppid 129667 -f
2022-11-20 23:48:05 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2022-11-20 23:48:05 UTC [REMOTE_ERROR] main() @ /home/ubuntu/sof-test/test-case/check-suspend-resume-with-audio.sh
We have aligned to stop testing IPC3 on sof-main branch.
This issue also happened on IPC4 platforms, reopening it. https://sof-ci.sh.intel.com/#/result/planresultdetail/19368?model=TGLU_RVP_SDW_IPC4ZPH&testcase=check-suspend-resume-with-playback-5
found same issue on ADLP_BRYA_SDW from daily test with stable-v2.2
planresultdetail/20167?model=ADLP_BRYA_SDW&testcase=check-suspend-resume-with-playback-5
Now this is 100% failure on ADLP_BRYA_SDW with stable-v2.2
Good news: the best way to find and eventually fix race conditions is to make them "worst" = reproducible first.
EDIT: labels updated based on more recent and much wider reproductions.
@bardliao noted that this issue happens on ADLP_BRYA_SDW with GCC 10.3.0 used https://sof-ci.ostc.intel.com/#/result/planresultdetail/20167?model=ADLP_BRYA_SDW&testcase=verify-kernel-boot-log
@ujfalusi @marc-hb @fredoh9 another problem with this compiler, similar to https://github.com/thesofproject/linux/pull/4203
Not an isolated incident really. What the compiler does to suspend-resume is not clear to me.
Pretty much every race condition I ever looked at is affected by compiler changes.
Pretty much every race condition I ever looked at is affected by compiler changes.
Yes, it could be a race condition.
@fredoh9 do we still see this issue with ADLP_BRYA_SDW? I must admit not tracking this issue or forgetting about it.
closing since it's almost been a year without any updates