linux icon indicating copy to clipboard operation
linux copied to clipboard

[BUG] capture process died after suspend-resume with capture on TGLU_VOLT_SDW_ZEPHYR

Open miRoox opened this issue 3 years ago • 3 comments

Describe the bug CI observed the capture process prematurely died after suspend-resume with capture on TGLU_VOLT_SDW_ZEPHYR.

To Reproduce

TPLG=/lib/firmware/intel/sof-tplg/sof-tgl-sdw-max98373-rt5682.tplg MODEL=TGLU_VOLT_SDW_ZEPHYR ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture

Reproduction Rate twice in CI, hard to reproduce manually.

Environment

  1. Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
    • Kernel: topic/sof-dev | https://github.com/thesofproject/linux/commit/cb13f81e00b4
    • SOF: main | https://github.com/thesofproject/sof/commit/4ab225b5a7c7
  2. Name of the topology file
    • Topology: sof-tgl-sdw-max98373-rt5682.tplg
  3. Name of the platform(s) on which the bug is observed.
    • Platform: TGLU_VOLT_SDW_ZEPHYR

Screenshots or console output

2022-11-06 23:51:20 UTC Sub-Test: [REMOTE_COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Sun Nov  6 23:51:26 2022
2022-11-06 23:51:26 UTC Sub-Test: [REMOTE_COMMAND] sleep for 5
2022-11-06 23:51:31 UTC Sub-Test: [REMOTE_INFO] Check for the kernel log status
declare -- cmd="journalctl_cmd --since=@1667778679"
2022-11-06 23:51:32 UTC Sub-Test: [REMOTE_INFO] Check for wakeup_count and suspend_stats

/sys/power/suspend_stats/fail:0
/sys/power/suspend_stats/failed_freeze:0
/sys/power/suspend_stats/failed_prepare:0
/sys/power/suspend_stats/failed_resume:0
/sys/power/suspend_stats/failed_resume_early:0
/sys/power/suspend_stats/failed_resume_noirq:0
/sys/power/suspend_stats/failed_suspend:0
/sys/power/suspend_stats/failed_suspend_late:0
/sys/power/suspend_stats/failed_suspend_noirq:0
/sys/power/suspend_stats/last_failed_dev:
/sys/power/suspend_stats/last_failed_errno:0
/sys/power/suspend_stats/last_failed_step:
/sys/power/suspend_stats/success:115

name		active_count	event_count	wakeup_count	expire_count	active_since	total_time	max_time	last_change	prevent_suspend_time
--- /home/ubuntu/sof-test/logs/check-suspend-resume-with-audio/2022-11-06-23:48:23-21536/initial_wakeup_sources.txt	2022-11-06 23:50:32.113555022 +0000
+++ /sys/kernel/debug/wakeup_sources	2022-11-06 21:29:35.016000000 +0000
@@ -14,7 +14,7 @@
 0000:00:14.0	0		0		0		0		0		0		0		0		0
 0000:00:0d.0	0		0		0		0		0		0		0		0		0
 alarmtimer.0.auto	0		0		0		0		0		0		0		0		0
-00:03       	117		220		0		0		0		6		0		7866726		0
+00:03       	124		230		0		0		0		7		0		7904405		0
 serio0      	0		0		0		0		0		0		0		0		0
 PNP0C0A:00  	0		0		0		0		0		0		0		0		0
 AC          	1		1		0		0		0		0		0		774		0

2022-11-06 23:51:32 UTC Sub-Test: [REMOTE_WARNING] /sys/power/wakeup_count is 125, expected 124

declare -- cmd="journalctl_cmd --since=@1667778679"
2022-11-06 23:51:32 UTC Sub-Test: [REMOTE_INFO] Test Result: PASS!
declare -- cmd="journalctl_cmd --since=@1667778628"
process: 126897, status: not found
2022-11-06 23:51:33 UTC [REMOTE_ERROR] process status is abnormal
2022-11-06 23:51:33 UTC [REMOTE_INFO] dump ps for aplay & arecord
ubuntu    127438  125737  0 23:51 ?        00:00:00 grep -E aplay|arecord
2022-11-06 23:51:33 UTC [REMOTE_INFO] dump ps for child process
UID          PID    PPID  C STIME TTY          TIME CMD
ubuntu    125804  125737  0 23:48 ?        00:00:00 /bin/bash /home/ubuntu/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture
ubuntu    127440  125737  0 23:51 ?        00:00:00 ps --ppid 125737 -f
2022-11-06 23:51:33 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2022-11-06 23:51:33 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/check-suspend-resume-with-audio.sh

console_log.txt dmesg.txt slogger.txt

miRoox avatar Nov 08 '22 05:11 miRoox

IPC3 + Zephyr issue, this is going to be difficult to debug given the focus on IPC4, even more so if it's not reproducible manually.

plbossart avatar Nov 10 '22 17:11 plbossart

We seem to have similar issue with ipc4+zephyr+native_drivers in tgl. Like dw dma would not start after wake up. This can be regression in sof or zephyr. @miRoox is this something that has been working before with this setup?

juimonen avatar Nov 10 '22 19:11 juimonen

Observed a similar issue on ADLP_RVP_SDW_ZEPHYR when testing suspend-resume with playback. Test ID:17914. Console output:

2022-11-20 23:47:51 UTC Sub-Test: [REMOTE_COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: wakeup from "mem" using /dev/rtc0 at Sun Nov 20 23:47:57 2022
2022-11-20 23:47:58 UTC Sub-Test: [REMOTE_COMMAND] sleep for 5
aplay: pcm_write:2127: write error: Input/output error
2022-11-20 23:48:03 UTC Sub-Test: [REMOTE_INFO] Check for the kernel log status
declare -- cmd="journalctl_cmd --since=@1668988070"
2022-11-20 23:48:04 UTC Sub-Test: [REMOTE_INFO] Check for wakeup_count and suspend_stats
declare -- cmd="journalctl_cmd --since=@1668988070"
2022-11-20 23:48:04 UTC Sub-Test: [REMOTE_INFO] Test Result: PASS!
declare -- cmd="journalctl_cmd --since=@1668988020"
process: 130430, status: not found
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/125/gvfs
      Output information may be incomplete.
2022-11-20 23:48:05 UTC [REMOTE_ERROR] process status is abnormal
2022-11-20 23:48:05 UTC [REMOTE_INFO] dump ps for aplay & arecord
ubuntu    131108  129667  0 23:48 ?        00:00:00 grep -E aplay|arecord
2022-11-20 23:48:05 UTC [REMOTE_INFO] dump ps for child process
UID          PID    PPID  C STIME TTY          TIME CMD
ubuntu    129733  129667  0 23:45 ?        00:00:00 /bin/bash /home/ubuntu/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m playback
ubuntu    131110  129667  0 23:48 ?        00:00:00 ps --ppid 129667 -f
2022-11-20 23:48:05 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2022-11-20 23:48:05 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/check-suspend-resume-with-audio.sh

keqiaozhang avatar Nov 21 '22 01:11 keqiaozhang

We have aligned to stop testing IPC3 on sof-main branch.

keqiaozhang avatar Dec 19 '22 02:12 keqiaozhang

This issue also happened on IPC4 platforms, reopening it. https://sof-ci.sh.intel.com/#/result/planresultdetail/19368?model=TGLU_RVP_SDW_IPC4ZPH&testcase=check-suspend-resume-with-playback-5

keqiaozhang avatar Dec 22 '22 01:12 keqiaozhang

found same issue on ADLP_BRYA_SDW from daily test with stable-v2.2

planresultdetail/20167?model=ADLP_BRYA_SDW&testcase=check-suspend-resume-with-playback-5

fredoh9 avatar Jan 25 '23 22:01 fredoh9

Now this is 100% failure on ADLP_BRYA_SDW with stable-v2.2

fredoh9 avatar Feb 04 '23 00:02 fredoh9

Good news: the best way to find and eventually fix race conditions is to make them "worst" = reproducible first.

EDIT: labels updated based on more recent and much wider reproductions.

marc-hb avatar Feb 06 '23 21:02 marc-hb

@bardliao noted that this issue happens on ADLP_BRYA_SDW with GCC 10.3.0 used https://sof-ci.ostc.intel.com/#/result/planresultdetail/20167?model=ADLP_BRYA_SDW&testcase=verify-kernel-boot-log

@ujfalusi @marc-hb @fredoh9 another problem with this compiler, similar to https://github.com/thesofproject/linux/pull/4203

Not an isolated incident really. What the compiler does to suspend-resume is not clear to me.

plbossart avatar Feb 22 '23 02:02 plbossart

Pretty much every race condition I ever looked at is affected by compiler changes.

marc-hb avatar Feb 23 '23 01:02 marc-hb

Pretty much every race condition I ever looked at is affected by compiler changes.

Yes, it could be a race condition.

bardliao avatar Feb 23 '23 06:02 bardliao

@fredoh9 do we still see this issue with ADLP_BRYA_SDW? I must admit not tracking this issue or forgetting about it.

plbossart avatar Apr 17 '23 23:04 plbossart

closing since it's almost been a year without any updates

plbossart avatar Apr 04 '24 19:04 plbossart