[MTL][SDW] rmmod stuck on kmod-load-unload tests
I've seen repeated issues with the kmod-load-unload not completing on MTL Dell SKU 0CC7
Updating the firmware didn't help. I can still log-in remotely but the device has to be rebooted for audio tests
starting test_kmod-load-unload
+ test_kmod-load-unload
+ /root/sof-test/test-case/check-kmod-load-unload.sh -l 1
2024-05-24 17:26:54 UTC Sub-Test: [INFO] ===== Starting iteration 1 of 1 =====
2024-05-24 17:26:54 UTC Sub-Test: [INFO] wait dsp power status to become suspended
2024-05-24 17:26:55 UTC Sub-Test: [INFO] run kmod/sof-kmod-remove.sh
WARNING: running as root is not supported
Specified filename /sys/kernel/debug/sof/trace does not exist.
SKIP snd_usb_audio not loaded
SKIP snd_hda_intel not loaded
SKIP snd_sof_pci_intel_tng not loaded
SKIP snd_sof_pci_intel_skl not loaded
SKIP snd_sof_pci_intel_apl not loaded
SKIP snd_sof_pci_intel_tgl not loaded
SKIP snd_sof_pci_intel_icl not loaded
SKIP snd_sof_pci_intel_cnl not loaded
SKIP snd_sof_pci_intel_lnl not loaded
RMMOD snd_sof_pci_intel_mtl
[ 427.318607] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 427.422240] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[ 427.422258] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 427.422264] Message payload: 00000000: 00000001 00000000
[ 614.731740] INFO: task rmmod:7293 blocked for more than 122 seconds.
[ 614.731760] Not tainted 6.9.0-rc5-test-07960-ga15f31d16f50 #205
[ 614.731764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 614.731767] task:rmmod state:D stack:13384 pid:7293 tgid:7293 ppid:7292 flags:0x00004002
[ 614.731780] Call Trace:
[ 614.731784] <TASK>
[ 614.731792] __schedule+0x385/0xb00
[ 614.731812] schedule+0x3a/0x130
[ 614.731817] schedule_timeout+0x131/0x140
[ 614.731828] __wait_for_common+0xb9/0x1d0
[ 614.731833] ? __pfx_schedule_timeout+0x10/0x10
[ 614.731841] remove_one+0xe8/0x110
[ 614.731850] simple_recursive_removal+0x1e0/0x280
[ 614.731856] ? __pfx_remove_one+0x10/0x10
[ 614.731862] debugfs_remove+0x44/0x70
[ 614.731872] snd_sof_device_remove+0xfb/0x180 [snd_sof]
[ 614.731904] sof_pci_remove+0x1d/0x50 [snd_sof_pci]
[ 614.731913] pci_device_remove+0x3f/0xb0
[ 614.731922] device_release_driver_internal+0x1a9/0x210
[ 614.731931] driver_detach+0x4b/0x90
[ 614.731936] bus_remove_driver+0x70/0xf0
[ 614.731941] pci_unregister_driver+0x3f/0x90
[ 614.731948] __do_sys_delete_module+0x1e0/0x2c0
[ 614.731961] do_syscall_64+0x9d/0x1a0
[ 614.731967] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 614.731975] RIP: 0033:0x7f7c360577cb
[ 614.731980] RSP: 002b:00007ffc0a379158 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 614.731985] RAX: ffffffffffffffda RBX: 000055a9a1e8f700 RCX: 00007f7c360577cb
[ 614.731988] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a9a1e8f768
[ 614.731990] RBP: 00007ffc0a379180 R08: 1999999999999999 R09: 0000000000000000
[ 614.731992] R10: 00007f7c360c8ac0 R11: 0000000000000206 R12: 0000000000000000
[ 614.731994] R13: 00007ffc0a3793e0 R14: 000055a9a1e8f700 R15: 0000000000000000
[ 614.732004] </TASK>
[ 614.732009]
Showing all locks held in the system:
[ 614.732022] 1 lock held by khungtaskd/154:
[ 614.732025] #0: ffffffff89d6a0c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x36/0x120
[ 614.732101] 1 lock held by dmesg/3479:
[ 614.732104] #0: ffff93f70174a0e0 (&user->lock){....}-{3:3}, at: devkmsg_read+0x6c/0x230
[ 614.732117] 2 locks held by rmmod/7293:
[ 614.732120] #0: ffff93f7029251b0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x3d/0x210
[ 614.732130] #1: ffff93f70655beb8 (&sb->s_type->i_mutex_key#2){....}-{3:3}, at: simple_recursive_removal+0x15f/0x280
[ 614.732144] =============================================
[ 737.608739] INFO: task rmmod:7293 blocked for more than 245 seconds.
[ 737.608753] Not tainted 6.9.0-rc5-test-07960-ga15f31d16f50 #205
[ 737.608757] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 737.608760] task:rmmod state:D stack:13384 pid:7293 tgid:7293 ppid:7292 flags:0x00004002
[ 737.608770] Call Trace:
[ 737.608773] <TASK>
[ 737.608779] __schedule+0x385/0xb00
[ 737.608795] schedule+0x3a/0x130
[ 737.608800] schedule_timeout+0x131/0x140
[ 737.608811] __wait_for_common+0xb9/0x1d0
[ 737.608816] ? __pfx_schedule_timeout+0x10/0x10
[ 737.608824] remove_one+0xe8/0x110
[ 737.608831] simple_recursive_removal+0x1e0/0x280
[ 737.608836] ? __pfx_remove_one+0x10/0x10
[ 737.608843] debugfs_remove+0x44/0x70
[ 737.608850] snd_sof_device_remove+0xfb/0x180 [snd_sof]
[ 737.608878] sof_pci_remove+0x1d/0x50 [snd_sof_pci]
[ 737.608886] pci_device_remove+0x3f/0xb0
[ 737.608894] device_release_driver_internal+0x1a9/0x210
[ 737.608902] driver_detach+0x4b/0x90
[ 737.608907] bus_remove_driver+0x70/0xf0
[ 737.608912] pci_unregister_driver+0x3f/0x90
[ 737.608918] __do_sys_delete_module+0x1e0/0x2c0
[ 737.608930] do_syscall_64+0x9d/0x1a0
[ 737.608936] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 737.608943] RIP: 0033:0x7f7c360577cb
[ 737.608947] RSP: 002b:00007ffc0a379158 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 737.608952] RAX: ffffffffffffffda RBX: 000055a9a1e8f700 RCX: 00007f7c360577cb
[ 737.608954] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a9a1e8f768
[ 737.608957] RBP: 00007ffc0a379180 R08: 1999999999999999 R09: 0000000000000000
[ 737.608959] R10: 00007f7c360c8ac0 R11: 0000000000000206 R12: 0000000000000000
[ 737.608961] R13: 00007ffc0a3793e0 R14: 000055a9a1e8f700 R15: 0000000000000000
[ 737.608970] </TASK>
[ 737.608975]
Showing all locks held in the system:
[ 737.608986] 1 lock held by khungtaskd/154:
[ 737.608988] #0: ffffffff89d6a0c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x36/0x120
[ 737.609019] 1 lock held by dmesg/3479:
[ 737.609021] #0: ffff93f70174a0e0 (&user->lock){....}-{3:3}, at: devkmsg_read+0x6c/0x230
[ 737.609032] 2 locks held by rmmod/7293:
[ 737.609035] #0: ffff93f7029251b0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x3d/0x210
[ 737.609045] #1: ffff93f70655beb8 (&sb->s_type->i_mutex_key#2){....}-{3:3}, at: simple_recursive_removal+0x15f/0x280
[ 737.609059] =============================================
reproduced with kernel topic/sof-dev d99d9a0ab917 ALSA: hda: intel-sdw-acpi: use acpi_get_local_u64_address() SOF main : 737d4d41fbaa app: add winconsole_overlay.conf
TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg ~/sof-test/test-case/run-all-tests.sh -l1
The minimal sequence to make the problem appear is this in test-case/run-all-tests.sh
testlist=" multiple-pause-resume kmod-load-unload "
Without the multiple-pause-resume no errors are reported.
this also happens when I remove everything except 'Jack Out"
starting test_multiple-pause-resume
+ test_multiple-pause-resume
+ /root/sof-test/test-case/multiple-pause-resume.sh -l 1 -r 25
2024-05-24 20:18:56 UTC Sub-Test: [INFO] /root/sof-test/test-case/multiple-pause-resume.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg to run the test case
2024-05-24 20:18:56 UTC Sub-Test: [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital,HDMI1,HDMI2,HDMI3,Deepbuffer Jack Out,Speaker,Amp feedback,Jack In' in test case
2024-05-24 20:18:56 UTC Sub-Test: [INFO] Run command to get pipeline parameters
2024-05-24 20:18:56 UTC Sub-Test: [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg -f 'type:any & ~pcm:Amplifier Reference' -b ' pcm:HDA Digital,HDMI1,HDMI2,HDMI3,Deepbuffer Jack Out,Speaker,Amp feedback,Jack In' -s 0 -e
2024-05-24 20:18:56 UTC Sub-Test: [INFO] Starting /usr/local/bin/mtrace-reader.py >& /root/sof-test/logs/multiple-pause-resume/2024-05-24-16:18:56-14946/mtrace.txt &
2024-05-24 20:18:57 UTC Sub-Test: [WARNING] pipeline count is 1, don't need to run this case
----------
----------
starting test_kmod-load-unload
+ test_kmod-load-unload
+ /root/sof-test/test-case/check-kmod-load-unload.sh -l 1
2024-05-24 20:19:00 UTC Sub-Test: [INFO] ===== Starting iteration 1 of 1 =====
2024-05-24 20:19:00 UTC Sub-Test: [INFO] wait dsp power status to become suspended
2024-05-24 20:19:00 UTC Sub-Test: [INFO] run kmod/sof-kmod-remove.sh
WARNING: running as root is not supported
Specified filename /sys/kernel/debug/sof/trace does not exist.
SKIP snd_usb_audio not loaded
SKIP snd_hda_intel not loaded
SKIP snd_sof_pci_intel_tng not loaded
SKIP snd_sof_pci_intel_skl not loaded
SKIP snd_sof_pci_intel_apl not loaded
SKIP snd_sof_pci_intel_tgl not loaded
SKIP snd_sof_pci_intel_icl not loaded
SKIP snd_sof_pci_intel_cnl not loaded
SKIP snd_sof_pci_intel_lnl not loaded
RMMOD snd_sof_pci_intel_mtl
looks like a test script problem maybe. I can't reproduce this with the simpler sequence
sudo TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg ~/sof-test/test-case/multiple-pause-resume.sh -l 1 -r 25
~/sof-test/test-case/check-kmod-load-unload.sh -l 2
maybe it's the debugfs stuff that we read now that causes a problem while removing debugfs. Or we have an open debugfs file.
Looks like no one is using run-all-tests.sh?
Just saw this as well on LNL
starting test_kmod-load-unload
+ test_kmod-load-unload
+ /home/ubuntu/sof-test/test-case/check-kmod-load-unload.sh -l 1
2024-06-19 08:24:57 UTC Sub-Test: [INFO] ===== Starting iteration 1 of 1 =====
2024-06-19 08:24:57 UTC Sub-Test: [INFO] wait dsp power status to become suspended
2024-06-19 08:24:57 UTC Sub-Test: [INFO] run kmod/sof-kmod-remove.sh
Specified filename /sys/kernel/debug/sof/trace does not exist.
RMMOD snd_usb_audio
SKIP snd_hda_intel not loaded
SKIP snd_sof_pci_intel_tng not loaded
SKIP snd_sof_pci_intel_skl not loaded
SKIP snd_sof_pci_intel_apl not loaded
SKIP snd_sof_pci_intel_tgl not loaded
SKIP snd_sof_pci_intel_icl not loaded
SKIP snd_sof_pci_intel_cnl not loaded
RMMOD snd_sof_pci_intel_lnl
[ 277.387525] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 277.539501] snd_soc_rt711_sdca:rt711_sdca_calibration: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_calibration calibration complete, ret=0
[ 277.541983] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[ 277.542024] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 277.542037] Message payload: 00000000: 00000001 00000000
[ 277.547478] snd_soc_rt711_sdca:rt711_sdca_jack_init: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_init enable
[ 277.547489] snd_soc_rt711_sdca:rt711_sdca_io_init: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_io_init hw_init complete
[ 277.547493] soundwire_bus:sdw_handle_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling initialization completion for Slave 6
[ 277.552306] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
[ 277.565812] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[ 277.568162] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
[ 277.669458] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[ 280.639869] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-3: clock stop prepare done slave:15
[ 280.639910] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-1: clock stop prepare done slave:15
[ 280.640037] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-2: clock stop prepare done slave:15
[ 281.306958] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-0: clock stop prepare done slave:15
The pause-resume tests had many implementation issues, I just rewrote it in https://github.com/thesofproject/sof-test/pull/1218 Can you try again?
It's not surprising that kmod-load-unload could hang if the pause-resume test left something bad behind. Examples:
- https://github.com/thesofproject/linux/issues/5048
- https://github.com/thesofproject/linux/issues/3766
- etc.
CI is generally more robust that run-all-tests.sh so it's less likely to leave something bad behind. Not impossible but less likely in CI.
@plbossart can we close after https://github.com/thesofproject/sof-test/pull/1218 ?
I haven't seen this problem in a very very long time, closing