Module unload failure following IPC timeout
After the kernel sees an IPC timeout from the firmware, the snd_sof_pci_intel_mtl module hangs while trying to unload, leaving the kernel in an unrecoverable state that requires a reboot to resume audio. (See #8638 for an easy recipe for causing a timeout.)
This is the script I cooked up to get the full module stack reloaded with correct dependency ordering (at least in the kernel I'm using). It runs fine when not in an error state, but after the IPC failure it only gets as far as the MTL module then hangs.
#!/bin/sh
rmmod snd_soc_sof_rt5682
rmmod snd_soc_rt5645
rmmod snd_soc_hdac_hdmi
rmmod snd_soc_intel_hda_dsp_common
rmmod snd_soc_intel_sof_maxim_common
rmmod snd_soc_intel_sof_realtek_common
rmmod snd_soc_intel_sof_ssp_common
rmmod snd_soc_rt5682
rmmod snd_sof_probes
rmmod snd_soc_rl6231
rmmod snd_hda_codec_hdmi
rmmod snd_soc_dmic
rmmod snd_sof_pci_intel_mtl
rmmod snd_sof_intel_hda_common
rmmod snd_sof_intel_hda
rmmod soundwire_intel
rmmod soundwire_generic_allocation
rmmod snd_sof_intel_hda_mlink
rmmod soundwire_cadence
rmmod snd_sof_pci
rmmod snd_sof_xtensa_dsp
rmmod snd_soc_hdac_hda
rmmod snd_soc_acpi_intel_match
rmmod snd_soc_acpi
rmmod snd_hda_ext_core
rmmod snd_sof
rmmod snd_sof_utils
rmmod soundwire_bus
rmmod snd_intel_dspcfg
rmmod snd_intel_sdw_acpi
rmmod snd_hda_codec
rmmod snd_hwdep
rmmod snd_hda_core
rmmod snd_soc_rt5682s
rmmod snd_soc_max98357a
modprobe snd_soc_rt5682s
modprobe snd_soc_hdac_hdmi
modprobe snd_soc_max98357a
modprobe snd_sof_pci_intel_mtl
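Rather than relying only on a hard-coded order, one way to see which modules are currently removable is to read the use counts in /proc/modules. The following is a minimal sketch, assuming the usual /proc/modules layout (name, size, use count, users, state, address). The function name `unloadable_modules` and the default `snd_` prefix are made up for illustration and are not part of any existing tool.

```shell
#!/bin/sh
# Print modules whose use count is zero, i.e. candidates for the next rmmod.
#   $1: file in /proc/modules format (defaults to /proc/modules)
#   $2: module name prefix to filter on (defaults to "snd_", an assumption)
unloadable_modules() {
    modules_file="${1:-/proc/modules}"
    prefix="${2:-snd_}"
    # Field 1 is the module name, field 3 is the use count.
    awk -v p="$prefix" 'index($1, p) == 1 && $3 == 0 { print $1 }' "$modules_file"
}
```

Calling `unloadable_modules` with no arguments lists the `snd_` modules that nothing else depends on at that moment; `unloadable_modules /proc/modules soundwire_` does the same for the SoundWire modules. A module that shows a zero use count here but still hangs in rmmod is the case this issue describes.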
The immediate impact to me is just debugging speed (it's really annoying to wait for a reboot). But in general this kind of "module with no dependencies won't unload" issue is accompanied by more serious things like dangling pointers or memory leaks. Needs attention at reasonably high priority.
Sorry, that issue is in a different project. See the SOF bug 8638 for a reproduction recipe: https://github.com/thesofproject/sof/issues/8638
@andyross Is this still an issue, or was this specific to https://github.com/thesofproject/sof/issues/8638 (given that the fix for that ended up on the Linux kernel side)? I agree this is a mechanism that should work, and basically we depend on it in our CI. We use the scripts at https://github.com/thesofproject/sof-test/tree/main/tools/kmod to unload/reload with all dependencies sorted out.
Of course, mileage may vary depending on how badly the DSP fails, but in the typical case, the module reload will work.
If this still occurs, can you share kernel logs from a case where it fails? And can you double-check with "lsof" that no user-space entity is holding on to driver resources? Having mtrace-reader.py running will also block kernel module unload. But you have probably already checked these.
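If lsof is not installed on the test image, the same check can be approximated by walking /proc. This is only a sketch: the function name `sound_device_holders` is hypothetical, the second parameter exists purely so the function can be pointed at a fake tree, and it only sees file descriptors of processes the caller is allowed to inspect (so run it as root).

```shell
#!/bin/sh
# Print "PID TARGET" for every open file descriptor that points under a
# given path prefix.
#   $1: path prefix to look for (defaults to /dev/snd/)
#   $2: proc root (defaults to /proc)
sound_device_holders() {
    dev_prefix="${1:-/dev/snd/}"
    proc_root="${2:-/proc}"
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Skip entries that vanished or that we are not allowed to read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            "$dev_prefix"*)
                pid=${fd#"$proc_root"/}
                pid=${pid%%/*}
                echo "$pid $target"
                ;;
        esac
    done
}
```

An empty result from `sound_device_holders` suggests that no process has an ALSA device node open. A tool like mtrace-reader.py reads from debugfs rather than /dev/snd, so catching it would need a different prefix as the first argument.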
I literally just validated, and indeed the fix referenced (which, it should be noted, was merged before the bug report; my image was about two weeks stale) fixes the DSP PM management and unblocks this. I can reload successfully now.
But this is a separate issue. The proximate cause is the DSP hang due to a kernel bug, but it could have been anything. You can imagine the DSP deliberately doing this (checking for an IDC handling a comp_free for a DP component) and then just arch_irq_lock();while(1);. To the kernel this would look identical, and it would be stuck and unable to recover audio without a reboot. The kernel needs to be able to bounce the DSP and recover state in all circumstances.
To be clear: this isn't currently a ChromeOS recovery method, but it might be. Fixing it isn't high priority, but we should at least get to some kind of affirmative analysis that says this is benign (i.e. that the only symptom is a hung rmmod and that there isn't a crash bug in there somewhere due to the removed dependencies).
Ack, that scenario should work. In fact, I can confirm that we regularly have such cases where the DSP panics in a CI run, and the Linux kernel does recover. We have some debug options that interfere with this a bit (e.g. with CONFIG_SND_SOC_SOF_DEBUG_RETAIN_DSP_CONTEXT, the DSP is left powered on after a crash. You can remove the SOF kernel modules, but a runtime PM ref is leaked on purpose, so runtime suspend will no longer work even if you reload).
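To see whether runtime suspend still works after a reload, the device's runtime PM state can be read from sysfs. A minimal sketch, assuming the kernel exposes `power/runtime_status` for the audio PCI device; the function name `runtime_pm_status` is hypothetical, and the PCI address in the usage note is only a placeholder.

```shell
#!/bin/sh
# Print the runtime PM status (e.g. "active" or "suspended") of a device.
#   $1: sysfs directory of the device
# Prints "unknown" and returns 1 if the status file is not readable.
runtime_pm_status() {
    status_file="$1/power/runtime_status"
    if [ -r "$status_file" ]; then
        cat "$status_file"
    else
        echo "unknown"
        return 1
    fi
}
```

For example, `runtime_pm_status /sys/bus/pci/devices/0000:00:1f.3` (substitute the address of the actual audio device). With no audio stream open, a device that stays "active" indefinitely would be consistent with the leaked runtime PM reference described above.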
So in short, if this still happens, this is a valid bug and affects the SOF CI as well.
@andyross @kv2019i FYI we listed recovery as a needed capability back in 2018 https://github.com/thesofproject/linux/issues/452 and again in 2020 https://github.com/thesofproject/linux/issues/1675
It's been done before on legacy Intel drivers, but we don't have a signaling mechanism (heartbeat or something) at the firmware level, nor detection/recovery on the host side. And we'd also need to signal a reset to userspace.
I guess once we have the IMR context save we'll probably need something anyway. For now we reset the state when going back to D0, but that will no longer be true with MTL+.
@andyross should we close this issue?
no information, closing