debugfs: dumping some items in debugfs will crash the system
What I have observed is that on CML/CNL platforms, running the commands below will crash the whole system:
- echo on > /sys/bus/pci/devices/0000:00:1f.3/power/control
- cd /sys/kernel/debug/sof
- hexdump hda (or dsp)
There might be an out-of-bounds memory access; we need to figure this out, since these files are there for debugging purposes.
@keyonjie I'm also able to reproduce this on my APL board when dumping /sys/kernel/debug/sof/dsp
root@gr-mrb:/sys/kernel/debug/sof# hexdump dsp
0000000 0000 0000 0202 0101 0001 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
*
0000050 0003
In my case the hang seems to occur here: https://github.com/thesofproject/linux/blob/topic/sof-dev/sound/soc/sof/debug.c#L303 immediately after pos goes higher than 32768.
I don't see any bounds checks at all; shouldn't there be some? Only a finite region of memory belongs to the DSP, so dumping more than that doesn't seem like a good idea. Having /sys/kernel/debug/sof/dsp be a glorified /dev/mem is probably not the intention here. The read also returns exactly as many bytes as were requested, even if fewer are available.
(disregard this comment if I missed some bounds check in an upper layer)
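To make the point above concrete, here is a minimal userspace sketch of the kind of clamp I would expect in a read handler. It is not the code in debug.c: `struct dbg_region` and `bounded_read` are hypothetical names, and a plain memory buffer stands in for the mapped BAR.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for one debugfs entry: a base pointer and its size. */
struct dbg_region {
	const unsigned char *base;
	size_t size;
};

/*
 * Copy at most `count` bytes starting at offset `*pos`, never reading
 * past the end of the region. Returns the number of bytes actually
 * copied, 0 once `*pos` has reached the end of the region, or -1 on
 * invalid arguments.
 */
static ssize_t bounded_read(const struct dbg_region *r, unsigned char *dst,
			    size_t count, size_t *pos)
{
	size_t avail;

	if (!r || !dst || !pos)
		return -1;

	/* Nothing left to read: report end of file instead of reading on. */
	if (*pos >= r->size)
		return 0;

	/* Clamp the request to what is left in the region. */
	avail = r->size - *pos;
	if (count > avail)
		count = avail;

	memcpy(dst, r->base + *pos, count);
	*pos += count;

	return (ssize_t)count;
}
```

With an 8-byte region and `*pos` at 6, a 16-byte request returns 2 and the next call returns 0, so a tool like hexdump stops at the end of the region instead of walking into whatever follows it.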
There's definitely a size that's used for initialization of debugfs items, so either a) the checks are not correct or b) the sizes are incorrect in the first place.
@mengdonglin can we assign someone to this one? This looks like a really bad problem.
@keyonjie @lgirdwood do you know what the 'dsp' BAR debugfs size might be on CNL?
I find that with the following hack there's no crash:
static const struct snd_sof_debugfs_map cnl_dsp_debugfs[] = {
	{"hda", HDA_DSP_HDA_BAR, 0, 0x4000, SOF_DEBUGFS_ACCESS_ALWAYS},
	{"pp", HDA_DSP_PP_BAR, 0, 0x1000, SOF_DEBUGFS_ACCESS_ALWAYS},
	// {"dsp", HDA_DSP_BAR, 0, 0x10000, SOF_DEBUGFS_ACCESS_ALWAYS},
	{"dsp", HDA_DSP_BAR, 0, 0x1000, SOF_DEBUGFS_ACCESS_ALWAYS},
};
@plbossart it looks to be even larger than 0x10000 from the programming reference, e.g. the SDW IP registers are located in 0x30000~0x6FFFF.
From the result, I guess that accessing some slimbus, ANC, LP GPDMA, or DMIC registers is what leads to the crash.
@keyonjie I am starting to wonder if this has to do with the register ownership. I am not sure what happens if you try to access a register owned by the DSP, e.g. the LP GPDMA.
@plbossart Yes, I have the same feeling. Previously I observed that we get all 0xffffffff if the registers are not readable, but I'm not sure whether reading without holding ownership will crash the DSP or even Linux. Hi @lbetlej, do you know anything about this?
@plbossart the issue is still there; maybe this can be covered by a security check? @libinyang @RanderWang FYI.
@keyonjie can someone paste the kernel oops? It could be a data abort, i.e. the physical bus address does not exist (or, as already mentioned, is owned by the DSP). Looks like @plbossart has the fix though.
@lgirdwood the whole OS panics when this happens, so you can't see any log anymore; a hardware reboot is the only thing you can do.
@plbossart I would really appreciate it if you already have a fix; as the assignee I can't do anything about it at the moment.
Fix is to make the BAR smaller on applicable platforms.
No I don't have a fix. I asked what the size of the memory was and didn't get an answer.
https://github.com/thesofproject/linux/issues/1296#issuecomment-641534874