[FEATURE][Zephyr] IPC should be sent at OS panic
Is your feature request related to a problem? Please describe. In XTOS builds, the OS panic handler calls platform_panic(), which on all SOF platforms sends an IPC that is picked up by the Linux SOF driver, which in turn prints a DSP panic dump to the kernel log. This functionality is missing from Zephyr builds, so the Linux driver prints no DSP panic dump.
This should also invoke the GDB stub at some point, in RO and polling mode.
In Zephyr, k_sys_fatal_error_handler() can be overridden for this purpose.
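A minimal host-side sketch of what such an override could look like. Only the handler name and its (reason, esf) shape come from Zephyr. The esf stand-in type, PANIC_CODE_BASE and ipc_send_panic_notification() are hypothetical names for illustration, not existing SOF or Zephyr symbols.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Zephyr's exception stack frame type (assumption: on a real
 * build the type comes from the Zephyr kernel headers). */
struct esf_stub {
	uint32_t pc;
};

/* Hypothetical panic code prefix, not a real SOF value. */
#define PANIC_CODE_BASE 0x0dead000u

/* Records what would have been sent to the host. */
static uint32_t last_ipc_sent;

/* Hypothetical helper. On hardware this would write the IPC registers and
 * poll for completion, since interrupts may be unusable after a fatal error. */
static void ipc_send_panic_notification(uint32_t code)
{
	last_ipc_sent = code;
}

/* Same shape as Zephyr's overridable fatal error handler. */
void k_sys_fatal_error_handler(unsigned int reason, const struct esf_stub *esf)
{
	(void)esf; /* register state is not used in this sketch */

	ipc_send_panic_notification(PANIC_CODE_BASE | (reason & 0xfffu));

	/* A real handler would halt or spin here; omitted so the sketch
	 * can run on a host. */
}
```

Where this code should live in the FW is a separate question that the sketch does not answer.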
Moving to v2.6. The kernel-side solution in https://github.com/thesofproject/linux/pull/4186 has proven to cover the most urgent need for this feature (the ability to get the FW logs just before a crash). We still want to implement the IPC at panic, but it is less urgent now.
Development is on hold as I've been busy with other tasks. Publishing my working code in case someone else can continue this work -> https://github.com/kv2019i/sof/commit/cdcc190161a4f83ecc84866644cf20f7fd06dda7 This does work (the IPC is sent successfully and received by the host), but where to put the code in the FW requires more thinking and work.
@RanderWang @mengdonglin fyi - needs done.
I will work on this feature based on current state.
@kv2019i In my opinion, this feature is for the IPC4 Zephyr build, so we need to align with the IPC4 spec. We need to build a CoreExceptionRecord in debug memory window 2, slot ADSP_DW_SLOT_TELEMETRY, like the mtrace feature, which uses slot id ADSP_DW_SLOT_DEBUG_LOG. The CoreExceptionRecord includes EPCx, exccause, excvaddr and excsave. Do we need to support this feature in the Zephyr domain or in our SOF FW? Do you have any advice? Thanks!
@RanderWang This feature covers just the IPC. Storing the dump to a telemetry window and reading it from the kernel requires separate work (much like mtrace: a Zephyr backend for the coredump, plus kernel support to read it).
But already with just the IPC sent, we have a chance to read any remaining data in the mtrace buffer (also in SRAM) and log the exact time of the crash in the kernel driver. Later, once the kernel gets the ability to read the coredump via the debug windows, that code can be hooked to this IPC.
Thanks for your reply by email and here! It makes life easier for me. I will focus on the IPC message and the kernel driver side.
@plbossart any requirements from your end for @RanderWang here ?
Update: I managed to trigger a FW panic with the ref FW, and the Linux kernel can catch the IPC message for the FW panic. The kernel side will first be developed against the ref FW, since it is ready and cSOF is not ready yet. The major issue: the DSP call stack is built from the 64 AR DSP registers (windowed registers in Xtensa), and the kernel can dump the DSP call stack as hex numbers to the kernel log. We need a tool to extract the stack info from the kernel log and decode it with the FW ELF binary. Windows dumps the call stack info directly into a file and uses a tool to process that file with the help of the FW ELF file.
@RanderWang merged now? Can we close?
@lgirdwood not ready yet, since the Zephyr part is not merged.
@RanderWang (FYI @lgirdwood) I think you can also break this into two and submit the "IPC sending at panic" part separately. It's already useful to get the IPC about a FW crash, even if the coredump cannot be read from the memory window. The kernel will be able to handle this -> if no coredump/telemetry window is discovered at FW boot, then coredumps cannot be fetched -> the kernel just prints a "FW crashed" message and can print a dump of the registers and state it sees. This would be more reliable than the "IPC timeout" based dump we have now in the Linux driver.
Then a second step is the coredump support (implementation in Zephyr plus in the Linux driver). This enhancement item was originally intended to cover only the IPC part (not coredump!).
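The host-side decision described above could be sketched roughly like this. It is a plain C illustration, not Linux driver code: struct fw_boot_info, enum panic_action and handle_fw_panic_ipc() are hypothetical names, and the longer log message is made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* What the host decides to do when the panic IPC arrives. */
enum panic_action {
	PANIC_LOG_ONLY,
	PANIC_LOG_AND_FETCH_COREDUMP,
};

/* Hypothetical summary of what was discovered at FW boot. */
struct fw_boot_info {
	bool telemetry_window_found;
};

/* Pick the action and format the log line into msg. */
static enum panic_action handle_fw_panic_ipc(const struct fw_boot_info *info,
					     char *msg, size_t len)
{
	if (info->telemetry_window_found) {
		/* Step two: a coredump can be fetched from the window. */
		snprintf(msg, len, "FW crashed, coredump available");
		return PANIC_LOG_AND_FETCH_COREDUMP;
	}

	/* Step one only: report the crash and dump visible state. */
	snprintf(msg, len, "FW crashed");
	return PANIC_LOG_ONLY;
}
```

With this split, the IPC part is useful on its own, and the coredump path only adds a second branch later.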
OK, will keep at v2.6 for now, but will move to v2.7 if needed.
@kv2019i Thanks! Please check https://github.com/thesofproject/sof/pull/7597
PR merged, closing as done.