Suspend and dump all CPU registers from assert
Summary
This PR first tidies up the assert handler by moving the dump logic into standalone functions. It then adds the OSINIT_PANIC state to indicate that a fatal error has occurred; any further assert in this state triggers a direct board reset. Lastly, smp_call is used to notify all other CPUs about the assert so that each saves its registers to g_last_regs, allowing the main CPU to log them.
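The flow described above can be sketched roughly as follows. This is a simplified, host-side simulation and not the code in this PR: `handle_assert`, `notify_other_cpus`, `save_regs`, `g_snapshot`, `STATE_PANIC`, and the `NCPUS`/`NREGS` sizes are hypothetical stand-ins for the real assert handler, the `smp_call` notification, `g_last_regs`, and `OSINIT_PANIC`. The "other CPUs" are plain function calls here, whereas on real hardware they would be cross-CPU calls.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NCPUS 2   /* hypothetical CPU count for this sketch */
#define NREGS 4   /* hypothetical; a real register context is larger */

/* Stand-in for the OS init state; only the panic state matters here. */
enum os_state { STATE_RUNNING, STATE_PANIC };

static enum os_state g_state = STATE_RUNNING;

/* Fake "live" registers per CPU (illustrative values only). */
static uint64_t g_live_regs[NCPUS][NREGS] =
{
  { 0x10, 0x11, 0x12, 0x13 },
  { 0x20, 0x21, 0x22, 0x23 },
};

/* Snapshot area that plays the role of g_last_regs. */
static uint64_t g_snapshot[NCPUS][NREGS];

/* Counts simulated board resets instead of really resetting. */
static int g_reset_count;

/* Copy one CPU's live registers into the shared snapshot area. */
static void save_regs(int cpu)
{
  assert(cpu >= 0 && cpu < NCPUS);
  memcpy(g_snapshot[cpu], g_live_regs[cpu], sizeof(g_snapshot[cpu]));
}

/* Stand-in for the cross-CPU notification: every CPU except the
 * asserting one saves its registers.
 */
static void notify_other_cpus(int self)
{
  for (int cpu = 0; cpu < NCPUS; cpu++)
    {
      if (cpu != self)
        {
          save_regs(cpu);
        }
    }
}

/* The asserting CPU prints the snapshot of the given CPU. */
static void dump_regs(int self, int cpu)
{
  for (int i = 0; i < NREGS; i++)
    {
      printf("[CPU%d] cpu%d r%d: 0x%llx\n", self, cpu, i,
             (unsigned long long)g_snapshot[cpu][i]);
    }
}

/* Returns 0 after a full dump, -1 if a reset was requested because
 * the system was already in the panic state.
 */
static int handle_assert(int self)
{
  if (g_state == STATE_PANIC)
    {
      g_reset_count++;          /* second assert: reset directly */
      return -1;
    }

  g_state = STATE_PANIC;        /* mark the fatal error first */
  save_regs(self);
  notify_other_cpus(self);

  for (int cpu = 0; cpu < NCPUS; cpu++)
    {
      dump_regs(self, cpu);
    }

  return 0;
}
```

Calling `handle_assert(0)` once dumps the snapshots of both simulated CPUs; a second call from any CPU only increments the reset counter, which mirrors the "further asserts trigger a direct reset" behavior.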
Impact
Previously, a crash dump included register information only for the main CPU. Now the register state of the other CPUs can be inspected as well.
Testing
Tested on arm64 QEMU and internal projects. Requires PR https://github.com/apache/nuttx/pull/13737 to function properly.
- Build
    cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -Bbuild -GNinja -DBOARD_CONFIG=boards/arm64/qemu/qemu-armv8a/configs/nsh_smp nuttx
- Run
    qemu-system-aarch64 -smp 4 -cpu cortex-a53 -semihosting -nographic -machine virt,virtualization=on,gic-version=3 -net none -chardev stdio,id=con,mux=on -serial chardev:con -mon chardev=con,mode=readline -kernel build/nuttx -s
- Execute `mw -1` to trigger the crash
nsh> mw -1
[CPU0] dump_assert_info: Current Version: NuttX 12.6.0 fe316efb09 Sep 29 2024 22:22:35 arm64
[CPU0] dump_assert_info: Assertion failed panic: at file: /arch/arm64/src/common/arm64_fatal.c:561 task(CPU0): nsh_main process: nsh_main 0x40294520
[CPU0] up_dump_register: stack = 0x403fb880
[CPU0] up_dump_register: x0: 0xffffffffffffffff x1: 0x0
[CPU0] up_dump_register: x2: 0xffffffffffffffff x3: 0xffffffd0
[CPU0] up_dump_register: x4: 0x4029abd4 x5: 0xaaa820a8a8aaaaa8
[CPU0] up_dump_register: x6: 0x8011 x7: 0x3aff766cfefefefe
[CPU0] up_dump_register: x8: 0x7f7f7f7fffffffff x9: 0x7fffffffffffffff
[CPU0] up_dump_register: x10: 0x101010101010101 x11: 0x30
[CPU0] up_dump_register: x12: 0x3b x13: 0x0
[CPU0] up_dump_register: x14: 0x7 x15: 0x1300000000000000
[CPU0] up_dump_register: x16: 0x0 x17: 0xc
[CPU0] up_dump_register: x18: 0x0 x19: 0x403fbf90
[CPU0] up_dump_register: x20: 0x0 x21: 0x0
[CPU0] up_dump_register: x22: 0x1 x23: 0xffffffffffffffff
[CPU0] up_dump_register: x24: 0xffffffffffffffff x25: 0x403baee6
[CPU0] up_dump_register: x26: 0x0 x27: 0x403baed8
[CPU0] up_dump_register: x28: 0x403fc297 x29: 0x403fb9a0
[CPU0] up_dump_register: x30: 0x402a1258
[CPU0] up_dump_register:
[CPU0] up_dump_register: STATUS Registers:
[CPU0] up_dump_register: SPSR: 0x20000005
[CPU0] up_dump_register: ELR: 0x402a12d8
[CPU0] up_dump_register: SP_EL0: 0x0
[CPU0] up_dump_register: SP_ELX: 0x403fb9a0
[CPU0] up_dump_register: EXE_DEPTH: 0x100000001
[CPU0] dump_stacks: ERROR: Stack pointer is not within the stack
[CPU0] dump_stack: IRQ Stack:
[CPU0] dump_stack: base: 0x403e7000
[CPU0] dump_stack: size: 00008192
[CPU0] stack_dump: 0x403e8e50: 00000000403e8ec0 000000004028f7e4 00000000403c54b8 00000000403c54b8 ...
[CPU0] dump_stack: User Stack:
[CPU0] dump_stack: base: 0x403f7f60
[CPU0] dump_stack: size: 00016336
[CPU0] stack_dump: 0x403fb220: 00000000403fb230 0000000040296914 00000000403fb240 00000000402ac170 00000000403fb270 00000000402985d4 00000000403fb4d8 0000000000000020
...
[CPU0] sched_dumpstack: backtrace| 3: 0x00000000402a12d8 0x000000004029d43c 0x000000004029c924 0x000000004029ca10 0x000000004029afd8 0x000000004029aa7c 0x0000000040294570 0x00000000402980c4
[CPU0] sched_dumpstack: backtrace| 3: 0x0000000040292c7c
[CPU0] dump_fatal_info: Dump CPU1: PAUSED
[CPU0] up_dump_register: stack = 0x403d27c0
[CPU0] up_dump_register: x0: 0x0 x1: 0x80000001
[CPU0] up_dump_register: x2: 0x80d0100 x3: 0x8
[CPU0] up_dump_register: x4: 0xffffffff x5: 0x0
[CPU0] up_dump_register: x6: 0x0 x7: 0x0
[CPU0] up_dump_register: x8: 0x0 x9: 0x0
[CPU0] up_dump_register: x10: 0x0 x11: 0x0
[CPU0] up_dump_register: x12: 0x0 x13: 0x16
[CPU0] up_dump_register: x14: 0x0 x15: 0x9000000
[CPU0] up_dump_register: x16: 0x0 x17: 0x0
[CPU0] up_dump_register: x18: 0x0 x19: 0x0
[CPU0] up_dump_register: x20: 0x0 x21: 0x0
[CPU0] up_dump_register: x22: 0x0 x23: 0x4028008c
[CPU0] up_dump_register: x24: 0x403f3000 x25: 0x4028131c
[CPU0] up_dump_register: x26: 0x0 x27: 0x0
[CPU0] up_dump_register: x28: 0x0 x29: 0x403f2fe0
[CPU0] up_dump_register: x30: 0x4028d3a4
[CPU0] up_dump_register:
[CPU0] up_dump_register: STATUS Registers:
[CPU0] up_dump_register: SPSR: 0x80000245
[CPU0] up_dump_register: ELR: 0x40294b60
[CPU0] up_dump_register: SP_EL0: 0x403f3000
[CPU0] up_dump_register: SP_ELX: 0x403f2fe0
[CPU0] up_dump_register: EXE_DEPTH: 0x1
[CPU0] dump_stacks: ERROR: Stack pointer is not within the stack
[CPU0] dump_stack: IRQ Stack:
[CPU0] dump_stack: base: 0x403e9000
[CPU0] dump_stack: size: 00008192
[CPU0] stack_dump: 0x403eaf30: 00000000403eaf40 000000004028f01c 00000000403eafa0 000000004028d5d0 00000000403f2ec0 00000000403d1680 0000000000000000 0000000000000000
...
[CPU0] dump_stack: User Stack:
[CPU0] dump_stack: base: 0x403ef010
[CPU0] dump_stack: size: 00016368
[CPU0] stack_dump: 0x403f2e80: 00000000403f2fe0 0000000040281b4c 0000000000000000 deaddeaddeaddead 00000000403fba90 0000000040281b84 0000000000000000 0000000040281bf8
...
[CPU0] sched_dumpstack: backtrace| 1: 0x0000000040294b60 0x0000000040281360
[CPU0] dump_tasks: PID GROUP CPU PRI POLICY TYPE NPX STATE EVENT SIGMASK STACKBASE STACKSIZE USED FILLED COMMAND
[CPU0] dump_tasks: ---- --- 0 --- -------- ------- --- ------- ---------- ---------------- 0x403e7000 8192 432 5.2% irq
[CPU0] dump_tasks: ---- --- 1 --- -------- ------- --- ------- ---------- ---------------- 0x403e9000 8192 208 2.5% irq
[CPU0] dump_task: 0 0 0 0 FIFO Kthread - Assigned 0000000000000000 0x403eb010 16368 1104 6.7% CPU0 IDLE
[CPU0] dump_task: 1 0 1 0 FIFO Kthread - Running 0000000000000000 0x403ef010 16368 384 2.3% CPU1 IDLE
[CPU0] dump_task: 2 0 0 192 RR Kthread - Waiting Semaphore 0000000000000000 0x403f38c0 16304 1088 6.6% hpwork 0x403c55f0 0x403c5638
[CPU0] dump_task: 3 3 0 100 RR Task - Running 0000000000000000 0x403f7f60 16336 3568 21.8% nsh_main
[CPU0] sched_dumpstack: backtrace| 0: 0x0000000040294b60 0x0000000040281294
[CPU0] sched_dumpstack: backtrace| 1: 0x0000000040294b60 0x0000000040281360
[CPU0] sched_dumpstack: backtrace| 2: 0x000000004028f4d8 0x000000004028f500 0x0000000040290904 0x0000000040292c68
[CPU0] sched_dumpstack: backtrace| 3: 0x00000000402a12d8 0x000000004029d43c 0x000000004029c924 0x000000004029ca10 0x000000004029afd8 0x000000004029aa7c 0x0000000040294570 0x00000000402980c4
[CPU0] sched_dumpstack: backtrace| 3: 0x0000000040292c7c
[Experimental Bot, please feedback here]
This PR summary doesn't fully meet the requirements. Here's why and how to improve it:
What's Missing/Needs Improvement
- Summary:
  - Why is this change necessary? The summary mentions tidying up and adding functionality, but doesn't clearly explain the problem the PR solves. Is it about improving debugging? Providing more comprehensive crash information? Be specific.
  - What functional part is changed? The summary mentions assert handling but could benefit from being more precise. Which files/modules are directly affected?
  - How does the change work (at a high level)? The explanation is a bit too deep in the code. Briefly describe the mechanism without going into specific functions.
- Impact:
  - New feature or existing feature change? State this explicitly (it seems like a change to an existing feature).
  - Impact on the user? While you state that register information is available for other CPUs, you should clarify if this is just for developers or if it has any user-facing implications.
  - Other sections: The remaining impact sections are not filled out. Consider each point carefully, even if the answer is "NO", and provide a brief justification for clarity.
- Testing:
  - Build Hosts: You mention testing with ARM64 QEMU. Please provide your host OS, CPU architecture, and compiler details for completeness.
  - Targets: Mention "qemu-armv8a" specifically under arch/board.
  - Testing logs: The provided logs are good. However, include a brief snippet of the logs before the change to highlight the difference your PR makes.
How to Improve the PR Summary
## Summary
This PR enhances the NuttX assert handler to provide more comprehensive crash information during development, particularly in SMP systems.
Currently, register dumps are only captured for crashes on the main CPU. This PR addresses this limitation by:
1. **Refactoring:** Moves assert dump logic into standalone functions for better organization.
2. **OSINIT_PANIC State:** Introduces a new state to indicate a fatal error, triggering a board reset upon subsequent asserts and preventing cascading failures.
3. **SMP Awareness:** Utilizes `smp_call` to notify all CPUs about the assert. Each CPU saves its registers, which are then logged by the main CPU, providing a system-wide snapshot at the time of failure.
This change primarily affects the common assert handling path (the functions shown in the log as `dump_assert_info` and `dump_fatal_info`). `arch/arm64/src/common/arm64_fatal.c` appears in the log only as the location of the triggering panic.
## Impact
* **Existing Feature Change:** Improves the assert handler for better debugging.
* **Impact on User:** Minimal impact. This change is primarily beneficial for developers during debugging. Users should not experience any behavioral changes.
* **Impact on Build:** No changes to the build process are expected.
* **Impact on Hardware:** No specific hardware changes are introduced.
* **Impact on Documentation:** Documentation updates may be beneficial to explain the new OSINIT_PANIC state and the expanded register dump information. This PR does not include these updates.
* **Impact on Security:** No security implications are anticipated.
* **Impact on Compatibility:** This change is backward compatible.
* **Anything else to consider:** None.
## Testing
I confirm that changes are verified on a local setup and work as intended:
* **Build Host:** [Your OS], [Your CPU (e.g., x86_64)], GCC [Version]
* **Target(s):** arch(arm64), board:qemu-armv8a:nsh_smp
**Testing logs before change:**
```bash
[CPU0] ... (log messages showing only CPU0 registers are dumped) ...
```
**Testing logs after change:**
```bash
[CPU0] dump_assert_info: Current Version: NuttX 12.6.0 fe316efb09 Sep 29 2024 22:22:35 arm64
[CPU0] ...(Existing logs)...
[CPU0] dump_fatal_info: Dump CPU1: PAUSED
[CPU0] up_dump_register: stack = 0x403d27c0
[CPU0] ... (register dump for CPU1) ...
```
By providing this level of detail, your PR will be much clearer and easier for reviewers to understand and approve.