nuttx icon indicating copy to clipboard operation
nuttx copied to clipboard

Added memory health checks.

Open fjpanag opened this issue 2 years ago • 8 comments

Summary

The PR #5368 removed the memory checks from the idle thread, due to issues with priority inheritance.

This is an attempt to restore these checks, this time running in a worker.

If this feature is enabled, a worker will be started during system boot, that performs the following checks periodically:

  • Checks the stacks of all threads for any possible overflow.
  • Checks the heap for any corruption.

Impact

None on existing systems.

Testing

I did a quick test on a custom board running on an STM32F427 MCU.
The check is started and executed as expected.


Note, although this code is tested to be working, it is marked as a draft.

I am not very sure about the correct structure of the code, and if it adheres to NuttX standards. Please review it, and provide me with some feedback.

Mostly, is it correct to create this new dir mm_check, or does the code belong anywhere else?
Is the worker started at the correct place?
Any naming issues?
Etc...

fjpanag avatar Jul 16 '22 14:07 fjpanag

@fjpanag use the check here should be enough: https://github.com/apache/incubator-nuttx/blob/master/sched/irq/irq_dispatch.c#L177-L183

xiaoxiang781216 avatar Jul 16 '22 16:07 xiaoxiang781216

Hi @fjpanag, you forgot to comment this feature shouldn't be used on production. Suggestion: keep it dependent on CONFIG_DEBUG_MM

acassis avatar Jul 16 '22 20:07 acassis

@fjpanag use the check here should be enough: https://github.com/apache/incubator-nuttx/blob/master/sched/irq/irq_dispatch.c#L177-L183

I wasn't aware that this check is also present there.

Shall I drop this PR?

fjpanag avatar Jul 16 '22 21:07 fjpanag

I wasn't aware that this check is also present there.

Shall I drop this PR?

The check in irq_dispatch() only calls kmm_checkforcorruption() on IRQ exits to check heaps for TCBs marked with TCB_FLAG_MEM_CHECK (and there are no mentions of the flag whatsoever in the master codebase). It also build-depends on CONFIG_DEBUG_MM, as @acassis suggests for this checker as well.

I see the PR as still useful because it also checks stacks, and it doesn't require setting a tcb flag. Yes, nsh> ps can show stack usage in interactive shell, though it requires STACK_COLORATION and provides virtually a very slow sample rate, as does stackmonitor. However, it might conflict / do double work with irq_dispatch. The good thing is we're decoupling memory checks from IDLE thread which used to cause problems in #5266, and that issue/PR mentioned something about moving checks to LPWORK thread (half a year ago).

@fjpanag How many processes/threads (and pthreads) were running on your STM32F427 board? Did you enable DEBUG_MM as well? Can you somehow verify that there won't be deadlock problems with semaphores of mm? Was there any external (FMC SDRAM) memory attached to system heaps?

ALTracer avatar Jul 17 '22 19:07 ALTracer

I wasn't aware that this check is also present there. Shall I drop this PR?

The check in irq_dispatch() only calls kmm_checkforcorruption() on IRQ exits to check heaps for TCBs marked with TCB_FLAG_MEM_CHECK (and there are no mentions of the flag whatsoever in the master codebase). It also build-depends on CONFIG_DEBUG_MM, as @acassis suggests for this checker as well.

We can refine how to trigger the check. But irq_dispatch is the best place to get the reliable result since if kmm_checkforcorruption report the error, we can ascertain that the interrupted thread corrupt the memory. On the another hand, even LP work detect the memory corruption, it's still hard to identify which thread corrupt the memory.

I see the PR as still useful because it also checks stacks, and it doesn't require setting a tcb flag. Yes, nsh> ps can show stack usage in interactive shell, though it requires STACK_COLORATION and provides virtually a very slow sample rate, as does stackmonitor. However, it might conflict / do double work with irq_dispatch. The good thing is we're decoupling memory checks from IDLE thread which used to cause problems in #5266, and that issue/PR mentioned something about moving checks to LPWORK thread (half a year ago).

It's better to enable STACK_CANARIES, ARMV8M_STACKCHECK_HARDWARE or ARMV[7|8]M_STACKCHECK, since they can report the stack overflow immediately.

@fjpanag How many processes/threads (and pthreads) were running on your STM32F427 board? Did you enable DEBUG_MM as well? Can you somehow verify that there won't be deadlock problems with semaphores of mm? Was there any external (FMC SDRAM) memory attached to system heaps?

xiaoxiang781216 avatar Jul 18 '22 02:07 xiaoxiang781216

We can refine how to trigger the check. But irq_dispatch is the best place to get the reliable result since if kmm_checkforcorruption report the error, we can ascertain that the interrupted thread corrupt the memory. On the another hand, even LP work detect the memory corruption, it's still hard to identify which thread corrupt the memory.

Okay, can someone launch SEGGER SystemView and verify the board spends a reasonable amount of time checking all the heaps on every context switch? What if a heap is located in slow external SDRAM? LPWORK thread, on the other hand, runs once a second or slower. I'd change the Kconfig setting from seconds to milliseconds, though. (Or microseconds, to be consistent with other apps) And there's always KAsan, which is designed to catch heap access errors with its guard pages.

It's better to enable STACK_CANARIES, ARMV8M_STACKCHECK_HARDWARE or ARMV[7|8]M_STACKCHECK, since they can report the stack overflow immediately.

Yes, on armv8-m stackcheck_hardware is essentially free. However, on armv6/7-m it reserves a register (r10), requires userspace be compiled with additional CFLAGS, and checks stacks on function exits (not on context switch, which might be more often) IIRC. Again, that doesn't let us set a safety margin of e.g. 90%, like in this PR.

ALTracer avatar Jul 18 '22 09:07 ALTracer

We can refine how to trigger the check. But irq_dispatch is the best place to get the reliable result since if kmm_checkforcorruption report the error, we can ascertain that the interrupted thread corrupt the memory. On the another hand, even LP work detect the memory corruption, it's still hard to identify which thread corrupt the memory.

Okay, can someone launch SEGGER SystemView and verify the board spends a reasonable amount of time checking all the heaps on every context switch? What if a heap is located in slow external SDRAM?

It isn't fast as expect, that's why we add the per TCB filter capability. One approach is do the check when the next thread is idle which give the same check as before.

LPWORK thread, on the other hand, runs once a second or slower. I'd change the Kconfig setting from seconds to milliseconds, though. (Or microseconds, to be consistent with other apps) And there's always KAsan, which is designed to catch heap access errors with its guard pages.

If you like, the same policy can be inserted in irq_dispatch.

It's better to enable STACK_CANARIES, ARMV8M_STACKCHECK_HARDWARE or ARMV[7|8]M_STACKCHECK, since they can report the stack overflow immediately.

Yes, on armv8-m stackcheck_hardware is essentially free. However, on armv6/7-m it reserves a register (r10), requires userspace be compiled with additional CFLAGS, and checks stacks on function exits (not on context switch, which might be more often) IIRC. Again, that doesn't let us set a safety margin of e.g. 90%, like in this PR.

You can try STACK_CANARIES for arm6/7-m which is fast and lightweight: https://lwn.net/Articles/584225/ https://www.redhat.com/en/blog/security-technologies-stack-smashing-protection-stackguard

xiaoxiang781216 avatar Jul 18 '22 09:07 xiaoxiang781216

@fjpanag @ALTracer https://github.com/apache/incubator-nuttx/pull/6724 provide the method to enable the heap check from nsh. So, you can do:

  1. Check the heap corruption issue by: a. KSan https://github.com/apache/incubator-nuttx/blob/master/mm/Kconfig#L181-L187 b. Per-thread heap check(https://github.com/apache/incubator-nuttx/pull/6724) by: echo 1 > /proc/xxx/heapcheck c. Track the allocator(pid and backtrace) by MM_BACKTRACE: https://github.com/apache/incubator-nuttx/blob/master/mm/Kconfig#L189-L200 and memdump command from: https://github.com/apache/incubator-nuttx-apps/blob/master/nshlib/nsh_mmcmds.c#L53-L64
  2. Check the stack corruption issue by: a. The stack coloration: https://github.com/apache/incubator-nuttx/blob/master/Kconfig#L1869-L1878 b. The stack canaries: https://github.com/apache/incubator-nuttx/blob/master/Kconfig#L1880-L1892 c. The stack hw/sw check: https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/armv8-m/Kconfig#L108-L146

The infrastructure is enough to cover the most debug case.

xiaoxiang781216 avatar Jul 31 '22 14:07 xiaoxiang781216

@fjpanag do you want to improve this patch?

xiaoxiang781216 avatar Dec 02 '22 15:12 xiaoxiang781216

@fjpanag do you want to improve this patch?

Yes yes, I will.

I am struggling to find some free time, first for the TCP connections, and then for this.

I will try my best to finish this soon.

fjpanag avatar Dec 03 '22 13:12 fjpanag