nuttx icon indicating copy to clipboard operation
nuttx copied to clipboard

boards/esp32: Increase init task stack size to 3072

Open tmedicci opened this issue 11 months ago • 10 comments

Summary

  • boards/esp32: Increase init task stack size to 3072

This is done to avoid casual stack overflows. Especially after #11777, that increased stack usage.

Impact

Avoid stack overflow.

Testing

Internal CI testing + ESP32-DevKitC V4

tmedicci avatar Mar 05 '24 13:03 tmedicci

@tmedicci do you know how much PR #11777 increased the stack usage? Just looking that PR was impossible to know this side effect. I think we need some CI script to compare and show how much each PR are impacting in flash/RAM size and stack usage. @anchao any idea why your PR is increasing the stack?

acassis avatar Mar 05 '24 13:03 acassis

could we hold 2KB default stack size? or we have to increase the default task stack size, instead updating every defconfig.

xiaoxiang781216 avatar Mar 05 '24 14:03 xiaoxiang781216

if there is no hard requirement please leave the default stack size 2k

jerpelea avatar Mar 05 '24 14:03 jerpelea

@tmedicci do you know how much PR #11777 increased the stack usage? Just looking that PR was impossible to know this side effect. I think we need some CI script to compare and show how much each PR are impacting in flash/RAM size and stack usage. @anchao any idea why your PR is increasing the stack?

I didn't measure it directly, but it seems to be very reasonable that it would have been increased by the size of struct task_info_s.

The problem itself appears when running the built-in ps command in the NuttShell.

could we hold 2KB default stack size? or we have to increase the default task stack size, instead updating every defconfig.

Usually, 2KB is sufficient for most of the applications. The problem is specific to the init task, which is the nsh and its built-in commands. I prefer not to increase the default stack size...

tmedicci avatar Mar 05 '24 14:03 tmedicci

@tmedicci @xiaoxiang781216 if ps is failing after that PR then for some reason is increased the stack size. Espressif has an internal CI that does some hardware tests. I think having a low stack value defined in the config is useful to catch issues like this.

acassis avatar Mar 05 '24 16:03 acassis

@tmedicci do you know how much PR #11777 increased the stack usage? Just looking that PR was impossible to know this side effect. I think we need some CI script to compare and show how much each PR are impacting in flash/RAM size and stack usage. @anchao any idea why your PR is increasing the stack?

I didn't measure it directly, but it seems to be very reasonable that it would have been increased by the size of struct task_info_s.

The problem itself appears when running the built-in ps command in the NuttShell.

could we hold 2KB default stack size? or we have to increase the default task stack size, instead updating every defconfig.

Usually, 2KB is sufficient for most of the applications. The problem is specific to the init task, which is the nsh and its built-in commands. I prefer not to increase the default stack size...

If so, it's better to change the default stack size for nsh, the default value should be suitable for the common usage. Newbie will surprise that ps in nsh will panic for many bultin support board.

xiaoxiang781216 avatar Mar 05 '24 16:03 xiaoxiang781216

@tmedicci do you know how much PR #11777 increased the stack usage? Just looking that PR was impossible to know this side effect. I think we need some CI script to compare and show how much each PR are impacting in flash/RAM size and stack usage. @anchao any idea why your PR is increasing the stack?

I didn't measure it directly, but it seems to be very reasonable that it would have been increased by the size of struct task_info_s. The problem itself appears when running the built-in ps command in the NuttShell.

could we hold 2KB default stack size? or we have to increase the default task stack size, instead updating every defconfig.

Usually, 2KB is sufficient for most of the applications. The problem is specific to the init task, which is the nsh and its built-in commands. I prefer not to increase the default stack size...

If so, it's better to change the default stack size for nsh, the default value should be suitable for the common usage. Newbie will surprise that ps in nsh will panic for many bultin support board.

Stack usage is specific for each arch (there are minor changes about it depending on the arch and selected configs). One could change nsh stack size directly, but when CONFIG_INIT_ENTRYPOINT="nsh_main" (the most common), it would be overridden by CONFIG_INIT_STACKSIZE (which is the one I changed).

@tmedicci @xiaoxiang781216 if ps is failing after that PR then for some reason is increased the stack size. Espressif has an internal CI that does some hardware tests. I think having a low stack value defined in the config is useful to catch issues like this.

Yes, it is... but even before this implementation, stack usage was near its limits (it even failed for specific defconfigs that had their CONFIG_INIT_STACKSIZE increased before this PR). Our internal CI ensures that Espressif's devices can boot and run simple commands (like ps and free), but we can not ensure this is valid for all other chips. We can think on a tool to evaluate this when submitting new PRs, but I don't think it fits this PR's purposes.

tmedicci avatar Mar 05 '24 17:03 tmedicci

emm... strange, PR #11777 just move the instance of struct task_info_s into struct task_group_s:

struct task_group_s
{
...
  /* Thread local storage ***************************************************/

#ifndef CONFIG_MM_KERNEL_HEAP
  struct task_info_s tg_info_;
#endif
  FAR struct task_info_s *tg_info;
};

The instantiation of struct task_group_s comes from heap (malloc), and it does not make temporary declarations in the stack:

int group_allocate(FAR struct task_tcb_s *tcb, uint8_t ttype)
{
  FAR struct task_group_s *group;
...

  /* Allocate the group structure and assign it to the TCB */

  group = kmm_zalloc(sizeof(struct task_group_s));
  if (!group)
    {
      return -ENOMEM;
    }
...
}

In addition, I checked on the sim. It seems the stack of nsh has not changed. does esp board will declare struct task_group_s in the stack directly?

Before patch #11777:

NuttShell (NSH) NuttX-10.4.0
MOTD: username=admin password=Administrator
nsh> ps
  PID GROUP PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK           STACK   USED  FILLED COMMAND
    0     0   0 FIFO     Kthread   - Ready              0000000000000000 069616 004632   6.6%  Idle_Task
    1     1 224 FIFO     Kthread   - Waiting  Signal    0000000000000000 067536 004728   7.0%  loop_task
    2     2 224 FIFO     Kthread   - Waiting  Semaphore 0000000000000000 067504 000888   1.3%  hpwork 0x5019e0 0x501a08
    3     3 100 FIFO     Task      - Running            0000000000000000 067536 003544   5.2%  nsh_main

After patch #11777:

NuttShell (NSH) NuttX-10.4.0
MOTD: username=admin password=Administrator
nsh> ps
  PID GROUP PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK           STACK   USED  FILLED COMMAND
    0     0   0 FIFO     Kthread   - Ready              0000000000000000 069616 004632   6.6%  Idle_Task
    1     1 224 FIFO     Kthread   - Waiting  Signal    0000000000000000 067536 004680   6.9%  loop_task
    2     2 224 FIFO     Kthread   - Waiting  Semaphore 0000000000000000 067504 000888   1.3%  hpwork 0x501d00 0x501d28
    3     3 100 FIFO     Task      - Running            0000000000000000 067536 003544   5.2%  nsh_main

anchao avatar Mar 06 '24 00:03 anchao

@anchao thank you for verifying! @tmedicci please try to verify before and after this patch, maybe it could be caused by some other commit.

acassis avatar Mar 06 '24 00:03 acassis

@anchao thank you for verifying! @tmedicci please try to verify before and after this patch, maybe it could be caused by some other commit.

I found this issue through bisect and this was the specific commit that triggered the exception, but what @anchao said makes sense: it does not seem to increase the stack usage (I'm sorry for stating that before). The point is (I've said earlier) that we already had occasional exceptions and increased the init task's stack size on ESP32-S2, ESP32-S3, and even on some of the ESP32 boards. Building it with CONFIG_STACK_COLORATION shows a stack usage of 100% (or very close to that), before and after the commit. Probably, this commit just triggered the issue because memory nearby wasn't being used and the stack overflow did not trigger the exception, which is not true anymore. The init task's stack size still needs to be increased...

tmedicci avatar Mar 06 '24 12:03 tmedicci

@tmedicci you can try a new tool here: https://github.com/apache/nuttx/blob/master/sched/Kconfig#L1306-L1313 which could catch the deepest callstack automatically, not only the max stack usage.

xiaoxiang781216 avatar Mar 07 '24 12:03 xiaoxiang781216

@tmedicci you can try a new tool here: https://github.com/apache/nuttx/blob/master/sched/Kconfig#L1306-L1313 which could catch the deepest callstack automatically, not only the max stack usage.

Thanks for the tip, @xiaoxiang781216 !

tmedicci avatar Mar 07 '24 13:03 tmedicci