esp-idf icon indicating copy to clipboard operation
esp-idf copied to clipboard

(Spurious?) stack overflow crash when using esp_execute_shared_stack_function() (IDFGH-11212)

Open JamieDriver opened this issue 2 years ago • 3 comments

Answers checklist.

  • [X] I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • [X] I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • [X] I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.1.1

Espressif SoC revision.

ESP32-D0WDQ6-V3 (revision v3.0)

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32 T-Display

Power Supply used.

USB

What is the expected behavior?

Run the function N times using the 16kb temporary/shared stack, and complete with ~8kb of unused stack space remaining (according to uxTaskGetStackHighWaterMark()) after each run.

What is the actual behavior?

Runs many times as expected (ie. uses 8kb of the 16kb stack, reports 8kb unused) - but will suddenly exit with 'stack overflow'.

Steps to reproduce.

  1. Edit source, set WITH_CONTENDING_TASKS to 0

  2. Flash the code, boot with monitor attached

  3. Loop will run N times and complete. Every run reports 8kb free stack.

  4. Edit source, set WITH_CONTENDING_TASKS to 1(to ensure context switch?)

  5. Flash the code, boot with monitor attached

  6. Loop will run many times reporting 8kb free stack - then main task (ie. using shared stack) will exit with 'stack-overflow'.

  7. NOTE: the same code running in the contending tasks also reports 8kb (of 16kb) free stack.

Code attached below.

Debug Logs.

I (322) main_task: Started on CPU0
I (332) main_task: Calling app_main()
T0: Invoking on task
T0: Calling fib(N) calc
T0: Fib(260) = 33930
Invoking on shared stack 0x3ffb5a9c
T1: Invoking on task
T0: Minimum free stack: 8064 bytes
T1: Calling fib(N) calc
main: Calling fib(N) calc
main: Fib(260) = 33930
T1: Fib(260) = 33930
T0: Invoking on task
T0: Calling fib(N) calc
T0: Fib(260) = 33930
main: Minimum free stack: 8212 bytes
T1: Minimum free stack: 8068 bytes
T0: Minimum free stack: 8064 bytes
Invoking on shared stack 0x3ffb5a9c
...
<many lines like the above as many iterations are good>
...
Invoking on shared stack 0x3ffb5a9c
T1: Invoking on task
T1: Calling fib(N) calc
main: Calling fib(N) calc
T1: Fib(260) = 33930
T0: Minimum free stack: 8064 bytes
main: Fib(260) = 33930
T0: Invoking on task
T0: Calling fib(N) calc
T0: Fib(260) = 33930
T1: Minimum free stack: 8068 bytes
main: Minimum free stack: 8212 bytes
T0: Minimum free stack: 8064 bytes
T1: Invoking on task
T1: Calling fib(N) calc
T1: Fib(260) = 33930
T0: Invoking on task
T0: Calling fib(N) calc
T0: Fib(260) = 33930
Invoking on shared stack 0x3ffb5a9c

***ERROR*** A stack overflow in task main has been detected.


Backtrace: 0x4008171a:0x3ffb4a80 0x40085775:0x3ffb4aa0 0x40087f62:0x3ffb4ac0 0x40086b73:0x3ffb4b40 0x400880b0:0x3ffb4b60 0x40088062:0x3ffb262c |<-CORRUPTED
0x4008171a: panic_abort at /mnt/data/esp511/esp-idf/components/esp_system/panic.c:452

0x40085775: esp_system_abort at /mnt/data/esp511/esp-idf/components/esp_system/port/esp_system_chip.c:84

0x40087f62: vApplicationStackOverflowHook at /mnt/data/esp511/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:581

0x40086b73: vTaskSwitchContext at /mnt/data/esp511/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:3728

0x400880b0: _frxt_dispatch at /mnt/data/esp511/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/portasm.S:450

0x40088062: _frxt_int_exit at /mnt/data/esp511/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/portasm.S:245

ELF file SHA256: 6881c8e2a495b381

Rebooting...

More Information.

If in components/freertos/FreeRTOS-Kernel/tasks.c I comment out taskFIRST_CHECK_FOR_STACK_OVERFLOW() the error does not occur and the code runs to completion. ie. the sentinel/canary bytes are undamaged and that check passes ok.

        /* Check for stack overflow, if configured. */
        #ifdef ESP_PLATFORM
            taskFIRST_CHECK_FOR_STACK_OVERFLOW();
            taskSECOND_CHECK_FOR_STACK_OVERFLOW();
  1. In the attached example app the crash appears at approximately the same time after boot - in our actual application it is much more randomly distributed - perhaps suggesting conflict with some other task ?

  2. The example is very fragile - eg. attempting to add more printf()'s or more contextual information to the existing printfs can cause the crash to not occur.

JamieDriver avatar Oct 10 '23 08:10 JamieDriver

/*
 * Call tuned to require ~8kb stack
 * 'shared stack' sized to 16kb
 *
 * If run with WITH_CONTENDING_TASKS set to 0, call made N times on the
 * shared stack, no errors seen, stack usage as expected - ie. 8kb free stack
 *
 * If run with WITH_CONTENDING_TASKS set to 1, simultaneous tasks (with a higher
 * priority) are run on each core.  For convenience they are invoking the same call.
 * The tasks also show 8kb free stack as expected.
 *
 * *HOWEVER* at some point the main task (shared stack) will 'stack overflow'.
 *
 * NOTE:
 * commenting .../esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:3728
 * call to  'taskFIRST_CHECK_FOR_STACK_OVERFLOW()' prevents the issue - so the
 * canary/sentinel bytes appear untouched.
 *
 * NOTE:
 * demonstration is very 'fragile' - attempts to printf() more inforamtion can
 * easily result in the problem not manifesting.
 *
 */

#include <stdio.h>
#include <inttypes.h>
#include "sdkconfig.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_chip_info.h"
#include "esp_flash.h"

#include "esp_expression_with_stack.h"

#define WITH_CONTENDING_TASKS 1

// N is tuned so fn(N) uses ~8kb stack
static const size_t N = 260;

// Shared- and task-  stacks that have twice the amount of stack needed
// so should report ~8kb stack unused after fn() fn completed
static const size_t STACK_SIZE = 16 * 1024;

// How many times to repeat the fn() call per loop
static const size_t REPEATS = 100;

// Simple recursive fn to consume stack space
size_t fn(const size_t n)
{
    return n < 2 ? 1 : n + fn(n-1);
}

void fn_caller(void)
{
    printf("%s: Calling fn(N) calc\n", pcTaskGetName(NULL));
    printf("%s: Fib(%u) = %u\n", pcTaskGetName(NULL), N, fn(N));
    vTaskDelay(10 / portTICK_PERIOD_MS);
    printf("%s: Minimum free stack: %u bytes\n", pcTaskGetName(NULL), uxTaskGetStackHighWaterMark(NULL));
}

void task_worker(void* unused)
{
    // Tasks may as well run the same code but I believe anything would do, so long
    // as they are active and not sleeping (and also actively using their stacks?)
    // Also handyas takes same length of time to run ...
    for (int i = REPEATS; i >= 0; i--) {
        printf("%s: Invoking on task\n", pcTaskGetName(NULL));
        fn_caller();
        vTaskDelay(10 / portTICK_PERIOD_MS);
    }
    vTaskDelete(NULL);
    vTaskDelay(portMAX_DELAY);
}

void app_main(void)
{
    // NOTE: main task default stack: 3kb

    // NOTE: worker fn + param tuned to use ~8kb stack

    // Shared stack: should have ~8kb excess/unused stack
    SemaphoreHandle_t stack_mutex = NULL;
    stack_mutex = xSemaphoreCreateMutex();
    void* const shared_stack = malloc(STACK_SIZE);

    if (WITH_CONTENDING_TASKS) {
        // Run tasks on each core using same amount of stack space
        // NOTE: task priority high enough to compete with main task - context switching should occur
        // Since these are running similar tasks on similar stacks, we expect a ~8kb free stack
        TaskHandle_t t0;
        const BaseType_t retval0 = xTaskCreatePinnedToCore(task_worker, "T0", STACK_SIZE, NULL,
            tskIDLE_PRIORITY + 3, &t0, 0);
        assert(retval0 == pdPASS);

        TaskHandle_t t1;
        const BaseType_t retval1 = xTaskCreatePinnedToCore(task_worker, "T1", STACK_SIZE, NULL,
            tskIDLE_PRIORITY + 3, &t1, 1);
        assert(retval1 == pdPASS);
    }

    // Invoke function repeatedly on shared stack
    // Expecting ~8kb free unused stack after each run
    for (int i = REPEATS; i >= 0; i--) {
        printf("Invoking on shared stack %p\n", shared_stack);
        esp_execute_shared_stack_function(stack_mutex, shared_stack, STACK_SIZE, fn_caller);
        vTaskDelay(10 / portTICK_PERIOD_MS);
    }

    // If contending tasks running, main task usually dies with (spurious?) stack-overflow before this point
    printf("DONE!\n");

    free(shared_stack);
    vSemaphoreDelete(stack_mutex);

    printf("%s: MAIN STACK HWM: %u bytes\n", pcTaskGetName(NULL), uxTaskGetStackHighWaterMark(NULL));
}

JamieDriver avatar Oct 10 '23 08:10 JamieDriver

Hi @JamieDriver! Thank you for the report. Yes, I am able to reproduce this issue on master too. I will further investigate this issue.

KonstantinKondrashov avatar Oct 10 '23 13:10 KonstantinKondrashov

Recently I've faced the same issue when I was trying to implement logging with external (shared) stack. There are spurious stack overflows in the main task, especially when some heavy tasks are running (like Bluetooth). I'm using ESP-IDF v5.1.1, the chip is ESP32.

The issue is caused by a race condition inside esp_execute_shared_stack_function(). There are critical sections there, but they do not help much.

Let's look at these lines: https://github.com/espressif/esp-idf/blob/a45d713b03fd96d8805d1cc116f02a4415b360c7/components/esp_system/port/arch/xtensa/expression_with_stack.c#L53-L62

esp_switch_stack_setup() is called in the critical section. This function modifies the pxStack (or pxDummy6) field in the TCB so that it now points to the shared stack. But the actual stack pointer is modified outside the critical section, inside of esp_shared_stack_invoke_function(): https://github.com/espressif/esp-idf/blob/a45d713b03fd96d8805d1cc116f02a4415b360c7/components/esp_system/port/arch/xtensa/expression_with_stack_asm.S#L28-L29

Because of that, the following situation is possible for a short period of time when pxStack has been modified, but sp still has the old value:

               sp
               v
.......[ Task stack ].....[ Shared stack ]......
       ^                  ^
     pxStack           pxStack
    (original)          (new)

If during this period of time a context switching occurs, there will be a spurious stack overflow exception because this check will fail: https://github.com/espressif/esp-idf/blob/a45d713b03fd96d8805d1cc116f02a4415b360c7/components/freertos/FreeRTOS-Kernel/include/freertos/stack_macros.h#L66

pxTopOfStack is basically the current sp saved by a scheduler during context switching. And as illustrated by the ASCII graphics above, sp is less than pxStack.

To prevent the race condition, we should modify pxStack and sp within a single critical section, so that the scheduler always sees them synchronized with each other.

I solved the issue by basically rewriting expression_with_stack.c and expression_with_stack_asm.S from scratch. Here are the most significant changes:

  • Struct xtensa_shared_stack_ctx_t is introduced to store the context of the shared stack execution (the stack pointers, jmp_buf, etc.). This structure is allocated at the top of the shared stack. The space for it is reserved by decrementing the top_of_stack in esp_switch_stack_setup().
  • pxStack, sp and the stack watchpoint are modified atomically in critical sections to avoid race condition.
  • As the context is part of the shared stack, it is guarded by a mutex provided to esp_execute_shared_stack_function(). This makes the shared stack execution API thread safe (there is a discussion about thread safety of the current implementation).

I also had to introduce wrappers for portENTER_CRITICAL(), portEXIT_CRITICAL() and vPortSetStackWatchpoint() to be able to call them from assembly. It is impossible to use macros and static inline functions directly.

Here is a diff file with my changes which can be applied using git apply. This diff is for v5.1.1, but, as I can see, the code hasn't changed much since then. fix-shared-stack_v1.zip

If these changes are valid and worth contributing into ESP-IDF, please let me know, I will open a PR (or at least try, as I'm not very familiar with all this stuff like pre-commit hooks).

Trigger239 avatar Mar 27 '25 22:03 Trigger239