RIOT icon indicating copy to clipboard operation
RIOT copied to clipboard

tests: add test for context switch from ISR

Open maribu opened this issue 4 years ago • 19 comments

Contribution description

The new tests triggers context switching from ISR with a high load with enabled stack tests. If the context switching code does not disable IRQs until a context switch is fully completed, an IRQ might be served while not all the saved context is consumed from the stack. Normally, this is not an issue, as the context switch is just resumed after the ISR completes. However, under high load data on the stack can accumulate and eventually overflow.

This tests puts the board under high IRQ load to see if such stack overflows can happen.

Testing procedure

make BOARD=<SOME_BOARD> -C tests/isr_context_switch flash test

Issues/PRs references

Came up in https://github.com/RIOT-OS/RIOT/pull/16211

maribu avatar Apr 06 '21 19:04 maribu

Looks good, can you add a README to the test? Something like your PR description would be enough IMO.

fjmolinas avatar Apr 07 '21 06:04 fjmolinas

All my boards passed the test, I'll ACK once there is a small README, please trigger the CI afterward @maribu

fjmolinas avatar Apr 07 '21 07:04 fjmolinas

ping @maribu

fjmolinas avatar Apr 20 '21 07:04 fjmolinas

Sorry for the delay. Sadly, there is indeed a bug in the AVR context switch implementation, as @nandojve pointed out, that this test doesn't trigger.

I'll try to spice up the test first. @nandojve found a way to do so, but it depends on a platform being able to trigger IRQs on GPIOs by configuring them as output and controlling them. While this works fine on AVR, I don't really want to have this test relying on this.

@nandojve: I think this was actually the approach used for context switching some time ago for AVR, but it resulting in a GPIO pin and a precious external interrupt being used. It was actually quite smart, as the context pushed to stack would be identical regardless whether it was triggered from within an ISR, or synchron by a "soft IRQ". Reverting to this implementation would solve the issue again. I have some ideas on how to fake a soft IRQ without relying on real hardware. This would solve the issue and use avrlibc for saving and restoring the context, getting rid of a lot of inline assembly in RIOT. It would also ease adding new AVR boards that are already supported by the avrlibc.

maribu avatar Apr 20 '21 09:04 maribu

I'll rebase on master to get make compile-commands in this branch.

maribu avatar Apr 20 '21 09:04 maribu

I'll rebase on master to get make compile-commands in this branch.

@maribu , I would like remind that if there is a possibility to use nested ISR I would say IRQ0 should be the answer. Why? when user enable IRQ any high priority can preempt current one. If context switch is on IRQ0 ( highest priority ) no one can preempt it and the bug is solved for that scenario too.

nandojve avatar Apr 20 '21 12:04 nandojve

OK, I basically rewrote the test. This test is no able to confirm the issue for ATxmega board. However, I still fail to trigger the stack overflow on an Arduino Mega2560.

Since I had to use a hardware timer directly to trigger the issue reliably (well, for my atxmega-a1u-xpro at least), the test ended up being a bit more painful. It might be needed to adjust some test parameters for some platforms to get the test working :-(

maribu avatar Apr 22 '21 06:04 maribu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

stale[bot] avatar Mar 02 '22 09:03 stale[bot]

still relevant I guess

kaspar030 avatar Mar 02 '22 10:03 kaspar030

Looks like on the samr21-xpro the test actually found an issue

maribu avatar May 30 '22 10:05 maribu

Tested successfully with an arduino-uno and nrf52dk target. On the seeedstudio-xiao-nrf52840, this test will cause the device to crash and return to the bootloader.

This is what I get in hterm, I did not press anything.

Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
READY
Help: Press s to start test, r to print it is ready
START
main(): This is RIOT! (Version: 2025.04-devel-385-gada67-tests/isr_context_switch)
Testing 500 context switches triggered from ISR
INFO: timer running at 1000000 Hz
{ "threads": [{ "name": "t1", "stack_size":

Edit: On an nrf52840dongle the test runs successfully:

/home/chris/RIOTstuff/riot-isr-context/RIOT/dist/tools/pyterm/pyterm -p "/dev/ttyACM0" -b "115200" -ln "/tmp/pyterm-chris" -rn "2025-04-02_18.04.36-tests_isr_context_switch-nrf52840dongle" --no-reconnect --noprefix --no-repeat-command-on-empty-line 
Twisted not available, please install it if you want to use pyterm's JSON capabilities
Connect to serial port /dev/ttyACM0
Welcome to pyterm!
Type '/exit' to exit.
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
READY
Help: Press s to start test, r to print it is ready
START
main(): This is RIOT! (Version: 2025.04-devel-385-gada67-tests/isr_context_switch)
Testing 500 context switches triggered from ISR
INFO: timer running at 1000000 Hz
TEST PASSED
{ "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 468}]}
{ "threads": [{ "name": "t2", "stack_size": 256, "stack_used": 168}]}
{ "threads": [{ "name": "t1", "stack_size": 256, "stack_used": 168}]}
s

make: Verzeichnis „/home/chris/RIOTstuff/riot-isr-context/RIOT/tests/isr_context_switch“ wird verlassen

crasbe avatar Apr 02 '25 16:04 crasbe

Maybe relevant: I rebased the PR before testing (otherwise I couldn't have tested with the Xiao).

Testing was also positive on nucleo-f030r8 and nucleo-f302r8. I can test it on other Nucleos as well if required.

crasbe avatar Apr 02 '25 18:04 crasbe

Murdock results

:heavy_check_mark: PASSED

eb1d7a371a7b4e66a3edfc0ee2d1651e3a293414 fixup! tests/core: add test for context switching from ISR

Success Failures Total Runtime
25 0 25 01m:49s

Artifacts

riot-ci avatar Apr 02 '25 19:04 riot-ci

Maybe relevant: I rebased the PR before testing (otherwise I couldn't have tested with the Xiao).

Yeah, with the PR being that old, it really makes sense to rebase it. I just did so now to make use of the "new" test folder structure and the periph_timer_query_freqs feature.

Testing was also positive on nucleo-f030r8 and nucleo-f302r8. I can test it on other Nucleos as well if required.

Since this is more about the implementation of the context switching, I would expect all MCUs of a given architecture to either fail or all to pass, unless there are bugs in the test or in the periph_timer implementation.

This is why I'm more than a bit puzzled about the test failure on the seeedstudio-xiao-nrf52840, when it does work on other boards with the same MCU.

Looks like on the samr21-xpro the test actually found an issue

It just seems to be slow:

2025-04-02 21:42:45,922 # Help: Press s to start test, r to print it is ready
s
2025-04-02 21:42:48,281 # START
2025-04-02 21:42:48,288 # main(): This is RIOT! (Version: 2025.04-devel-388-g40f90-tests/isr_context_switch)
2025-04-02 21:42:48,293 # Testing 500 context switches triggered from ISR
2025-04-02 21:42:48,296 # INFO: timer running at 1000000 Hz
2025-04-02 21:43:20,986 # TEST PASSED
2025-04-02 21:43:20,993 # { "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 424 }]}

maribu avatar Apr 02 '25 19:04 maribu

Whoopsie:

main(): This is RIOT! (Version: 2025.04-devel-388-g40f90-tests/isr_context_switch)
Testing 500 context switches triggered from ISR
INFO: timer running at 1000000 Hz
timeout short: 10, timeout long: 1000
scheduler(): stack overflow detected, pid=2
*** RIOT kernel panic:
STACK OVERFLOW

*** halted.

Inside isr -2

maribu avatar Apr 02 '25 20:04 maribu

Interesting: Prefixing make with FEATURES_BLACKLIST=no_idle_thread solves the stack over flow on the samr21-xpro.

maribu avatar Apr 02 '25 20:04 maribu

I'm not sure what changed, but now the test passes on the seeedstudio-xiao-nrf52840, but it still resets into the bootloader after the test:

/home/chris/RIOTstuff/riot-isr-context/RIOT/dist/tools/pyterm/pyterm -p "/dev/ttyACM0" -b "115200" -ln "/tmp/pyterm-chris" -rn "2025-04-03_11.41.43-tests_isr_context_switch-seeedstudio-xiao-nrf52840" --no-reconnect --noprefix --no-repeat-command-on-empty-line 
Twisted not available, please install it if you want to use pyterm's JSON capabilities
Connect to serial port /dev/ttyACM0
Welcome to pyterm!
Type '/exit' to exit.
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
READY
START
main(): This is RIOT! (Version: 2025.04-devel-392-geb1d7-tests/isr_context_switch)
Testing 5000 context switches triggered from ISR
INFO: timer running at 16000000 Hz.
Timeout at start: 125, incrementing timeout in steps of: 31
s
TEST PASSED
{ "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 504}]}
Serial port disconnected, waiting to get reconnected...
Try to reconnect to /dev/ttyACM0 again...
Reconnected to serial port /dev/ttyACM0
Timeout in expect script at "child.expect("Testing 500 context switches triggered from ISR")" (tests/core/isr_context_switch/tests/01-run.py:20)

make: *** [/home/chris/RIOTstuff/riot-isr-context/RIOT/makefiles/tests/tests.inc.mk:26: test] Fehler 1
make: Verzeichnis „/home/chris/RIOTstuff/riot-isr-context/RIOT/tests/core/isr_context_switch“ wird verlassen

Perhaps it has to do with stdio_cdc_acm? Many tests blacklist that.

crasbe avatar Apr 03 '25 09:04 crasbe

I will sadly have to change the test here again.

I'm able to cause a stack overflow when slightly changing the timing for pretty much any Cortex-M board. I would like to have this test fail reliably on every Cortex-M board, as long as the issue is there.

But the fact that this test can trigger the stack overflow if it hits the right timing in the context switch shows that the underlying strategy is working. I'll probably have to sweep with the timing gently over a wide range of timeouts and for each do enough repetitions to cause a stack overflow, if the stack would grow due to context switching. But that does sound like a pretty expensive test then.

maribu avatar Apr 10 '25 11:04 maribu

Maybe some more evidence: I've noticed that with BOARD=nrf52840dk the thread stacksize info would be printed for t1 and t2 when using stdio_cdc_acm (and a second USB cable), but not without it.

With

diff --git a/tests/core/isr_context_switch/main.c b/tests/core/isr_context_switch/main.c
index 13399c0a06..e522a85dc0 100644
--- a/tests/core/isr_context_switch/main.c
+++ b/tests/core/isr_context_switch/main.c
@@ -81,6 +81,7 @@ static void *t1_impl(void *unused)
     for (unsigned repetitions = 0; repetitions < TEST_REPETITIONS; repetitions++) {
         mutex_lock(&sig_t1);
     }
+    puts("t1 FINISHED");
     return NULL;
 }
 
@@ -91,6 +92,7 @@ static void *t2_impl(void *unused)
         mutex_lock(&sig_t2);
     }
     mutex_unlock(&sig_main);
+    puts("t2 FINISHED");
     return NULL;
 }

I now get no stacksize information any longer (and no extra output) with USEMODULE=stdio_cdc_acm, and without it:

2025-04-10 14:14:35,418 # START
2025-04-10 14:14:35,423 # main(): This is RIOT! (Version: 2025.04-devel-392-geb1d7-HEAD)
2025-04-10 14:14:35,427 # Testing 5000 context switches triggered from ISR
2025-04-10 14:14:35,431 # INFO: timer running at 16000000 Hz.
2025-04-10 14:14:35,436 # Timeout at start: 125, incrementing timeout in steps of: 31
2025-04-10 14:14:38,130 # TEST PASSED
2025-04-10 14:14:38,135 # { "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 396 }]}
2025-04-10 14:14:38,138 # t2*** RIOT kernel panic:
2025-04-10 14:14:38,139 # MEM MANAGE HANDLER
2025-04-10 14:14:38,139 # 
2025-04-10 14:14:38,140 # *** halted.
2025-04-10 14:14:38,141 # 
2025-04-10 14:14:38,142 # Inside isr -12

mguetschow avatar Apr 10 '25 12:04 mguetschow