tests: add test for context switch from ISR
Contribution description
The new tests triggers context switching from ISR with a high load with enabled stack tests. If the context switching code does not disable IRQs until a context switch is fully completed, an IRQ might be served while not all the saved context is consumed from the stack. Normally, this is not an issue, as the context switch is just resumed after the ISR completes. However, under high load data on the stack can accumulate and eventually overflow.
This tests puts the board under high IRQ load to see if such stack overflows can happen.
Testing procedure
make BOARD=<SOME_BOARD> -C tests/isr_context_switch flash test
Issues/PRs references
Came up in https://github.com/RIOT-OS/RIOT/pull/16211
Looks good, can you add a README to the test? Something like your PR description would be enough IMO.
All my boards passed the test, I'll ACK once there is a small README, please trigger the CI afterward @maribu
ping @maribu
Sorry for the delay. Sadly, there is indeed a bug in the AVR context switch implementation, as @nandojve pointed out, that this test doesn't trigger.
I'll try to spice up the test first. @nandojve found a way to do so, but it depends on a platform being able to trigger IRQs on GPIOs by configuring them as output and controlling them. While this works fine on AVR, I don't really want to have this test relying on this.
@nandojve: I think this was actually the approach used for context switching some time ago for AVR, but it resulting in a GPIO pin and a precious external interrupt being used. It was actually quite smart, as the context pushed to stack would be identical regardless whether it was triggered from within an ISR, or synchron by a "soft IRQ". Reverting to this implementation would solve the issue again. I have some ideas on how to fake a soft IRQ without relying on real hardware. This would solve the issue and use avrlibc for saving and restoring the context, getting rid of a lot of inline assembly in RIOT. It would also ease adding new AVR boards that are already supported by the avrlibc.
I'll rebase on master to get make compile-commands in this branch.
I'll rebase on
masterto getmake compile-commandsin this branch.
@maribu , I would like remind that if there is a possibility to use nested ISR I would say IRQ0 should be the answer. Why? when user enable IRQ any high priority can preempt current one. If context switch is on IRQ0 ( highest priority ) no one can preempt it and the bug is solved for that scenario too.
OK, I basically rewrote the test. This test is no able to confirm the issue for ATxmega board. However, I still fail to trigger the stack overflow on an Arduino Mega2560.
Since I had to use a hardware timer directly to trigger the issue reliably (well, for my atxmega-a1u-xpro at least), the test ended up being a bit more painful. It might be needed to adjust some test parameters for some platforms to get the test working :-(
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.
still relevant I guess
Looks like on the samr21-xpro the test actually found an issue
Tested successfully with an arduino-uno and nrf52dk target. On the seeedstudio-xiao-nrf52840, this test will cause the device to crash and return to the bootloader.
This is what I get in hterm, I did not press anything.
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
READY
Help: Press s to start test, r to print it is ready
START
main(): This is RIOT! (Version: 2025.04-devel-385-gada67-tests/isr_context_switch)
Testing 500 context switches triggered from ISR
INFO: timer running at 1000000 Hz
{ "threads": [{ "name": "t1", "stack_size":
Edit: On an nrf52840dongle the test runs successfully:
/home/chris/RIOTstuff/riot-isr-context/RIOT/dist/tools/pyterm/pyterm -p "/dev/ttyACM0" -b "115200" -ln "/tmp/pyterm-chris" -rn "2025-04-02_18.04.36-tests_isr_context_switch-nrf52840dongle" --no-reconnect --noprefix --no-repeat-command-on-empty-line
Twisted not available, please install it if you want to use pyterm's JSON capabilities
Connect to serial port /dev/ttyACM0
Welcome to pyterm!
Type '/exit' to exit.
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
READY
Help: Press s to start test, r to print it is ready
START
main(): This is RIOT! (Version: 2025.04-devel-385-gada67-tests/isr_context_switch)
Testing 500 context switches triggered from ISR
INFO: timer running at 1000000 Hz
TEST PASSED
{ "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 468}]}
{ "threads": [{ "name": "t2", "stack_size": 256, "stack_used": 168}]}
{ "threads": [{ "name": "t1", "stack_size": 256, "stack_used": 168}]}
s
make: Verzeichnis „/home/chris/RIOTstuff/riot-isr-context/RIOT/tests/isr_context_switch“ wird verlassen
Maybe relevant: I rebased the PR before testing (otherwise I couldn't have tested with the Xiao).
Testing was also positive on nucleo-f030r8 and nucleo-f302r8. I can test it on other Nucleos as well if required.
Murdock results
:heavy_check_mark: PASSED
eb1d7a371a7b4e66a3edfc0ee2d1651e3a293414 fixup! tests/core: add test for context switching from ISR
| Success | Failures | Total | Runtime |
|---|---|---|---|
| 25 | 0 | 25 | 01m:49s |
Artifacts
Maybe relevant: I rebased the PR before testing (otherwise I couldn't have tested with the Xiao).
Yeah, with the PR being that old, it really makes sense to rebase it. I just did so now to make use of the "new" test folder structure and the periph_timer_query_freqs feature.
Testing was also positive on nucleo-f030r8 and nucleo-f302r8. I can test it on other Nucleos as well if required.
Since this is more about the implementation of the context switching, I would expect all MCUs of a given architecture to either fail or all to pass, unless there are bugs in the test or in the periph_timer implementation.
This is why I'm more than a bit puzzled about the test failure on the seeedstudio-xiao-nrf52840, when it does work on other boards with the same MCU.
Looks like on the samr21-xpro the test actually found an issue
It just seems to be slow:
2025-04-02 21:42:45,922 # Help: Press s to start test, r to print it is ready
s
2025-04-02 21:42:48,281 # START
2025-04-02 21:42:48,288 # main(): This is RIOT! (Version: 2025.04-devel-388-g40f90-tests/isr_context_switch)
2025-04-02 21:42:48,293 # Testing 500 context switches triggered from ISR
2025-04-02 21:42:48,296 # INFO: timer running at 1000000 Hz
2025-04-02 21:43:20,986 # TEST PASSED
2025-04-02 21:43:20,993 # { "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 424 }]}
Whoopsie:
main(): This is RIOT! (Version: 2025.04-devel-388-g40f90-tests/isr_context_switch)
Testing 500 context switches triggered from ISR
INFO: timer running at 1000000 Hz
timeout short: 10, timeout long: 1000
scheduler(): stack overflow detected, pid=2
*** RIOT kernel panic:
STACK OVERFLOW
*** halted.
Inside isr -2
Interesting: Prefixing make with FEATURES_BLACKLIST=no_idle_thread solves the stack over flow on the samr21-xpro.
I'm not sure what changed, but now the test passes on the seeedstudio-xiao-nrf52840, but it still resets into the bootloader after the test:
/home/chris/RIOTstuff/riot-isr-context/RIOT/dist/tools/pyterm/pyterm -p "/dev/ttyACM0" -b "115200" -ln "/tmp/pyterm-chris" -rn "2025-04-03_11.41.43-tests_isr_context_switch-seeedstudio-xiao-nrf52840" --no-reconnect --noprefix --no-repeat-command-on-empty-line
Twisted not available, please install it if you want to use pyterm's JSON capabilities
Connect to serial port /dev/ttyACM0
Welcome to pyterm!
Type '/exit' to exit.
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
Help: Press s to start test, r to print it is ready
READY
START
main(): This is RIOT! (Version: 2025.04-devel-392-geb1d7-tests/isr_context_switch)
Testing 5000 context switches triggered from ISR
INFO: timer running at 16000000 Hz.
Timeout at start: 125, incrementing timeout in steps of: 31
s
TEST PASSED
{ "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 504}]}
Serial port disconnected, waiting to get reconnected...
Try to reconnect to /dev/ttyACM0 again...
Reconnected to serial port /dev/ttyACM0
Timeout in expect script at "child.expect("Testing 500 context switches triggered from ISR")" (tests/core/isr_context_switch/tests/01-run.py:20)
make: *** [/home/chris/RIOTstuff/riot-isr-context/RIOT/makefiles/tests/tests.inc.mk:26: test] Fehler 1
make: Verzeichnis „/home/chris/RIOTstuff/riot-isr-context/RIOT/tests/core/isr_context_switch“ wird verlassen
Perhaps it has to do with stdio_cdc_acm? Many tests blacklist that.
I will sadly have to change the test here again.
I'm able to cause a stack overflow when slightly changing the timing for pretty much any Cortex-M board. I would like to have this test fail reliably on every Cortex-M board, as long as the issue is there.
But the fact that this test can trigger the stack overflow if it hits the right timing in the context switch shows that the underlying strategy is working. I'll probably have to sweep with the timing gently over a wide range of timeouts and for each do enough repetitions to cause a stack overflow, if the stack would grow due to context switching. But that does sound like a pretty expensive test then.
Maybe some more evidence: I've noticed that with BOARD=nrf52840dk the thread stacksize info would be printed for t1 and t2 when using stdio_cdc_acm (and a second USB cable), but not without it.
With
diff --git a/tests/core/isr_context_switch/main.c b/tests/core/isr_context_switch/main.c
index 13399c0a06..e522a85dc0 100644
--- a/tests/core/isr_context_switch/main.c
+++ b/tests/core/isr_context_switch/main.c
@@ -81,6 +81,7 @@ static void *t1_impl(void *unused)
for (unsigned repetitions = 0; repetitions < TEST_REPETITIONS; repetitions++) {
mutex_lock(&sig_t1);
}
+ puts("t1 FINISHED");
return NULL;
}
@@ -91,6 +92,7 @@ static void *t2_impl(void *unused)
mutex_lock(&sig_t2);
}
mutex_unlock(&sig_main);
+ puts("t2 FINISHED");
return NULL;
}
I now get no stacksize information any longer (and no extra output) with USEMODULE=stdio_cdc_acm, and without it:
2025-04-10 14:14:35,418 # START
2025-04-10 14:14:35,423 # main(): This is RIOT! (Version: 2025.04-devel-392-geb1d7-HEAD)
2025-04-10 14:14:35,427 # Testing 5000 context switches triggered from ISR
2025-04-10 14:14:35,431 # INFO: timer running at 16000000 Hz.
2025-04-10 14:14:35,436 # Timeout at start: 125, incrementing timeout in steps of: 31
2025-04-10 14:14:38,130 # TEST PASSED
2025-04-10 14:14:38,135 # { "threads": [{ "name": "main", "stack_size": 1024, "stack_used": 396 }]}
2025-04-10 14:14:38,138 # t2*** RIOT kernel panic:
2025-04-10 14:14:38,139 # MEM MANAGE HANDLER
2025-04-10 14:14:38,139 #
2025-04-10 14:14:38,140 # *** halted.
2025-04-10 14:14:38,141 #
2025-04-10 14:14:38,142 # Inside isr -12