native not float safe
If the FPU is in use when an asynchronous context switch occurs, something goes wrong. As far as quick testing could reveal, either the stack gets corrupted or a floating point exception occurs.
Test case: test_irq with a float instead of an int for the thread local variable (may need a few runs).
just to collect all traces:
Core was generated by `../RIOT/examples/ccn-lite-client/bin/native/ccn-lite-client.elf grid5x5_c1 -t 4'.
Program terminated with signal 8, Arithmetic exception.
#0 0x4a127baf in vfprintf () from /lib/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-20.fc19.i686
(gdb) bt
#0 0x4a127baf in vfprintf () from /lib/libc.so.6
#1 0x4a14e193 in vsnprintf () from /lib/libc.so.6
#2 0x0804a8fc in printf ()
#3 0x0805017f in thread_print_all ()
#4 0x0804f61c in handle_input_line ()
#5 0x0804f51f in shell_run ()
#6 0x0804ef4d in riot_ccn_runner ()
#7 0x0804ef83 in main ()
Oops... a little FIX ME FIRST which wasn't fixed... I'll try to look at it asap.
I'd say this is a won't fix and we close the issue. Anyone against?
I'm against closing this one, as this is fixable (it just takes some heavy investigating). Just closing it with wontfix is a very lazy way of dealing with actual problems ;-).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.
See https://github.com/RIOT-OS/RIOT/issues/495#issuecomment-405560534 and also https://github.com/RIOT-OS/RIOT/pull/11921#issuecomment-515882663 ;-)
I was about to open an issue when I saw this. On current master (e768a85f62), tests/thread_float (https://github.com/RIOT-OS/RIOT/tree/master/tests/thread_float) raises a floating point exception. I think this is the underlying issue experienced in #15870 and #15878.
My versions:
Operating System Environment
----------------------------
Operating System: "Manjaro Linux"
Kernel: Linux 5.10.19-1-MANJARO x86_64 unknown
System shell: GNU bash, version 5.1.0(1)-release (x86_64-pc-linux-gnu)
make's shell: GNU bash, version 5.1.0(1)-release (x86_64-pc-linux-gnu)
Installed compiler toolchains
-----------------------------
native gcc: gcc (GCC) 10.2.0
arm-none-eabi-gcc: arm-none-eabi-gcc (Arch Repository) 10.2.0
avr-gcc: missing
mips-mti-elf-gcc: missing
msp430-elf-gcc: missing
riscv-none-elf-gcc: missing
riscv64-unknown-elf-gcc: missing
riscv-none-embed-gcc: missing
xtensa-esp32-elf-gcc: missing
xtensa-esp8266-elf-gcc: missing
clang: clang version 11.1.0
Installed compiler libs
-----------------------
arm-none-eabi-newlib: "4.1.0"
mips-mti-elf-newlib: missing
msp430-elf-newlib: missing
riscv-none-elf-newlib: missing
riscv64-unknown-elf-newlib: missing
riscv-none-embed-newlib: missing
xtensa-esp32-elf-newlib: missing
xtensa-esp8266-elf-newlib: missing
avr-libc: missing (missing)
Installed development tools
---------------------------
ccache: ccache version 4.2
cmake: cmake version 3.19.6
cppcheck: missing
doxygen: 1.9.1
git: git version 2.30.1
make: GNU Make 4.3
openocd: Open On-Chip Debugger 0.10.0
python: Python 3.9.2
python2: missing
python3: Python 3.9.2
flake8: error: /usr/bin/python3: No module named flake8
coccinelle: missing
I think I finally found what's going on with the corrupted FPU registers on native (I'm printing the ftag before every function call):
Found trace frame 29, tracepoint 2
$162 = "setcontext"
$163 = 0xffff
Found trace frame 30, tracepoint 5
$164 = "native_isr_entry"
$165 = 0xffff
Found trace frame 31, tracepoint 1
$166 = "makecontext"
$167 = 0xffff
Found trace frame 32, tracepoint 6
$168 = void
$169 = 0x1555 <-- _native_sig_leave_tramp
Found trace frame 33, tracepoint 4
$170 = "swapcontext"
$171 = 0x1555 <-- This `swapcontext` saves corrupted FPU state
Found trace frame 34, tracepoint 7
$172 = "native_irq_handler"
$173 = 0xffff
Found trace frame 35, tracepoint 2
$174 = "setcontext"
$175 = 0xffff
Found trace frame 36, tracepoint 1
$176 = "makecontext"
$177 = 0xffff
Found trace frame 37, tracepoint 4
$178 = "swapcontext"
$179 = 0xffff
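For reference, the ftag value printed in the trace above is the x87 tag word, with two bits per FPU register. 0xffff tags all eight registers as empty, while 0x1555 tags none of them as empty, so a later x87 push would overflow the register stack. Below is a minimal sketch of reading the same value from C without gdb. The function name `read_x87_tag_word` is made up for illustration, and the code assumes GCC or clang inline assembly on x86; on other targets it just returns -1.

```c
#include <stdint.h>

/* Hypothetical helper: returns the x87 tag word (gdb's $ftag),
 * or -1 when not built for x86. */
int read_x87_tag_word(void)
{
#if defined(__i386__) || defined(__x86_64__)
    /* FNSTENV stores a 28-byte environment (32-bit protected mode layout):
     * word 0 = control word, word 1 = status word, word 2 = tag word.
     * Only the low 16 bits of each 32-bit slot are meaningful here. */
    struct { uint32_t w[7]; } env;

    __asm__ volatile ("fnstenv %0" : "=m" (env));
    /* FNSTENV masks all x87 exceptions as a side effect, so load the
     * environment back to leave the FPU state as it was. */
    __asm__ volatile ("fldenv %0" : : "m" (env));

    return (int)(env.w[2] & 0xffffu);
#else
    return -1;
#endif
}
```

Calling this at the start of a function such as the trampoline and logging the result should give the same kind of information as the tracepoints, assuming the logging itself does not touch the x87 stack first.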
AFAIU although the Linux Kernel avoids doing FPU operations, it might still perform some. In any case, the Kernel does not care about user space FPU states.
On a "real" native IRQ (not the "RIOT" one we artificially create with makecontext), the native process will call a trampoline function (_native_sig_leave_tramp) to invoke the "RIOT" ISR (native_handle_irq). At the beginning of _native_sig_leave_tramp, the FPU regs are already corrupted because the kernel does not restore user space context. But for whatever reason we call swapcontext there, which successfully copies the corrupted FPU state into one of the thread stacks. At some point, setcontext or swapcontext will set the FPU register in a wrong state, which will eventually raise SIGFPE.
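To make the mechanics concrete, here is a minimal standalone sketch of the ucontext calls involved (getcontext, makecontext, swapcontext). It is not RIOT's native code and it does not reproduce the bug, since the switch here is synchronous. All names (`worker`, `run_worker_once`, the stack size) are made up for illustration. The point is only that swapcontext stores whatever is in the registers at the moment it is called, which, as far as I understand glibc's x86 implementation, includes the FPU environment. If that state is already clobbered, the clobbered state is what gets stored.

```c
#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;
static char worker_stack[64 * 1024];   /* arbitrary size for this sketch */
static volatile float worker_result;

/* Runs on its own stack and does a bit of floating point work. */
static void worker(void)
{
    float acc = 0.0f;
    for (int n = 1; n <= 4; n++) {
        acc += 0.5f * (float)n;        /* 0.5 + 1.0 + 1.5 + 2.0 */
    }
    worker_result = acc;
    /* Returning from here resumes the context in uc_link (main_ctx). */
}

/* Hypothetical helper: switch to the worker once and return its result. */
float run_worker_once(void)
{
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof(worker_stack);
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    /* Saves the current register state into main_ctx, then jumps to worker. */
    swapcontext(&main_ctx, &worker_ctx);
    return worker_result;
}
```

In the failing case described above, the equivalent of the swapcontext call runs in the trampoline, after the FPU registers have already been modified.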
IMO the solution is to use setcontext instead of swapcontext in the trampoline function. But this would require keeping the state of the last thread somewhere.
is there any point in the signal handling where the old context hasn't been corrupted, and can be stored?
The corruption takes place between the end of the signal handler (_native_isr_entry) and the beginning of the trampoline function (_native_sig_leave_tramp).
Probably calling getcontext in the signal handler and setcontext in the trampoline should do the trick.
hmmmm digging deeper into the topic, it seems the context arg of the signal handler stores the last valid FPU register state. In this patch they set the FP regs manually from uc_mcontext. Maybe that's the way to go.
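A minimal sketch of that idea, assuming Linux with glibc on x86, where `uc_mcontext.fpregs` is a pointer to a `struct _libc_fpstate` inside the signal frame: the handler copies the FPU state it is handed into its own storage before returning. The names `on_signal`, `saved_fpstate` and `capture_fpstate_once` are made up for illustration. Whether restoring from such a copy actually fixes the issue on native is untested here.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <ucontext.h>

static volatile sig_atomic_t handler_ran;
static volatile sig_atomic_t fpstate_saved;

#if defined(__linux__) && (defined(__i386__) || defined(__x86_64__))
/* Copy of the interrupted code's FPU state (glibc/x86 specific type). */
static struct _libc_fpstate saved_fpstate;
#endif

static void on_signal(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)info;
#if defined(__linux__) && (defined(__i386__) || defined(__x86_64__))
    ucontext_t *uc = ctx;
    /* fpregs points into the signal frame, so copy the data it points to,
     * not just the pointer. */
    if (uc->uc_mcontext.fpregs != NULL) {
        memcpy(&saved_fpstate, uc->uc_mcontext.fpregs, sizeof(saved_fpstate));
        fpstate_saved = 1;
    }
#else
    (void)ctx;
#endif
    handler_ran = 1;
}

/* Hypothetical helper: returns 1 if an FPU state was captured, 0 if the
 * handler ran without one, -1 on error. */
int capture_fpstate_once(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO;          /* needed to receive the context arg */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0) {
        return -1;
    }
    raise(SIGUSR1);                    /* handler runs before raise() returns */
    return handler_ran ? (int)fpstate_saved : -1;
}
```

The copy has to be made inside the handler because the signal frame, and with it the memory behind `fpregs`, is no longer valid once the handler has returned.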
I think I need more context - how will the LoongArch UAPI help us here? Or do you suggest that this is indeed a bug in the Linux x86 UAPI and the kernel needs to be fixed here?
Hello.
What is the status of this issue? I am debugging a floating point exception on the native board that might be related. I am on 2024.10.
Are there any known workarounds?
It looks like #21283 improved things quite a bit there, not sure if the issue is entirely solved with that yet.
Unfortunately not, just tested on master with
diff --git a/tests/core/irq/main.c b/tests/core/irq/main.c
index 74a47a2f16..a4ab1fb9e1 100644
--- a/tests/core/irq/main.c
+++ b/tests/core/irq/main.c
@@ -24,13 +24,14 @@
#include "thread.h"
static char busy_stack[THREAD_STACKSIZE_MAIN];
-static volatile int busy, i, k;
+static volatile int busy;
+static volatile float i, k;
void *busy_thread(void *arg)
{
(void) arg;
- int j = 0;
+ float j = 0;
puts("busy_thread starting");
i = 0;
@@ -39,9 +40,9 @@ void *busy_thread(void *arg)
k = j = ++i;
}
- printf("i: %i\n", i);
- printf("j: %i\n", j);
- printf("k: %i\n", k);
+ printf("i: %f\n", i);
+ printf("j: %f\n", j);
+ printf("k: %f\n", k);
puts("SUCCESS");
$ make -C tests/core/irq test
make: Entering directory '/home/mikolai/TUD/Code/RIOT/tests/core/irq'
/home/mikolai/TUD/Code/RIOT/tests/core/irq/bin/native/tests_irq.elf tap0
RIOT native interrupts/signals initialized.
RIOT native board initialized.
RIOT native hardware initialization complete.
main(): This is RIOT! (Version: 2025.04-devel-292-gcd03cc)
START
busy_thread created
xtimer_wait()
busy_thread starting
main: return
{ "threads": [{ "name": "idle", "stack_size": 8192, "stack_used": 436 }]}
{ "threads": [{ "name": "main", "stack_size": 12288, "stack_used": 2452 }]}
make[1]: *** [/home/mikolai/TUD/Code/RIOT/Makefile.include:877: cleanterm] Floating point exception
Unexpected end of file in expect script at "child.expect('SUCCESS')" (tests/core/irq/tests/01-run.py:9)
Process already stopped
make: *** [/home/mikolai/TUD/Code/RIOT/makefiles/tests/tests.inc.mk:26: test] Error 1
make: Leaving directory '/home/mikolai/TUD/Code/RIOT/tests/core/irq'
@derMihai ran into another issue with FPU/SSE registers not being correctly saved/restored by glibc's ucontext implementation, causing stack corruption. The code in question was nanopb, where GCC, even with -Og, decided to abuse FPU registers to reduce register pressure in non-floating-point computations.
It is possible to use CFLAGS += -mgeneral-regs-only to prevent the use of FPU and SSE registers and solve the stack corruptions on context switch on native.
Sadly, -msoft-float will not work on GCC, as the ABI of the soft FPU there does make use of SSE registers. (It basically sticks with the hard float ABI.) On LLVM, it seems to be possible to actually use a soft FPU on x86_64 without using FPU/SSE registers.
Is this bug only possible on x86? I am seeing some strange behavior in some code I am working on. The problem is with some code that is doing some floating point math on a STM32H7 using the hardware FPU.
This particular bug is indeed only possible on native.
There have been some crazy bugs with FPUs so far in the past, e.g. https://github.com/RIOT-OS/RIOT/pull/18641 or https://github.com/RIOT-OS/RIOT/pull/18697 come to my mind.
Are you sure you cannot reproduce the issue on other MCUs? Silicon errors aside, I would assume that FPU bugs would affect all ARM Cortex M7 / Cortex M4 CPUs, and not just STM32H7 MCUs.
Thanks for the clarification. The bug (that I am dealing with) is difficult to understand, as almost any change I make in the code causes it to disappear. Yet it's intermittent enough that catching it with a debugger has also, so far, not been possible. Good to at least rule this out as a possible cause. Still digging...
as almost any change I make in the code causes it to disappear
I had a similar issue when the number of waitstates for Flash accesses was incorrect. For faster CPU frequencies you have to set a register to add waitstates. Sometimes it would run for hours, sometimes it would crash instantly, unrelated changes in the code changed the behavior significantly.
Sorry for the off topic comments, but I did want to leave a breadcrumb for future reference... The short story is that my problem was not at all related to floats.
Thanks for the heads up. I did double check the flash wait states and played around with changing them, based on your advice. Currently, the cause seems to be that ztimer timers are firing early for me. More on that here.