native not float safe

Open LudwigKnuepfer opened this issue 11 years ago • 20 comments

If the FPU is in use when an asynchronous context switch occurs, something goes wrong. As far as quick testing could reveal, either the stack gets corrupted or a floating point exception occurs.

Test case: test_irq with a float instead of an int for the thread local variable (may need a few runs).

LudwigKnuepfer avatar Jan 13 '14 09:01 LudwigKnuepfer

just to collect all traces:

Core was generated by `../RIOT/examples/ccn-lite-client/bin/native/ccn-lite-client.elf grid5x5_c1 -t 4'.
Program terminated with signal 8, Arithmetic exception.
#0  0x4a127baf in vfprintf () from /lib/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-20.fc19.i686
(gdb) bt
#0  0x4a127baf in vfprintf () from /lib/libc.so.6
#1  0x4a14e193 in vsnprintf () from /lib/libc.so.6
#2  0x0804a8fc in printf ()
#3  0x0805017f in thread_print_all ()
#4  0x0804f61c in handle_input_line ()
#5  0x0804f51f in shell_run ()
#6  0x0804ef4d in riot_ccn_runner ()
#7  0x0804ef83 in main ()

mehlis avatar Jan 17 '14 18:01 mehlis

Oops... a little FIX ME FIRST which wasn't fixed... I'll try to look at it asap.

kYc0o avatar Aug 04 '16 15:08 kYc0o

I'd say this is a won't fix and we close the issue. Anyone against?

kYc0o avatar Jul 17 '18 11:07 kYc0o

I'm against closing this one, as this is fixable (it just takes some heavy investigating). Just closing it with wontfix is a very lazy way of dealing with actual problems ;-).

miri64 avatar Jul 17 '18 12:07 miri64

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

stale[bot] avatar Aug 10 '19 08:08 stale[bot]

See https://github.com/RIOT-OS/RIOT/issues/495#issuecomment-405560534 and also https://github.com/RIOT-OS/RIOT/pull/11921#issuecomment-515882663 ;-)

miri64 avatar Sep 10 '19 12:09 miri64

I was about to open an issue when I saw this. On current master (e768a85f62), tests/thread_float (https://github.com/RIOT-OS/RIOT/tree/master/tests/thread_float) raises a floating point exception. I'm thinking this is the underlying issue experienced in #15870 and #15878.

My versions:

Operating System Environment
----------------------------
         Operating System: "Manjaro Linux" 
                   Kernel: Linux 5.10.19-1-MANJARO x86_64 unknown
             System shell: GNU bash, version 5.1.0(1)-release (x86_64-pc-linux-gnu)
             make's shell: GNU bash, version 5.1.0(1)-release (x86_64-pc-linux-gnu)

Installed compiler toolchains
-----------------------------
               native gcc: gcc (GCC) 10.2.0
        arm-none-eabi-gcc: arm-none-eabi-gcc (Arch Repository) 10.2.0
                  avr-gcc: missing
         mips-mti-elf-gcc: missing
           msp430-elf-gcc: missing
       riscv-none-elf-gcc: missing
  riscv64-unknown-elf-gcc: missing
     riscv-none-embed-gcc: missing
     xtensa-esp32-elf-gcc: missing
   xtensa-esp8266-elf-gcc: missing
                    clang: clang version 11.1.0

Installed compiler libs
-----------------------
     arm-none-eabi-newlib: "4.1.0"
      mips-mti-elf-newlib: missing
        msp430-elf-newlib: missing
    riscv-none-elf-newlib: missing
riscv64-unknown-elf-newlib: missing
  riscv-none-embed-newlib: missing
  xtensa-esp32-elf-newlib: missing
xtensa-esp8266-elf-newlib: missing
                 avr-libc: missing (missing)

Installed development tools
---------------------------
                   ccache: ccache version 4.2
                    cmake: cmake version 3.19.6
                 cppcheck: missing
                  doxygen: 1.9.1
                      git: git version 2.30.1
                     make: GNU Make 4.3
                  openocd: Open On-Chip Debugger 0.10.0
                   python: Python 3.9.2
                  python2: missing
                  python3: Python 3.9.2
                   flake8: error: /usr/bin/python3: No module named flake8
               coccinelle: missing

leandrolanzieri avatar Mar 15 '21 07:03 leandrolanzieri

I think I finally found what's going on with the corrupted FPU registers on native (I'm printing the ftag before every function call):

Found trace frame 29, tracepoint 2
$162 = "setcontext"
$163 = 0xffff
Found trace frame 30, tracepoint 5
$164 = "native_isr_entry"
$165 = 0xffff
Found trace frame 31, tracepoint 1
$166 = "makecontext"
$167 = 0xffff
Found trace frame 32, tracepoint 6
$168 = void
$169 = 0x1555                      <-- _native_sig_leave_tramp
Found trace frame 33, tracepoint 4
$170 = "swapcontext"
$171 = 0x1555                      <-- This `swapcontext` saves corrupted FPU state
Found trace frame 34, tracepoint 7
$172 = "native_irq_handler"
$173 = 0xffff
Found trace frame 35, tracepoint 2
$174 = "setcontext"
$175 = 0xffff
Found trace frame 36, tracepoint 1
$176 = "makecontext"
$177 = 0xffff
Found trace frame 37, tracepoint 4
$178 = "swapcontext"
$179 = 0xffff

AFAIU, although the Linux kernel avoids doing FPU operations, it might still perform some. In any case, the kernel does not care about user space FPU states. On a "real" native IRQ (not the "RIOT" one we artificially create with makecontext), the native process will call a trampoline function (_native_sig_leave_tramp) to invoke the "RIOT" ISR (native_handle_irq). At the beginning of _native_sig_leave_tramp, the FPU registers are already corrupted because the kernel does not restore the user space context. But for whatever reason we call swapcontext there, which successfully copies the corrupted FPU state into one of the thread stacks. At some point, setcontext or swapcontext will put the FPU registers into a wrong state, which will eventually raise SIGFPE.

IMO the solution is to use setcontext instead of swapcontext in the trampoline function. But this would require keeping the state of the last thread somewhere.
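
(Editorial illustration, not RIOT code.) The difference between the two calls can be seen in a minimal, generic ucontext sketch: swapcontext first stores the caller's current state into its first argument and only then activates the second, whereas setcontext only activates a context and stores nothing. The names worker, steps and demo_swapcontext are hypothetical, and the sketch assumes a libc that provides the ucontext functions (such as glibc). It does not reproduce the FPU corruption itself.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;
static char worker_stack[64 * 1024];
static int steps[4];   /* records the order in which code ran */
static int nsteps;

static void worker(void)
{
    steps[nsteps++] = 2;
    /* swapcontext() saves the *current* state into worker_ctx,
     * then activates main_ctx. Whatever state is live at this
     * point is what ends up stored in worker_ctx. */
    swapcontext(&worker_ctx, &main_ctx);
    steps[nsteps++] = 4;
    /* returning from worker() activates uc_link (main_ctx) */
}

int demo_swapcontext(void)
{
    nsteps = 0;

    int rc = getcontext(&worker_ctx);
    assert(rc == 0);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof(worker_stack);
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    steps[nsteps++] = 1;
    swapcontext(&main_ctx, &worker_ctx);   /* run worker until it yields */
    steps[nsteps++] = 3;
    swapcontext(&main_ctx, &worker_ctx);   /* resume worker where it stopped */

    return nsteps;
}
```

Calling demo_swapcontext() runs the recorded steps in the order 1, 2, 3, 4. A trampoline that used setcontext instead would not overwrite the saved context of the interrupted thread, which is why the thread's last good state would then have to be kept somewhere else.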

jia200x avatar Sep 22 '22 07:09 jia200x

IMO the solution is to use setcontext instead of swapcontext in the trampoline function. But this would require to keep the state of the last thread somewhere.

is there any point in the signal handling where the old context hasn't been corrupted, and can be stored?

kaspar030 avatar Sep 22 '22 08:09 kaspar030

is there any point in the signal handling where the old context hasn't been corrupted, and can be stored?

The corruption takes place between the end of the signal handler (_native_isr_entry) and the beginning of the trampoline function (_native_sig_leave_tramp).

Probably calling getcontext in the signal handler and setcontext in the trampoline should do the trick.

jia200x avatar Sep 22 '22 08:09 jia200x

hmmmm digging deeper into the topic, it seems the context arg of the signal handler stores the last valid FPU register state. In this patch they set the FP regs manually from uc_mcontext. Maybe that's the way to go.
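
(Editorial illustration, not RIOT code.) A minimal sketch of how a handler installed with SA_SIGINFO receives that context argument: the third parameter points to a ucontext_t describing the interrupted code. The names on_signal and probe_signal_context are hypothetical. Reading the x87 control word through uc_mcontext.fpregs is an assumption that only holds on Linux x86_64 with a libc exposing that layout, so it is guarded by the preprocessor. Whether restoring registers from this structure fixes the issue is not shown here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes the uc_mcontext field names on glibc */
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <ucontext.h>

static volatile sig_atomic_t got_context;
static volatile sig_atomic_t got_fpstate;
static volatile sig_atomic_t x87_control_word;

/* SA_SIGINFO handler: ctx is the ucontext_t of the interrupted code */
static void on_signal(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)info;

    ucontext_t *uc = ctx;
    got_context = (uc != NULL);

#if defined(__linux__) && defined(__x86_64__)
    /* Assumption: on Linux x86_64, fpregs points at the FP state
     * the kernel saved in the signal frame. */
    if (uc != NULL && uc->uc_mcontext.fpregs != NULL) {
        got_fpstate = 1;
        x87_control_word = uc->uc_mcontext.fpregs->cwd;
    }
#endif
}

/* Installs the handler, raises SIGUSR1 once, and reports whether
 * the handler was given a context pointer (1) or not (0). */
int probe_signal_context(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    int rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    if (raise(SIGUSR1) != 0) {
        return -1;
    }
    return got_context;
}
```

probe_signal_context() returns 1 when the handler was handed a context. On other architectures the FP part is simply skipped and got_fpstate stays 0.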

jia200x avatar Sep 22 '22 09:09 jia200x

In this patch they set the FP regs manually from uc_mcontext.

I think I need more context - how will the LoongArch UAPI help us here? Or do you suggest that this is indeed a bug in the Linux x86 UAPI and the kernel needs to be fixed here?

benpicco avatar Dec 06 '22 16:12 benpicco

Hello.

What is the status of this issue? I am debugging a floating point exception on the native board that might be related. I am on 2024.10.

Are there any known workarounds?

erlingrj avatar Mar 24 '25 10:03 erlingrj

It looks like #21283 improved things quite a bit there, not sure if the issue is entirely solved with that yet.

benpicco avatar Mar 24 '25 16:03 benpicco

Unfortunately not, just tested on master with

diff --git a/tests/core/irq/main.c b/tests/core/irq/main.c
index 74a47a2f16..a4ab1fb9e1 100644
--- a/tests/core/irq/main.c
+++ b/tests/core/irq/main.c
@@ -24,13 +24,14 @@
 #include "thread.h"
 
 static char busy_stack[THREAD_STACKSIZE_MAIN];
-static volatile int busy, i, k;
+static volatile int busy;
+static volatile float i, k;
 
 void *busy_thread(void *arg)
 {
     (void) arg;
 
-    int j = 0;
+    float j = 0;
     puts("busy_thread starting");
 
     i = 0;
@@ -39,9 +40,9 @@ void *busy_thread(void *arg)
         k = j = ++i;
     }
 
-    printf("i: %i\n", i);
-    printf("j: %i\n", j);
-    printf("k: %i\n", k);
+    printf("i: %f\n", i);
+    printf("j: %f\n", j);
+    printf("k: %f\n", k);
 
     puts("SUCCESS");
 
$ make -C tests/core/irq test    
make: Entering directory '/home/mikolai/TUD/Code/RIOT/tests/core/irq'
/home/mikolai/TUD/Code/RIOT/tests/core/irq/bin/native/tests_irq.elf  tap0 
RIOT native interrupts/signals initialized.
RIOT native board initialized.
RIOT native hardware initialization complete.

main(): This is RIOT! (Version: 2025.04-devel-292-gcd03cc)
START
busy_thread created
xtimer_wait()
busy_thread starting
main: return
{ "threads": [{ "name": "idle", "stack_size": 8192, "stack_used": 436 }]}
{ "threads": [{ "name": "main", "stack_size": 12288, "stack_used": 2452 }]}
make[1]: *** [/home/mikolai/TUD/Code/RIOT/Makefile.include:877: cleanterm] Floating point exception
Unexpected end of file in expect script at "child.expect('SUCCESS')" (tests/core/irq/tests/01-run.py:9)

Process already stopped
make: *** [/home/mikolai/TUD/Code/RIOT/makefiles/tests/tests.inc.mk:26: test] Error 1
make: Leaving directory '/home/mikolai/TUD/Code/RIOT/tests/core/irq'

mguetschow avatar Mar 24 '25 16:03 mguetschow

@derMihai ran into another issue with FPU/SSE registers not being correctly saved/restored by glibc's ucontext implementation, causing stack corruption. The code in question was nanopb, where GCC, even with -Og, decided to abuse FPU registers to reduce register pressure in non-floating point computations.

It is possible to use CFLAGS += -mgeneral-regs-only to prevent the use of FPU and SSE registers and solve the stack corruptions on context switch on native.

Sadly, -msoft-float will not work on GCC, as the ABI of the soft FPU there does make use of SSE registers. (It basically sticks with the hard float ABI.) On LLVM, it seems to be possible to actually use a soft FPU on x86_64 without using FPU/SSE registers.

maribu avatar Sep 17 '25 07:09 maribu

Is this bug only possible on x86? I am seeing some strange behavior in some code I am working on. The problem is with some code that is doing some floating point math on a STM32H7 using the hardware FPU.

Enoch247 avatar Nov 05 '25 20:11 Enoch247

This particular bug is indeed only possible on native.

There have been some crazy bugs with FPUs so far in the past, e.g. https://github.com/RIOT-OS/RIOT/pull/18641 or https://github.com/RIOT-OS/RIOT/pull/18697 come to my mind.

Are you sure you cannot reproduce the issue on other MCUs? Silicon errors aside, I would assume that FPU bugs would affect all ARM Cortex M7 / Cortex M4 CPUs, and not just STM32H7 MCUs.

maribu avatar Nov 05 '25 20:11 maribu

This particular bug is indeed only possible on native.

There have been some crazy bugs with FPUs so far in the past, e.g. #18641 or #18697 come to my mind.

Are you sure you cannot reproduce the issue on other MCUs? Silicon errors aside, I would assume that FPU bugs would affect all ARM Cortex M7 / Cortex M4 CPUs, and not just STM32H7 MCUs.

Thanks for the clarification. The bug (that I am dealing with) is difficult to understand, as almost any change I make in the code causes it to disappear. Yet it's intermittent enough that catching it with a debugger has also, so far, not been possible. Good to at least rule this out as a possible cause. Still digging...

Enoch247 avatar Nov 12 '25 00:11 Enoch247

as almost any change I make in the code causes it to disappear

I had a similar issue when the number of waitstates for Flash accesses was incorrect. For faster CPU frequencies you have to set a register to add waitstates. Sometimes it would run for hours, sometimes it would crash instantly, unrelated changes in the code changed the behavior significantly.

crasbe avatar Nov 12 '25 14:11 crasbe

Sorry for the off topic comments, but I did want to leave a breadcrumb for future reference... The short story is that my problem was not at all related to floats.

as almost any change I make in the code causes it to disappear

I had a similar issue when the number of waitstates for Flash accesses was incorrect. For faster CPU frequencies you have to set a register to add waitstates. Sometimes it would run for hours, sometimes it would crash instantly, unrelated changes in the code changed the behavior significantly.

Thanks for the heads up. I did double check the flash wait states and played around with changing them, based on your advice. Currently, the cause seems to be that ztimer timers are firing early for me. More on that here.

Enoch247 avatar Dec 12 '25 19:12 Enoch247