esp-adf SOLVED: An ESP32(-S3) fast fp32 division using a reciprocal inline assembly with arg/result passsed in fp32 regs for even more speed! (AUD-5741)

ESP32(-S3) fp32 division is notoriously slow. It can be made faster several times by using a reciprocal asm sequence, which is accurate to 1 ULP - precise enough for most cases. ESP32(-S3) ABI specifies passing both func's input args and output value in general-purpose regs (A2-A15) - even for floats, but for inline assembly in C that may not be the case - tested various scenarios and both input and output are passed in fp32 regs (F0-F15) where possible, which surely speeds things up :) This code was inspired by https://blog.llandsmeer.com/tech/2021/04/08/esp32-s2-fpu.html, which I significantly enhanced:

2 asm instructions less (no wfr/rfr)
no fixed fp32 regs (gcc can freely choose which ones - allows more optimizations)
no static keyword for recipsf2() (is visible outside its source file)

Here it is with Public Domain License:

__attribute__((always_inline)) inline
float recipsf2(float input) {
    float result, temp;
    asm(
        "recip0.s %0, %2\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "madd.s %0, %0, %1\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "maddn.s %0, %0, %1\n"
        :"=&f"(result),"=&f"(temp):"f"(input)
    );
    return result;
}

#define DIV(a, b) (a)*recipsf2(b)

Cheers,

f4lc0n

Fixed: Added & for the temp var so that it is mapped to a unique fp32 reg (in some recipsf2() usage cases it wasn't). Changed: Removed volatile after asm. Changed: The 1st maddn.s to madd.s so that it corresponds to the canonical reciprocal sequence in Xtensa ISA Summary on p. 113.

Sep 29 '24 17:09 f4lc0n-asm

Ideally, this could be incorporated into Xtensa GCC. Activating the fast float division using a reciprocal with something like #pragma fast_float_div.

Oct 02 '24 18:10 f4lc0n-asm

To test DIV(a,b), I replaced q=(b1-b2)/(a1-a2); with q=DIV(b1-b2,a1-a2); and q=b1+(d-a1)*((d-a2)*((b2-b3)/(a2-a3)-q)/(a3-a1)+q); with q=b1+(d-a1)*((d-a2)*DIV(DIV(b2-b3,a2-a3)-q,a3-a1)+q); - the resulting assembly was MUCH more optimized, NO automatic vars on the stack used and func execution cycles dropped from 397 to 277, i.e. 1.43× faster! See float plinc(float d) in plinc.c in https://github.com/espressif/esp-dsp/files/14334535/dcomp_v1.2.zip at https://github.com/espressif/esp-dsp/issues/76. Also measured just the 2 expressions: 202 vs. 80 cycles and 144 vs. 37 instructions - i.e. 2½× faster and 3.9× smaller! Quite a surprise! :) If you know the Xtensa GCC folks, please, let them know! Thank you!

Oct 04 '24 17:10 f4lc0n-asm

@f4lc0n-asm We have synchronized your test to GCC folks and will inform you as soon as there is any news

Feb 21 '25 04:02 houhaiyan

@f4lc0n-asm , it seems you want to override __divsf3() function from libgcc.a (because the division is performed by calling this function).

You can easily achieve this without introducing DIV macro, all you need is to place somewhere in your sources:

#include "esp_attr.h"

IRAM_ATTR
float __divsf3(float a, float b) {
    float inversed_b, tmp;
    __asm__ volatile (
        "recip0.s %0, %2\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "madd.s %0, %0, %1\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "maddn.s %0, %0, %1\n"
        :"=&f"(inversed_b),"=&f"(tmp):"f"(b)
    );
    return a * inversed_b;
}

As a result, your implementation will be picked by linker, and build/*.map file can confirm that.

Feb 21 '25 08:02 Lapshin

@jcmvbkbc could you please take a look?

Is libgcc implementation about precision but not speed?

Feb 21 '25 08:02 Lapshin

@Lapshin Hello and thank you for the elegant solution suggestion. I included it my source and my .map file contains these lines:

.text.__divsf3 0x00000000 0x2a esp-idf/main/libmain.a(plinc.c.obj)
…
0x40002274 __divsf3 = 0x40002274

I then checked how these 2 expressions were compiled;

q=(b1-b2)/(a1-a2);
q=b1+(d-a1)*((d-a2)*((b2-b3)/(a2-a3)-q)/(a3-a1)+q);

When I dissasembled the resulting .elf file via objdump -d -x -s foo.elf, I got the following assembly listing:

l32r    a6, 40378728 <_iram_text_start+0x324> (3fc962f0 <y>)
ssi     f0, a1, 24
sub.s   f0, f3, f4
add.n   a14, a6, a14
add.n   a8, a6, a10
lsi     f1, a14, 0
lsi     f5, a8, 0
rfr     a11, f0
sub.s   f0, f1, f5
addi    a8, a13, 2
addx4   a12, a8, a12
ssi     f3, a1, 20
rfr     a10, f0
lsi     f0, a12, 0
ssi     f4, a1, 8
ssi     f5, a1, 4
ssi     f1, a1, 16
ssi     f0, a1, 0
slli    a2, a8, 2
l32r    a8, 40378758 <_iram_text_start+0x354> (40002274 <__divsf3>)
callx8  a8
lsi     f4, a1, 8
lsi     f0, a1, 0
add.n   a6, a6, a2
sub.s   f1, f4, f0
lsi     f5, a1, 4
lsi     f6, a6, 0
mov.n   a8, a10
rfr     a11, f1
sub.s   f1, f5, f6
ssi     f4, a1, 12
s32i.n  a8, a1, 4
rfr     a10, f1
l32r    a8, 40378758 <_iram_text_start+0x354> (40002274 <__divsf3>)
callx8  a8
lsi     f0, a1, 24
lsi     f4, a1, 12
lsi     f3, a1, 20
sub.s   f4, f0, f4
lsi     f1, a1, 4
ssi     f0, a1, 12
lsi     f0, a1, 0
wfr     f2, a10
sub.s   f0, f0, f3
sub.s   f5, f2, f1
ssi     f3, a1, 8
rfr     a11, f0
mul.s   f0, f5, f4
rfr     a10, f0
l32r    a8, 40378758 <_iram_text_start+0x354> (40002274 <__divsf3>)
callx8  a8
lsi     f1, a1, 4
wfr     f0, a10
add.s   f4, f0, f1
lsi     f3, a1, 8
lsi     f0, a1, 12
lsi     f1, a1, 16
sub.s   f3, f0, f3
s32i.n  a10, a1, 0
madd.s  f1, f3, f4

Here you can see that __divsf3() is called thrice with args passed in a10 and a11 - so it must be the original and slow libgcc.a! Additionally, you can see the multiple load/store for auto vars on stack to save/restore regs.

On the other hand, when using the DIV() macro, I got the following assembly listing:

l32r    a9, 40378728 <_iram_text_start+0x324> (3fc962f0 <y>)
add.n   a8, a9, a10
addx4   a10, a13, a9
lsi     f6, a8, 0
lsi     f5, a10, 8
add.n   a9, a9, a14
sub.s   f5, f6, f5
addx4   a13, a13, a11
lsi     f1, a9, 0
lsi     f7, a13, 8
sub.s   f6, f1, f6
sub.s   f8, f3, f7
recip0.s        f9, f8
const.s f10, 1
msub.s  f10, f8, f9
madd.s  f9, f9, f10
const.s f10, 1
msub.s  f10, f8, f9
maddn.s f9, f9, f10
mul.s   f5, f5, f9
sub.s   f8, f4, f3
sub.s   f3, f0, f3
recip0.s        f9, f8
const.s f10, 1
msub.s  f10, f8, f9
madd.s  f9, f9, f10
const.s f10, 1
msub.s  f10, f8, f9
maddn.s f9, f9, f10
msub.s  f5, f6, f9
sub.s   f7, f7, f4
sub.s   f4, f0, f4
recip0.s        f8, f7
const.s f10, 1
msub.s  f10, f7, f8
madd.s  f8, f8, f10
const.s f10, 1
msub.s  f10, f7, f8
maddn.s f8, f8, f10
mul.s   f0, f3, f5
mul.s   f0, f0, f8
madd.s  f0, f6, f9
madd.s  f1, f4, f0

There are no __divsf3() calls there and it executes 2½× faster! Also, there are no load/store for auto vars on stack, just lean and fast code! :)

It seems, that the linker will need more convincing to pick up the custom __divsf3(). But since the Xtensa libgcc.a is written in assembly, it may be solved by using the .weak directive for the __divsf3 symbol…

Looking forward to your reply!

Feb 21 '25 21:02 f4lc0n-asm

@f4lc0n-asm , sorry, the approach I posted works for esp32. For esp32s3, additionally, the line in components/esp_rom/esp32s3/ld/esp32s3.rom.libgcc.ld needs to be removed to make it work.

Yes, this approach has some overhead related to an additional function call. However, with this, you can achieve what you want without refactoring your code to use macros.

Feb 22 '25 03:02 Lapshin

Is libgcc implementation about precision but not speed?

That's correct. Quote from the ISA book: "The sequences for DIV.[SD] and SQRT.[SD] provide IEEE754 compliant results, in- cluding status flags, in all four round modes. ... The sequences for RECIP.[SD] and RSQRT.[SD] provide non-IEEE results that are within 1-ulp for RECIP.[SD] and 2-ulp for RSQRT.[SD] of a fully IEEE accurate result. They begin with RECIP0.[SD] or RSQRT0.[SD] and continue with just the Newton-Raphson steps. They do not have the extra instructions to provide the precisely rounded result and are therefore faster and take up less code space. One additional multiply converts these sequences into a fast divide or square root with somewhat less accuracy."

Feb 22 '25 06:02 jcmvbkbc

But since the Xtensa libgcc.a is written in assembly, it may be solved by using the .weak directive for the __divsf3 symbol…

.weak is not going to help in picking one definition in a situation where you have the symbol defined in multiple libraries. First library that provides required definition for the symbol will always win.

Feb 22 '25 06:02 jcmvbkbc

@Lapshin The esp32s3.rom.libgcc.ld fix worked.

I also tried to inline your function by prepending __attribute__((always_inline)) inline, but it was never inlined - see below:

__attribute__((always_inline)) inline
float __divsf3(float a, float b) {
    float result, temp;
    asm(
        "recip0.s %0, %2\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "madd.s %0, %0, %1\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "maddn.s %0, %0, %1\n"
	"mul.s %0, %3, %0\n"
	:"=&f"(result),"=&f"(temp):"f"(b),"f"(a)
    );
    return result;
}

What I am after is speed for my DSP algos, so for me the short inlined DIV() macro is much more preferable.

Could you add some switch to GCC, say -ffast-float-div and/or #pragma fast-float-div, which would use the inlined DIV() macro or inlined __divsf3() instead? That would be a great boon to DSP programmers indeed! :) 1-ulp reciprocal precision is more than enough for DSP stuff…

P.S.: I also saw an undocumented -ffast-math switch in Xtensa GCC help by issuing xtensa-esp32-elf-gcc --help -v > help.txt, but that did not speed up the float division for me.

Feb 22 '25 13:02 f4lc0n-asm

esp-adf esp-adf copied to clipboard

SOLVED: An ESP32(-S3) fast fp32 division using a reciprocal inline assembly with arg/result passsed in fp32 regs for even more speed! (AUD-5741)

esp-adf
esp-adf copied to clipboard