[INFO] Code execution speed considerations for developers
I want to collect some info here about things I have learned while writing code for the ESP32 family MCUs. Please feel free to add to this.
This is a work in progress.
Comparison of basic operations on the CPU architectures
| Operation | ESP32@240MHz (MOPS/s) | S3@240MHz (MOPS/s) | S2@240MHz (MOPS/s) | C3@160MHz (MOPS/s) |
|---|---|---|---|---|
| Integer Addition | 237.76 | 237.99 | 182.17 | 127.08 |
| Integer Multiply | 237.05 | 238.06 | 182.17 | 120.74 |
| Integer Division | 118.94 | 119.03 | 101.29 | 4.63 |
| Integer Multiply-Add | 158.49 | 158.66 | 136.63 | 127.22 |
| 64bit Integer Addition | 19.50 | 20.81 | 18.11 | 36.82 |
| 64bit Integer Multiply | 27.55 | 30.22 | 27.79 | 15.50 |
| 64bit Integer Division | 2.71 | 2.71 | 2.65 | 1.02 |
| 64bit Integer Multiply-Add | 19.80 | 21.88 | 19.16 | 20.30 |
| Float Addition | 237.55 | 238.04 | 7.77 | 1.93 |
| Float Multiply | 237.69 | 237.97 | 4.14 | 1.24 |
| Float Division | 1.42 | 4.47 | 0.86 | 0.79 |
| Float Multiply-Add | 474.85 | 475.91 | 6.43 | 1.76 |
| Double Addition | 6.50 | 6.18 | 6.51 | 1.51 |
| Double Multiply | 2.23 | 2.37 | 2.23 | 0.70 |
| Double Division | 0.48 | 0.54 | 0.30 | 0.41 |
| Double Multiply-Add | 5.65 | 5.61 | 5.65 | 1.40 |
This table was generated using code from https://esp32.com/viewtopic.php?p=82090#
Even though the ESP32 and the S3 have hardware floating point units, they still do floating point division in software, so it should be avoided in speed-critical functions.
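A common workaround, sketched here (the function and constant names are just for illustration): do the division once at compile time and multiply by the reciprocal at runtime.

// minimal sketch: replace the slow runtime division by a multiplication with a precomputed reciprocal
float scale8bit(float x) {
  constexpr float inv255 = 1.0f / 255.0f;   // evaluated at compile time
  return x * inv255;                        // instead of x / 255.0f (software division)
}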
Edit (softhack007): "Float Multiply-Add" uses a special CPU instruction that combines addition and multiplication. It's generated by the compiler for expressions like a = a + b * C;
Why integer division on the C3 is so slow is unknown; the datasheet clearly states that it can do 32-bit integer division in hardware.
Bit shifts vs. division
Bit shifts are always faster than a division, as a shift is a single instruction. The compiler will replace divisions by bit shifts wherever possible, so var / 256 is equivalent to var >> 8 if var is unsigned. If var is a signed integer, the two are only equivalent if its value is positive and this is known to always be the case at compile time. The reason: -200/256 = 0 but -200>>8 = -1. So when using signed integers and a bit shift is possible, it is better to write it explicitly instead of leaving it to the compiler. (please correct me if I am wrong here)
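A small sketch of the difference (the values mirror the example above):

#include <cstdio>
int main() {
  unsigned u = 1000;
  printf("%u %u\n", u / 256, u >> 8);   // prints "3 3": for unsigned values shift and division agree
  int s = -200;
  printf("%d %d\n", s / 256, s >> 8);   // prints "0 -1": division truncates toward zero, arithmetic shift rounds down
  return 0;
}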
Fixed point vs. float
Using fixed-point math is less accurate, but for most operations it is accurate enough, and it runs much faster, especially when doing divisions.
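A minimal sketch of the idea, using 8 fractional bits (the scale factor of 256 and the function name are just for illustration):

#include <stdint.h>
int32_t applyGain(int32_t value) {
  const int32_t gain_fp = (int32_t)(1.4f * 256);  // ~1.4 with 8 fractional bits (= 358)
  return (value * gain_fp) >> 8;                  // multiply, then shift the fractional bits away again
}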
When doing mixed-math there is a pitfall: casting negative floats into unsigned integers is undefined and leads to problems on some CPUs. https://embeddeduse.com/2013/08/25/casting-a-negative-float-to-an-unsigned-int/
To avoid this problem, explicitly cast a float into int before assigning it to an unsigned integer.
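For example (a small sketch, function name is illustrative):

#include <stdint.h>
uint16_t toUnsigned(float f) {            // e.g. f = -3.7f
  // uint16_t u = (uint16_t)f;            // undefined behaviour for negative values
  return (uint16_t)(int)f;                // cast to int first (-3), then the conversion to unsigned is well-defined (65533)
}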
Modulo Operator: %
The modulo operator uses several instructions. A modulo by a power of two, 2^i, can be replaced with a bitwise AND (&), which is a single instruction. The rule is n % 2^i = n & (2^i - 1). For example n % 2048 = n & 2047. (This holds for unsigned values; for negative signed values % and & give different results.)
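As a sketch, the typical use case is wrapping an index in a power-of-two-sized buffer (BUF_SIZE and nextIndex are illustrative names):

constexpr unsigned BUF_SIZE = 2048;                // 2^11
unsigned nextIndex(unsigned i) {
  return (i + 1) & (BUF_SIZE - 1);                 // same result as (i + 1) % BUF_SIZE, but a single AND instruction
}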
@DedeHai I was wondering if the tables from https://esp32.com/viewtopic.php?p=82090# are still correct, especially for the float multiply vs. float divide. The table comes from a time when FPU support for esp32 was broken. https://github.com/espressif/esp-idf/issues/96
It seems correct that "float divide" is a lot slower than multiply by inverse, and I think (please correct me) the compiler can generate this optimization automatically. However, the difference today should be like "8-10 times slower" but not a factor of almost 100x.
EDIT: there was a PR for esp-idf that corrected usage of FPU instructions in esp-idf v4. Maybe it would be useful to add a column to the table, for comparing "esp32 esp-idf v3.x" vs. "esp32 esp-idf v4.x"
https://github.com/espressif/esp-idf/commit/db6a30b446f10352fd1e2f2af2fdc814ae266f55
There is an additional thing worth mentioning:
floating point "literals"
According to C++ semantics, an expression like "if ( x > 1.0)" (with float x) is first "promoted" to double before evaluation, which makes it SLOW. This can be avoided:
- by appending "f" to the literal: `if ( x > 1.0)` --> `if ( x > 1.0f)`, or
- by casting constants to float: `x += M_PI` --> `x += float(M_PI)`, or
- by creating your constants with "constexpr float" instead of "#define": `#define MY_LIMIT 3.14` --> `constexpr float MY_LIMIT = 3.14;` (notice that appending "f" is not needed here).
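Putting these together in a small sketch (MY_LIMIT and clampAndAdvance are illustrative names):

#include <cmath>
constexpr float MY_LIMIT = 3.14;        // the double literal is converted to float at compile time
float clampAndAdvance(float x) {
  if (x > 1.0f) x = MY_LIMIT;           // "f" suffix keeps the comparison in single precision
  x += float(M_PI);                     // explicit cast avoids promoting the addition to double
  return x;
}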
You can check the code for such "double promotions" by adding -Wdouble-promotion to build_flags
https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wdouble-promotion
use constexpr
Using constexpr is a nice way to optimize without obfuscating the code too much.
In contrast to const, which is often computed at runtime, constexpr is guaranteed to be evaluated by the compiler, so the calculation will never be part of the binary.
https://en.cppreference.com/w/cpp/language/constexpr
examples
#include <cstdio>
void example() {
  constexpr float f = 23.0f;
  constexpr float g = 33.0f;
  constexpr float h = f / g; // is computed by the compiler, so it needs ZERO cycles at runtime
  printf("%f\n", h);
}
You can even create functions that are constexpr
// C++11 constexpr functions use recursion rather than iteration
constexpr int factorial(int n)
{
return n <= 1 ? 1 : (n * factorial(n - 1));
}
static constexpr unsigned getPaletteCount() { return 13 + GRADIENT_PALETTE_COUNT; } // zero-cost "getter" function
...and the classical one:
avoid 8bit and 16bit integers for local variables
uint8_t, int16_t and friends are useful to save RAM for global data, however - in contrast to older 8bit Arduino processors like AVR - these types are slower than the native types int or unsigned.
Update: If 8bit math (with roll-over on 255) is needed, 8bit types should be used - it's still faster than manually checking and adjusting 8bit overflows.
The reason is that esp32 processors have 32bit registers and 32bit instructions, so any calculation on uint8_t requires some extra effort to correctly emulate 8bit (especially for overflow). Technically uint8_t c = a + b; becomes something like uint8_t c = ((a & 0xFF) + (b & 0xFF)) & 0xFF;. And it's even more complicated for signed int8_t...
for more info: https://en.cppreference.com/w/cpp/types/integer
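A small sketch contrasting the two cases (function names are just for illustration):

#include <stdint.h>
uint32_t sumBytes(const uint8_t* data, unsigned len) {
  uint32_t sum = 0;                                // native-width locals: no extra masking instructions
  for (unsigned i = 0; i < len; i++) sum += data[i];
  return sum;
}
uint8_t advanceHue(uint8_t hue, uint8_t step) {
  return hue + step;                               // roll-over at 255 is wanted here, so uint8_t is the right choice
}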
> ...and the classical one:
> avoid 8bit and 16bit integers for local variables

This one is tricky: the advice only works if the code does not rely on overflows, which it does in WLED.
> I was wondering if the tables from https://esp32.com/viewtopic.php?p=82090# are still correct.

They are for current WLED; I generated this yesterday by inserting the code into 0.15. I can add IDF 4 once we move there.
The 8bit/16bit topic is a bit more elaborate. In general what you write is true, but the CPU has 8bit/16bit instructions too. So yes, avoid 8bit types, but manually checking and adjusting overflows is slower. So if 8bit math is needed, 8bit types should be used.
> I can add IDF 4 once we move there
I'm really curious to see the numbers for the newer V4 framework 😀 . But yeah, it won't be better than -S3 results.
You could use the esp32_wrover buildenv for measuring - I think it will also work with an esp32 that does not have PSRAM.
https://github.com/Aircoookie/WLED/blob/e9d2182390d43d7dd25492f6555d082280e79b3b/platformio.ini#L481
> I'm really curious to see the numbers for the newer V4 framework 😀 . But yeah, it won't be better than -S3 results.
I ran the code again on V4 builds; it's all the same for S3, C3, S2, but on the ESP32 Float Division seems to have improved to the same as the S3. Could also be a glitch in my previous test.
copied from #4798

PSRAM Speed

I ran a simple test on an S3 that compares access to a DRAM buffer with access to PSRAM. Read access is done like this:
for (size_t i = 0; i < count; i++) {
sum += mem[indices[i]];
}
Write access is done like this:
for (size_t i = 0; i < count; i++) {
mem[indices[i]] = i;
}
The first access to PSRAM in the test is always slower, most likely because that is when the PSRAM buffer gets cached. Result: buffers up to ~32k will be (mostly) in cache and any access is quite fast (this may differ on the ESP32). How this changes when using multiple buffers in PSRAM I did not test; it all depends on how clever the caching is. Partial random access is done in blocks of ~5% of the buffer size, randomly distributed.
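For reference, a minimal sketch of how such a measurement could be set up, assuming ESP-IDF's heap_caps_malloc() and esp_timer_get_time(); variable names mirror the fragments above:

#include <stdint.h>
#include <stddef.h>
#include "esp_heap_caps.h"
#include "esp_timer.h"

// allocate one buffer in internal DRAM and one in PSRAM, e.g.:
// uint8_t *dram  = (uint8_t *)heap_caps_malloc(count, MALLOC_CAP_INTERNAL | MALLOC_CAP_8BIT);
// uint8_t *psram = (uint8_t *)heap_caps_malloc(count, MALLOC_CAP_SPIRAM);

int64_t timedRead(const uint8_t *mem, const size_t *indices, size_t count, volatile uint32_t *result) {
  uint32_t sum = 0;
  int64_t t0 = esp_timer_get_time();               // microseconds
  for (size_t i = 0; i < count; i++) sum += mem[indices[i]];
  *result = sum;                                   // keep the sum alive so the loop is not optimized away
  return esp_timer_get_time() - t0;
}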
Memory access speed test (6144 elements, ~24 KB)
=== Partial Random Access ===
DRAM : Write 353 us, Read 411 us
PSRAM : Write 506 us, Read 411 us
PSRAM vs DRAM Write: 143.3% -> note: this is slower due to initial caching of PSRAM buffer (first access in test sequence)
PSRAM vs DRAM Read : 100.0%
=== Random Access ===
DRAM : Write 344 us, Read 418 us
PSRAM : Write 337 us, Read 411 us
PSRAM vs DRAM Write: 98.0%
PSRAM vs DRAM Read : 98.3%
=== Sequential Access ===
DRAM : Write 343 us, Read 420 us
PSRAM : Write 345 us, Read 411 us
PSRAM vs DRAM Write: 100.6%
PSRAM vs DRAM Read : 97.9%
Memory access speed test (8192 elements, ~32 KB)
=== Partial Random Access ===
DRAM : Write 453 us, Read 554 us
PSRAM : Write 836 us, Read 562 us
PSRAM vs DRAM Write: 184.5% -> note: this is slower due to initial caching of PSRAM buffer (first access in test sequence)
PSRAM vs DRAM Read : 101.4%
=== Random Access ===
DRAM : Write 460 us, Read 554 us
PSRAM : Write 534 us, Read 572 us
PSRAM vs DRAM Write: 116.1%
PSRAM vs DRAM Read : 103.2%
=== Sequential Access ===
DRAM : Write 463 us, Read 547 us
PSRAM : Write 529 us, Read 575 us
PSRAM vs DRAM Write: 114.3%
PSRAM vs DRAM Read : 105.1%
From 32k buffer size upward, PSRAM starts getting slower due to the cache size of about 32k; this hits random access hard.
Memory access speed test (10240 elements, ~40 KB)
=== Partial Random Access ===
DRAM : Write 572 us, Read 691 us
PSRAM : Write 1747 us, Read 2905 us
PSRAM vs DRAM Write: 305.4%
PSRAM vs DRAM Read : 420.4%
=== Random Access ===
DRAM : Write 573 us, Read 691 us
PSRAM : Write 4911 us, Read 3888 us
PSRAM vs DRAM Write: 857.1%
PSRAM vs DRAM Read : 562.7%
=== Sequential Access ===
DRAM : Write 565 us, Read 699 us
PSRAM : Write 1414 us, Read 2498 us
PSRAM vs DRAM Write: 250.3%
PSRAM vs DRAM Read : 357.4%
Memory access speed test (16384 elements, ~64 KB)
=== Partial Random Access ===
DRAM : Write 904 us, Read 1101 us
PSRAM : Write 4049 us, Read 4189 us
PSRAM vs DRAM Write: 447.9%
PSRAM vs DRAM Read : 380.5%
=== Random Access ===
DRAM : Write 906 us, Read 1102 us
PSRAM : Write 18906 us, Read 11410 us
PSRAM vs DRAM Write: 2086.8%
PSRAM vs DRAM Read : 1035.4%
=== Sequential Access ===
DRAM : Write 904 us, Read 1101 us
PSRAM : Write 3171 us, Read 3521 us
PSRAM vs DRAM Write: 350.8%
PSRAM vs DRAM Read : 319.8%