crosstool-NG xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0

trafficstars

We are migrating from IDF 4.4.4 to 5.2.1 and among the many changes is the toolchgain update from GCC 8.4.0 (xtensa-esp32-elf-gcc8_4_0-esp-2021r2-patch5) to 13.2.0 (xtensa-esp-elf-13.2.0_20230928).

Unfortunately, it seems that it comes with an across the board regression in the output binary size - many functions gain size, resulting in overall binary size increase, most critically we are bumping into IRAM size limits on some of our apps.

By comparing code generated with the different toolchains, i identified the following problems:

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.
the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.
sometimes this code is duplicated where previous result could be reused.

On the other hand, there are positives too:

inlined loop-based memset: size-neutral and eliminates a call. neat!
better register allocation: using a10 avoids an extra move when it's used again as an argument to a function call.

This was after just a quick look at one particular function: spi_flash_chip_winbond_page_program, it gained 10 bytes. 4 bytes of those are accounted for by initialization of .flags, but even without that the function comes out 2 bytes bigger. Here are the notes from my analysis.

Apr 13 '24 01:04 rojer

ok, some of the movi /movi.n difference can be attributed to jump address alignment and the denisty can be improved with -mno-target-align. this still leaves one movi where movi.ncould be used - in the struct initialzation block. however, size reduction is significant: 8K in our case.

Apr 13 '24 19:04 rojer

@rojer , sorry for the long delay.

Recently binary size was fixed with linker script changes:

https://github.com/espressif/esp-idf/commit/9375348740b8a8e0ed2fa532ff3184da1f47478e
https://github.com/espressif/esp-idf/commit/2b36636f6fb11e37671d384055db47ca2e0446ed
https://github.com/espressif/esp-idf/commit/afffbd9dd078f767e6d7e03fe82cc197a7bcf709

Could you please check your project with these changes?

Jun 10 '24 18:06 Lapshin

well, those seem unrelated to the issues i reported here that are about assembly code generation. that said - thank you for letting me know, i will see if i can test with these changes soon.

Jun 10 '24 19:06 rojer

@rojer , sure, these changes are unrelated but that also reduces binary size significantly after the toolchain upgrade. I didn't have time to take a look at your points yet, will do this a bit later.

@jcmvbkbc , maybe you can answer them quickly?

Jun 11 '24 02:06 Lapshin

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.

This is not decided by the compiler alone, the assembler and linker will transform instructions between narrow and standard form to maintain alignment/fill the gaps created by relaxations. Meaningful comparison has to take a lot of factors into an account

the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.

I believe that it was meant to be that way with the patch https://gcc.gnu.org/git/gcc.git?h=cd02f15f1aecc45b2c2feae16840503549508619 I don't see a way to affect this compiler decision by any command line option. @jjsuwa-sys3175 Suwa-san, what do you think?

Jun 11 '24 08:06 jcmvbkbc

movi / mov.n : generates movi for small numbers in some cases, where mov.n could be used. 8.4 is doing it too, but 13 does it more often, it seems - at least in this particular function.

This is not decided by the compiler alone, the assembler and linker will transform instructions between narrow and standard form to maintain alignment/fill the gaps created by relaxations. Meaningful comparison has to take a lot of factors into an account

I agree so.

the new toolchain replaces certain literals with bit operations, for example 0xffffff is generated as: movi.n a9, -1 | srli a9, a9, 8. it saves on memory access and literal but uses 5 instruction bytes instead of 3.

and 4 bytes of litpool entry :)

I believe that it was meant to be that way with the patch https://gcc.gnu.org/git/gcc.git?h=cd02f15f1aecc45b2c2feae16840503549508619 I don't see a way to affect this compiler decision by any command line option. @jjsuwa-sys3175 Suwa-san, what do you think?

Maybe it would be better to have an option to control whether or not constant synthesis is performed individually.

Jun 11 '24 08:06 jjsuwa-sys3175

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

Jun 11 '24 08:06 rojer

This may not be an excuse, but for the ESP8266 Arduino core, these patches including the constant synthesis work well in terms of reducing size and clock cycles.

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

As it happens, I have a WIP patch that does exactly what you describe :)

The reason why that patch has been left in WIP for so long is that gcc's built-in litpool mechanism provides duplicate elimination but does not provide recording the number of duplicates and referencing them later. So, I have to create them myself, which is so ugly :-(

Jun 11 '24 08:06 jjsuwa-sys3175

@Lapshin

Could you please check your project with these changes?

we are using 5.2.1, and these were kind of difficult to apply after the big refactoring in https://github.com/espressif/esp-idf/commit/40be44f8274f4d160cd4bcd25e37a6d72141a929 but i tried to follow the spirit if not the letter

espressif/esp-idf@9375348

discarding eh_frame if exceptions are not enabled - nice, significant reduction of ~10K

espressif/esp-idf@2b36636

couldn't apply this to app, no visible effect on bootloader

espressif/esp-idf@afffbd9

~500 byte saved

Jun 11 '24 21:06 rojer

we are using 5.2.1

@rojer , you may consider switching to bugfix version 5.2.2 which has these two commits https://github.com/espressif/esp-idf/commit/1f3f65b40ea01b93aa4b48c72671c7d84b504766 , https://github.com/espressif/esp-idf/commit/dcf6b54e94021ad2901ad5fdc29263154d32b3d5

Regarding the third change - it's already backported to release/v5.2 branch (but changes are not synced with github yet). They seem to appear soon

no visible effect on bootloader

This would be a breaking change, will be added in v6.0

Jun 12 '24 08:06 Lapshin

@rojer , you could check https://github.com/espressif/esp-idf/commit/5320ec20f9302f260265b27d55ade1bc92b92ea9 in release/v5.2 branch

Jun 13 '24 04:06 Lapshin

@Lapshin I did a quick test with current release/v5.2 (https://github.com/espressif/esp-idf/commit/e5ffb3c57d59d8c94a4c662a8d0d76afa096bc13), and results are interesting:

Xtensa (ESP32) target binary size went down by almost 50K: 1632032 -> 1581616
Similar RISC-V (ESP32-C3) target went up by almost 12K: 1623536 -> 1635792

needless to say, i'm quite happy with (1) and not so much with (2) :) do you have an explanation for why this may be the case?

Jun 14 '24 15:06 rojer

@rojer , I also did not expect this T_T

You can analyze why it happens for your particular project using command idf_size.py --files --diff OLD_BUILD_MAP_FILE NEW_BUILD_MAP_FILE or with mapuche that has gui

Jun 15 '24 07:06 Lapshin

right, most of the movi/movi.n stuff was rectified with -mno-target-align - but not all.

and 4 bytes of litpool entry :)

sure, but 0xffffffff aka -1 is widely used and is shared and deduplicated. now, the really clever thing to do would be to use literals for constants that are used multiple times and use synthesis for those that aren't, in which case this would be a net reduction in size. however, the current situation is that this resulted in net gain in size, i maybe best to not apply this optimization in -Os mode?

Please see https://github.com/gcc-mirror/gcc/commit/23141088e8fb50bf916ac0b2e364b1eef9f3569d, I hope this is what you wanted :)

Jun 19 '24 09:06 jjsuwa-sys3175

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

Jun 19 '24 22:06 rojer

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

All I can say is: I am not a party to that :)

Jun 20 '24 04:06 jjsuwa-sys3175

yep, that looks like a nice optimization. the only question is: when will we see it in IDF?

@rojer , from the IDF side we are waiting for GCC 14.2 release with bugfixes to make the new release. It looks like this will be available from v5.4

@jjsuwa-sys3175 , thank you for giving the solution quickly!

Jun 20 '24 08:06 Lapshin

Sorry for this being completely off-topic, but I would like to mention the use of "-mextra-l32r-costs=" in situations where speed is more important than size.

Reading data from the instruction memory, such as L32R instruction, may require implementation-dependent additional clocks in addition to 1 clock execution for the instruction itself and 1 clock delay due to the pipeline until the read result is available.

For example, on the ESP8266, a delay of 4 or 5 clocks is observed for each L32R instructions that cannot be hidden by overlapping with other instructions.

Therefore, by compiling with "-O2 -mextra-l32r-costs=5", constant synthesis (avoiding the use of time-consuming L32Rs) can be performed more aggressively (the default value of "-mextra-l32r-costs=" is 0, so not specifying it means that there are no additional clocks for the execution of L32R).

Jun 20 '24 11:06 jjsuwa-sys3175

@jjsuwa-sys3175 thanks for the additional context, it's good to know. however, at least in our environment, we almost exclusively operate under the memory pressure, both RAM and flash, while cycles are essentially free or near free.

Jun 20 '24 11:06 rojer

@Lapshin

You can analyze why it happens for your particular project using command idf_size.py --files --diff OLD_BUILD_MAP_FILE NEW_BUILD_MAP_FILE

see attached diffs for ESP32 and ESP32-C3. what do you see? on C3 i see 10K+ of wireless bloat... sorry, innovation :) at the same time, ESP32 lost a bunch of fat.

diff_esp32.txt diff_esp32c3.txt

Jun 25 '24 21:06 rojer

also, looks like idf_size.py has a bug computing .rodata size - no way pthread.c uses 150K+ of rodata. it seems that the first entry gets bogus size.

Jun 25 '24 21:06 rojer

@rojer , sorry for the long delay. Latest news:

Toolchain with GCC 14.2.0 was released
- It has a change https://github.com/gcc-mirror/gcc/commit/23141088e8fb50bf916ac0b2e364b1eef9f3569d

Regarding .rodata size you can learn here (link):

The size of the .rodata section in the Flash Data memory type may appear very large for a single archive. This occurs due to linker relaxations. The linker may attempt to combine object file sections with MERGE and STRINGS flags from all archives into one to perform tail string optimization. Consequently, one archive may end up with a very large .rodata section, containing string literals from other archives. This is evident in the .rodata section of the libesp_app_format.a archive. The specific compiler behavior here can be turned off by enabling CONFIG_COMPILER_NO_MERGE_CONSTANTS option (only for GCC toolchain), please read help for more details.

I will take a look at your diff files soon

Oct 03 '24 13:10 Lapshin

@rojer , as I can see you already created issue https://github.com/espressif/esp-idf/issues/14076 for the part that has grown the most.

Sorry, but I can't help you here because I'm not a member of bluetooth/wifi team. But from the IDF perspective you still have some options to free some space. For example:

Disable CONFIG_ESP_CONSOLE_SECONDARY_USB_SERIAL_JTAG if you don't need this (save about 3200 bytes).
Disable CONFIG_VFS_SUPPORT_TERMIOS (save ~2000 bytes)
The full list you can learn on this page.
this page in case there is lack of free IRAM

Oct 07 '24 12:10 Lapshin

crosstool-NG crosstool-NG copied to clipboard

xtensa: Size optimization regression between GCC 8.4.0 and 13.2.0

crosstool-NG
crosstool-NG copied to clipboard