WASM binary size with -sMAIN_MODULE 7 to 9 times heavier

Open mho22 opened this issue 2 months ago • 7 comments

Based on issue #23683.

Version of emscripten/emsdk:

emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 4.0.5 (53b38d0c6f9fce1b62c55a8012bc6477f7a42711)
clang version 21.0.0git (https:/github.com/llvm/llvm-project 553da9634dc4bae215e6c850d2de3186d09f9da5)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /root/emsdk/upstream/bin

I ran into a similar problem when including a heavy 46 MB file instead of using a char array. I will still use the char array sample to compare the two results.

Former sample program from previous issue:

#include <stdio.h>

char test[1024*1024*50];

int main(void)
{
    puts(test);
    return 0;
}

Compiler command line and results:

emcc -o test.wasm -O3 -sMAIN_MODULE=0 -sTOTAL_MEMORY=200MB test.c
-rw-r--r-- 1 root root  171 Oct  3 06:18 test.c
-rwxr-xr-x 1 root root 2.0K Oct  3 06:18 test.wasm
emcc -o test.wasm -O3 -sMAIN_MODULE=1 -sTOTAL_MEMORY=200MB test.c
-rw-r--r-- 1 root root  171 Oct  3 06:18 test.c
-rwxr-xr-x 1 root root 1.6M Oct  3 06:19 test.wasm
emcc -o test.wasm -O3 -sMAIN_MODULE=2 -sTOTAL_MEMORY=200MB test.c
-rw-r--r-- 1 root root  171 Oct  3 06:18 test.c
-rwxr-xr-x 1 root root 8.3K Oct  3 06:20 test.wasm

Here -sMAIN_MODULE=2 reduces the size significantly compared with -sMAIN_MODULE=1, even though the result is still about 4 times larger than with static linking. This is expected.

However, I tried something different for my issue:

New sample program:

#include <stdio.h>
#include "data_file.c"

int main(void)
{
    puts((const char *)php_magic_database);
    return 0;
}

The data_file.c file is the PHP fileinfo extension's libmagic database, php_magic_database: https://github.com/php/php-src/blob/PHP-8.3.25/ext/fileinfo/data_file.c. It is 46 MB.

Compiler command line and results:

emcc -o test.wasm -O3 -sMAIN_MODULE=0 -sTOTAL_MEMORY=200MB test.c
-rw-r--r-- 1 root root  46M Oct  3 06:35 data_file.c
-rw-r--r-- 1 root root  168 Oct  3 06:37 test.c
-rwxr-xr-x 1 root root 1.2M Oct  3 06:37 test.wasm
emcc -o test.wasm -O3 -sMAIN_MODULE=1 -sTOTAL_MEMORY=200MB test.c
-rw-r--r-- 1 root root  46M Oct  3 06:35 data_file.c
-rw-r--r-- 1 root root  168 Oct  3 06:37 test.c
-rwxr-xr-x 1 root root 9.2M Oct  3 06:37 test.wasm
emcc -o test.wasm -O3 -sMAIN_MODULE=2 -sTOTAL_MEMORY=200MB test.c
-rw-r--r-- 1 root root  46M Oct  3 06:35 data_file.c
-rw-r--r-- 1 root root  168 Oct  3 06:37 test.c
-rwxr-xr-x 1 root root 7.6M Oct  3 06:41 test.wasm

While the first experiment gave 2 KB, then 1.6 MB, then 8 KB, this one gives 1.2 MB, 9.2 MB and 7.6 MB. Is there a way to get close to 1.2 MB? If not, is it possible to statically link that specific file, since I know it won't be used by other modules?

I need to build with MAIN_MODULE because other SIDE_MODULEs will be loaded at runtime. But I would also like heavy files such as data_file.c to be reduced significantly, since that file is only used in my MAIN_MODULE.

Do you know a way to achieve this?

mho22 avatar Oct 03 '25 06:10 mho22

@sbc100 @kripken Is this normal behavior, with no way to shrink data the way the static build does when MAIN_MODULE is set, or should I investigate further?

mho22 avatar Oct 09 '25 08:10 mho22

I'm currently working on a change that will make the main module no longer relocatable: https://github.com/emscripten-core/emscripten/pull/25522.

If the code size difference you are seeing is coming from the relocation functions then this will hopefully remove a lot of this overhead.

Can you see where the main differences between the 3 builds above are coming from? Is it the data section or the code section? If it's the code section, which function is it? In particular, is it coming from the linker-generated relocation code?

sbc100 avatar Oct 09 '25 17:10 sbc100

@sbc100 The difference comes from the data section. In the MAIN_MODULE=0 version, the data section is a list of shrunk data segments:

 (data $1 (i32.const 238056) "\fe\03\00 ") ( at most 250 KB long )
 (data $2 (i32.const 238073) "@\00 ")
 (data $3 (i32.const 238088) "\0c\00\00 ")

While in the MAIN_MODULE=2 version there is only one unshrunk data segment:

 (data $0 (global.get $gimport$0) (26.3MB long )

So nothing related to functions. If I empty the php_magic_database variable from the data_file.c file, it will drastically decrease the size of the wasm file.

mho22 avatar Oct 10 '25 10:10 mho22

@sbc100 @kripken Is this the expected behavior? That data can’t be statically shrunk when MAIN_MODULE is set or should I look into it further?

mho22 avatar Oct 24 '25 10:10 mho22

I'd have to look into the specifics, which I have not done yet, but in general I would hope that MAIN_MODULE=2 could do as good a job as MAIN_MODULE=0 at eliminating unused data. I'm guessing that the symbol pointing to the data is being kept alive for some reason. Maybe it's being exported?

You could try adding -Wl,--trace-symbol=php_magic_database to your link command which will tell you why / when that particular symbol is included by the linker.

sbc100 avatar Oct 24 '25 16:10 sbc100

@sbc100 I tried your pull request and it worked like a charm! The test wasm file is now 1.5 MB instead of 9.9 MB. So MAIN_MODULE=2 now eliminates unused data just as MAIN_MODULE=0 does, as it should.

mho22 avatar Oct 25 '25 18:10 mho22

Wow, that's great news. That shows that https://github.com/emscripten-core/emscripten/pull/25522 is working as intended!

sbc100 avatar Oct 25 '25 21:10 sbc100