build: use more aggressive LTO on gcc
Please validate locally before landing this pull request; this is an area I'm not familiar with.
With more aggressive LTO, Node.js startup time on my machine drops from a mean of 38.7 ms to 32.0 ms (28.5 ms in the best case).
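For reference, a sketch of the kind of common.gypi change being proposed; this is a hedged illustration only, and the actual diff may use different flags:
# Sketch of the gcc-side change, not the PR's exact diff.
'cflags': [
  '-flto=auto',           # let gcc parallelize LTO across available cores
  '-ffunction-sections',  # place each function in its own section
  '-fdata-sections',      # likewise for each data item
],
'ldflags': [ '-flto=auto' ],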
Before
➜ node git:(main) ✗ hyperfine 'out/Release/node index.js' --warmup 30
Benchmark 1: out/Release/node index.js
Time (mean ± σ): 38.7 ms ± 0.8 ms [User: 27.1 ms, System: 8.1 ms]
Range (min … max): 37.9 ms … 42.2 ms 70 runs
After
➜ node git:(main) ✗ hyperfine 'out/Release/node index.js' --warmup 30
Benchmark 1: out/Release/node index.js
Time (mean ± σ): 32.0 ms ± 2.7 ms [User: 26.7 ms, System: 5.3 ms]
Range (min … max): 28.5 ms … 42.3 ms 53 runs
Review requested:
- [ ] @nodejs/gyp
cc @nodejs/build @nodejs/performance please take a look
Just as a question, do we distribute binaries built w/ gcc?
Yes, I think so, but @nodejs/build can answer this better than me.
Have you measured it in different environments?
Only macOS for now. I can measure it on Linux tomorrow unless someone beats me to it. Please run and share outputs from your machine regardless, so we can make sure the numbers are correct.
I can test Linux in a couple of hours.
Only macOS for now.
You build with gcc on macOS? 🧐
I know nothing about LTO
@joyeecheung The improvements on macOS probably come from 'enable_lto': 'true', not the gcc config changes.
If -ffunction-sections and -fdata-sections are passed, shouldn't --gc-sections also be passed to the linker?
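For context, the pairing being asked about would look roughly like this in gyp terms (a sketch, not this PR's actual diff):
'cflags': ['-ffunction-sections', '-fdata-sections'],  # split code/data into per-symbol sections
'ldflags': ['-Wl,--gc-sections'],  # GNU ld/gold/lld then drop unreferenced sections
# (the macOS ld64 equivalent is -Wl,-dead_strip)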
So I tried building with ./configure --ninja --enable-lto, but V8's own mksnapshot never finishes linking. Attaching a debugger to the stuck process shows that it's hard at work actually doing LTO. Yet the CI of this PR seems to finish everything on macOS in 42m, which makes me wonder whether this PR actually manages to enable LTO.
Which now reminds me: we need to disable ICF (identical code folding) for both node_mksnapshot and V8's mksnapshot, because otherwise the external references may not match after functions are folded and the program may crash. That means porting this
https://github.com/nodejs/node/blob/5ab3140dfbf0dfa33a66a6af62942b4290f370c1/deps/v8/BUILD.gn#L7351-L7366
to the gyp configs.
(IIRC Windows enables LTO + ICF and seems to have no problem with it, though maybe I am reading that wrong; either way, it can be risky not to disable it.)
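A rough sketch of how that GN logic might translate to the gyp configs; the placement and target wiring are assumptions, but the flags are the standard per-linker ICF switches:
# Sketch: disable identical code folding on the snapshot tools so that
# V8's external references still match after linking.
'conditions': [
  ['OS=="win"', {
    'msvs_settings': {
      'VCLinkerTool': { 'AdditionalOptions': ['/OPT:NOICF'] },
    },
  }],
  ['OS=="mac"', {
    'xcode_settings': { 'OTHER_LDFLAGS': ['-Wl,-no_deduplicate'] },
  }],
  ['OS=="linux"', {
    'ldflags': ['-Wl,--icf=none'],  # gold/lld only; BFD ld has no ICF pass
  }],
],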
Just as a question, do we distribute binaries built w/ gcc?
Yes :-)
I tried building this PR locally on macOS, and AFAICT, it doesn't enable LTO.
❯ hyperfine "./node-c4aa34aa test/fixtures/empty.js" "./node-lto test/fixtures/empty.js" --warmup 30
Benchmark 1: ./node-c4aa34aa test/fixtures/empty.js
Time (mean ± σ): 32.4 ms ± 2.1 ms [User: 29.4 ms, System: 2.3 ms]
Range (min … max): 31.4 ms … 47.4 ms 87 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: ./node-lto test/fixtures/empty.js
Time (mean ± σ): 33.6 ms ± 2.3 ms [User: 29.5 ms, System: 2.4 ms]
Range (min … max): 32.4 ms … 52.3 ms 84 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
'./node-c4aa34aa test/fixtures/empty.js' ran
1.04 ± 0.10 times faster than './node-lto test/fixtures/empty.js'
And the config.gypi generated by configure is the same as on the main branch: enable_lto is still false in there, which overrides common.gypi, so the command used to link the node executable is identical to the main branch's:
ccache clang++ -Wl,-force_load,libnode.a -Wl,-force_load,libv8_base_without_compiler.a -Wl,-force_load,libzlib.a -Wl,-force_load,libuv.a -Wl,-force_load,libv8_snapshot.a -Wl,-force_load,libopenssl.a -Wl,-search_paths_first -mmacosx-version-min=11.0 -arch arm64 -L./ -stdlib=libc++ -o node obj/gen/node.node_snapshot.o obj/src/node.node_main.o libhistogram.a libnode.a libv8_snapshot.a libv8_libplatform.a libicui18n.a libzlib.a libllhttp.a libcares.a libuv.a libuvwasi.a libnghttp2.a libada.a libsimdjson.a libsimdutf.a libbrotli.a libsqlite.a libopenssl.a libngtcp2.a libnghttp3.a libnbytes.a libncrypto.a libicuucx.a libicudata.a libv8_base_without_compiler.a libv8_libbase.a libv8_abseil.a libv8_zlib.a libv8_compiler.a libv8_turboshaft.a libv8_initializers.a libv8_initializers_slow.a libzlib_inflate_chunk_simd.a libzlib_adler32_simd.a libzlib_arm_crc32.a -framework CoreFoundation -lm
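For anyone reproducing this, config.gypi is the file to check; it is generated by ./configure, and its variables take precedence over common.gypi. A trimmed illustration of the relevant part (not the full file):
# config.gypi (generated by ./configure); variables here win over common.gypi.
{
  'variables': {
    'enable_lto': 'false',  # stays false unless configured with --enable-lto
  }
}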
Whereas when I actually build the main branch with ./configure --ninja --enable-lto, it gets stuck for a very long time linking V8's mksnapshot executable, doing a lot of LTO work. That link command looks like this (from ps output; I attached a debugger to the process and confirmed it is actually doing LTO):
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld -demangle -lto_library /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/libLTO.dylib -dynamic -arch arm64 -platform_version macos 11.0.0 13.0 -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -o mksnapshot -L./ -L/usr/local/lib -search_paths_first obj/deps/v8/src/snapshot/embedded/mksnapshot.embedded-empty.o obj/deps/v8/src/snapshot/embedded/mksnapshot.embedded-file-writer.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-aix.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-base.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-generic.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-mac.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-win.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-zos.o obj/deps/v8/src/snapshot/mksnapshot.mksnapshot.o obj/deps/v8/src/snapshot/mksnapshot.snapshot-empty.o obj/deps/v8/src/snapshot/mksnapshot.static-roots-gen.o libv8_base_without_compiler.a libv8_init.a libv8_libbase.a libv8_libplatform.a libv8_turboshaft.a libv8_abseil.a libicui18n.a libicuucx.a libicudata.a libv8_zlib.a libv8_compiler.a libv8_initializers.a libv8_initializers_slow.a -lc++ -lSystem /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin/libclang_rt.osx.a
Just as a question, do we distribute binaries built w/ gcc?
https://github.com/nodejs/node/blob/main/BUILDING.md#official-binary-platforms-and-toolchains https://github.com/nodejs/node/blob/c4aa34aa4dc5accb100be07b4aceded83eeb3956/BUILDING.md?plain=1#L158-L190
Re. LTO, we currently do not have that enabled by default on the official binaries. Anecdotally, I understand that downstream Fedora/RHEL Node.js had to turn off LTO due to the increased memory requirements/build times with it enabled.
There's also https://github.com/nodejs/node/pull/49063, which may help.
So I gave it a bit more patience and managed to build the main branch with ./configure --ninja --enable-lto, after waiting 31 minutes for V8's mksnapshot to link. For comparison, the same step takes just 5s when configured without LTO.
Locally, it doesn't look like LTO makes much of a difference on macOS, despite waiting more than half an hour for the build...
❯ hyperfine "./node-c4aa34aa test/fixtures/empty.js" "./node-lto-c4aa34aa test/fixtures/empty.js" --warmup 30
Benchmark 1: ./node-c4aa34aa test/fixtures/empty.js
Time (mean ± σ): 32.8 ms ± 3.2 ms [User: 29.5 ms, System: 2.5 ms]
Range (min … max): 31.6 ms … 52.1 ms 87 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: ./node-lto-c4aa34aa test/fixtures/empty.js
Time (mean ± σ): 31.3 ms ± 2.9 ms [User: 27.1 ms, System: 2.4 ms]
Range (min … max): 30.1 ms … 49.7 ms 91 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
'./node-lto-c4aa34aa test/fixtures/empty.js' ran
1.05 ± 0.14 times faster than './node-c4aa34aa test/fixtures/empty.js'
Also to my surprise, binary size is bigger after LTO?!
❯ ls -lah ./node-lto-c4aa34aa
-rwxr-xr-x 1 joyee staff 122M Dec 5 16:14 ./node-lto-c4aa34aa
❯ ls -lah ./node-c4aa34aa
-rwxr-xr-x 1 joyee staff 118M Dec 5 12:39 ./node-c4aa34aa
I tried enabling PGO as well, and it didn't make a difference at all. Maybe we are missing something, @joyeecheung?
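Regarding the 31-minute link: one untested idea, not something this PR does, is ThinLTO, which clang supports and which typically links much faster than full LTO:
# Untested suggestion for clang builds: ThinLTO instead of full LTO,
# which might make the mksnapshot link time bearable.
'cflags': ['-flto=thin'],
'ldflags': ['-flto=thin'],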
If -ffunction-sections and -fdata-sections are passed, shouldn't --gc-sections also be passed to the linker?
@ptr1337 I found some examples of --gc-sections discarding data that is actually used: https://stackoverflow.com/questions/31521326/gc-sections-discards-used-data Is this still a problem in 2024/2025?
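If we do go down the --gc-sections route, the usual escape hatches for that failure mode are __attribute__((used)) on symbols that are only referenced at runtime, or KEEP() in a linker script. A sketch of how we could also audit what gets dropped (GNU ld/gold only):
# Sketch: log every section the linker discards, so regressions like the
# one in that Stack Overflow report show up at build time.
'ldflags': ['-Wl,--gc-sections', '-Wl,--print-gc-sections'],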