node icon indicating copy to clipboard operation
node copied to clipboard

build: use more aggressive LTO on gcc

Open anonrig opened this issue 1 year ago • 17 comments

Please validate locally before landing this pull-request. This is an area that I'm not familiar with.

Using more aggressive LTO, the startup time of Node.js can be reduced from 38ms to 28ms.

Before

➜  node git:(main) ✗ hyperfine 'out/Release/node index.js' --warmup 30
Benchmark 1: out/Release/node index.js
  Time (mean ± σ):      38.7 ms ±   0.8 ms    [User: 27.1 ms, System: 8.1 ms]
  Range (min … max):    37.9 ms …  42.2 ms    70 runs

After

➜  node git:(main) ✗ hyperfine 'out/Release/node index.js' --warmup 30
Benchmark 1: out/Release/node index.js
  Time (mean ± σ):      32.0 ms ±   2.7 ms    [User: 26.7 ms, System: 5.3 ms]
  Range (min … max):    28.5 ms …  42.3 ms    53 runs

anonrig avatar Dec 05 '24 00:12 anonrig

Review requested:

  • [ ] @nodejs/gyp

nodejs-github-bot avatar Dec 05 '24 00:12 nodejs-github-bot

cc @nodejs/build @nodejs/performance please take a look

anonrig avatar Dec 05 '24 00:12 anonrig

Just as a question, do we distribute binaries built w/ gcc?

Yes, I think so, but @nodejs/build can answer this better than me.

anonrig avatar Dec 05 '24 02:12 anonrig

Have you measured it in different environments?

Only macOS for now. I can measure it on Linux tomorrow, if someone doesn't beat me to it. Please run and share outputs from your machine regardless. Let's make sure the numbers are correct.

anonrig avatar Dec 05 '24 02:12 anonrig

Only macOS for now. I can measure it on Linux tomorrow, if someone doesn't beat me to it. Please run and share outputs from your machine regardless. Let's make sure the numbers are correct.

I can test linux in a couple hours

juanarbol avatar Dec 05 '24 02:12 juanarbol

Only macOS for now.

You build with gcc on macOS? 🧐

joyeecheung avatar Dec 05 '24 07:12 joyeecheung

I know nothing about LTO

targos avatar Dec 05 '24 08:12 targos

@joyeecheung The improvements on macOS probably come from 'enable_lto': 'true', not the gcc config changes.

targos avatar Dec 05 '24 08:12 targos

If -ffunctions and -fdata-sections is passed, should be not also --gc-sections passed to the linker?

ptr1337 avatar Dec 05 '24 09:12 ptr1337

So I tried building with configure --ninja --enable-lto, but V8's own mksnapshot never finishes linking....trying to debug that stuck process shows that it's hard at work actuallhy doing LTO, but in the CI of this PR, it seems to just finish everything on macOS in 42m, which makes me wonder if this PR actually manages to enable LTO..

Which now reminds me, we need to disable ICF for both node_mksnapshot and v8's mksnapshot, because otherwise the external references may not match after functions are folded and the program may crash. Which means porting this

https://github.com/nodejs/node/blob/5ab3140dfbf0dfa33a66a6af62942b4290f370c1/deps/v8/BUILD.gn#L7351-L7366

To the gyp configs.

(IIRC windows enables LTO + ICF and it seems to have no problem, but maybe I am reading it wrong, but anyway it can be risky not to disable it)

joyeecheung avatar Dec 05 '24 10:12 joyeecheung

Just as a question, do we distribute binaries built w/ gcc?

Yes :-)

sxa avatar Dec 05 '24 11:12 sxa

I tried building this PR locally on macOS, and AFAICT, it doesn't enable LTO.

❯ hyperfine "./node-c4aa34aa test/fixtures/empty.js" "./node-lto  test/fixtures/empty.js" --warmup 30
Benchmark 1: ./node-c4aa34aa test/fixtures/empty.js
  Time (mean ± σ):      32.4 ms ±   2.1 ms    [User: 29.4 ms, System: 2.3 ms]
  Range (min … max):    31.4 ms …  47.4 ms    87 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./node-lto  test/fixtures/empty.js
  Time (mean ± σ):      33.6 ms ±   2.3 ms    [User: 29.5 ms, System: 2.4 ms]
  Range (min … max):    32.4 ms …  52.3 ms    84 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  './node-c4aa34aa test/fixtures/empty.js' ran
    1.04 ± 0.10 times faster than './node-lto  test/fixtures/empty.js'

And the config.gypi generated by configure is the same as the main branch - enable_lto is still false in there, which overrides common.gypi, and as a result the command used to link the node executable is the same as the main branch.

ccache clang++ -Wl,-force_load,libnode.a -Wl,-force_load,libv8_base_without_compiler.a -Wl,-force_load,libzlib.a -Wl,-force_load,libuv.a -Wl,-force_load,libv8_snapshot.a -Wl,-force_load,libopenssl.a -Wl,-search_paths_first -mmacosx-version-min=11.0 -arch arm64 -L./ -stdlib=libc++ -o node obj/gen/node.node_snapshot.o obj/src/node.node_main.o libhistogram.a libnode.a libv8_snapshot.a libv8_libplatform.a libicui18n.a libzlib.a libllhttp.a libcares.a libuv.a libuvwasi.a libnghttp2.a libada.a libsimdjson.a libsimdutf.a libbrotli.a libsqlite.a libopenssl.a libngtcp2.a libnghttp3.a libnbytes.a libncrypto.a libicuucx.a libicudata.a libv8_base_without_compiler.a libv8_libbase.a libv8_abseil.a libv8_zlib.a libv8_compiler.a libv8_turboshaft.a libv8_initializers.a libv8_initializers_slow.a libzlib_inflate_chunk_simd.a libzlib_adler32_simd.a libzlib_arm_crc32.a  -framework CoreFoundation -lm

Whereas when I actually build the main branch with ./configure --ninja --enable-lto, it gets stuck forever when linking V8's mksnapshot executable doing a lot of LTO work. That command looks like this (from ps output, which I attached a debugger to and confirmed that it's actually doing LTO...)

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld -demangle -lto_library /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/libLTO.dylib -dynamic -arch arm64 -platform_version macos 11.0.0 13.0 -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -o mksnapshot -L./ -L/usr/local/lib -search_paths_first obj/deps/v8/src/snapshot/embedded/mksnapshot.embedded-empty.o obj/deps/v8/src/snapshot/embedded/mksnapshot.embedded-file-writer.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-aix.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-base.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-generic.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-mac.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-win.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-zos.o obj/deps/v8/src/snapshot/mksnapshot.mksnapshot.o obj/deps/v8/src/snapshot/mksnapshot.snapshot-empty.o obj/deps/v8/src/snapshot/mksnapshot.static-roots-gen.o libv8_base_without_compiler.a libv8_init.a libv8_libbase.a libv8_libplatform.a libv8_turboshaft.a libv8_abseil.a libicui18n.a libicuucx.a libicudata.a libv8_zlib.a libv8_compiler.a libv8_initializers.a libv8_initializers_slow.a -lc++ -lSystem /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin/libclang_rt.osx.a

joyeecheung avatar Dec 05 '24 11:12 joyeecheung

Just as a question, do we distribute binaries built w/ gcc?

https://github.com/nodejs/node/blob/main/BUILDING.md#official-binary-platforms-and-toolchains https://github.com/nodejs/node/blob/c4aa34aa4dc5accb100be07b4aceded83eeb3956/BUILDING.md?plain=1#L158-L190

richardlau avatar Dec 05 '24 13:12 richardlau

Re. LTO, we currently do not have that enabled by default on the official binaries. Anecdotally, I understand that downstream Fedora/RHEL Node.js had to turn off LTO due to the increased memory requirements/build times with it enabled.

There's also https://github.com/nodejs/node/pull/49063, which may help.

richardlau avatar Dec 05 '24 13:12 richardlau

So I gave it a bit more patience and managed to build the main branch with ./configure --ninja --enable-lto after waiting for 31 minutes since linking of V8's mksnapshot started. In comparison the same steps configured without lto takes just 5s.

Locally it doesn't look like LTO makes much of a difference on macOS despite waiting for more than half of an hour to build...

❯ hyperfine "./node-c4aa34aa test/fixtures/empty.js" "./node-lto-c4aa34aa test/fixtures/empty.js" --warmup 30
Benchmark 1: ./node-c4aa34aa test/fixtures/empty.js
  Time (mean ± σ):      32.8 ms ±   3.2 ms    [User: 29.5 ms, System: 2.5 ms]
  Range (min … max):    31.6 ms …  52.1 ms    87 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./node-lto-c4aa34aa test/fixtures/empty.js
  Time (mean ± σ):      31.3 ms ±   2.9 ms    [User: 27.1 ms, System: 2.4 ms]
  Range (min … max):    30.1 ms …  49.7 ms    91 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  './node-lto-c4aa34aa test/fixtures/empty.js' ran
    1.05 ± 0.14 times faster than './node-c4aa34aa test/fixtures/empty.js'

Also to my surprise, binary size is bigger after LTO?!

❯ ls -lah ./node-lto-c4aa34aa
-rwxr-xr-x  1 joyee  staff   122M Dec  5 16:14 ./node-lto-c4aa34aa
❯ ls -lah ./node-c4aa34aa
-rwxr-xr-x  1 joyee  staff   118M Dec  5 12:39 ./node-c4aa34aa

joyeecheung avatar Dec 05 '24 15:12 joyeecheung

I tried enabling PGO as well, and it didn't make a difference at all. Maybe we are missing something @joyeecheung

anonrig avatar Dec 05 '24 15:12 anonrig

If -ffunctions and -fdata-sections is passed, should be not also --gc-sections passed to the linker?

@ptr1337 I found some examples of --gc-sections not optimizing correctly. https://stackoverflow.com/questions/31521326/gc-sections-discards-used-data Is this not true in 2024/2025?

anonrig avatar Dec 05 '24 15:12 anonrig