mold
No perf benefit with Bazel
We are not seeing any real perf benefit from Mold over lld when using Bazel. Our builds are done on 16 core boxes and take about 15 minutes to complete with Bazel.
Running outside of Bazel, we were indeed seeing massive perf benefits. Is it that the parallel nature of Bazel itself does away with the parallelism benefits of Mold? Or is there a special way to use Mold in Bazel that could speed up perf?
By 15 minutes, you mean you did a fresh build right? What is the perf when you build the project, edit a single source file and rebuild it?
By 15 minutes, you mean you did a fresh build right?
yes that was for a fresh build.
What is the perf when you build the project, edit a single source file and rebuild it?
Tried it. The time was exactly the same with and without mold: 17 seconds. When I make the small change, I see only 2 cores running at 100% during the subsequent bazel build, whether I use mold or not.
what should i be expecting to be different in this case?
First I'd verify that I'm really using mold for sure. Please run readelf -p .comment <your-executable-file> and see the output. Is there a string "mold" there?
can verify output is indeed mold. here is the output using mold linker:
pdeva@code-pdeva:~/code/monorepo$ readelf -p .comment bazel-out/k8-fastbuild/bin/services/glutton/src/bin/bin
String dump of section '.comment':
[ 0] mold 1.8.0 (a49a201695edd294ed4d97231c9dc5a994275dd2; compatible with GNU ld)
[ 4f] clang version 14.0.0
Here it is with LLD:
pdeva@code-pdeva:~/code/monorepo$ readelf -p .comment bazel-out/k8-fastbuild/bin/services/glutton/src/bin/bin
String dump of section '.comment':
[ 0] clang version 14.0.0
[ 15] Linker: LLD 14.0.0
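The check above can be automated. Here is a minimal Python sketch, assuming `readelf` is on the PATH; the function names `linker_from_comment` and `linker_of` are made up for illustration. It only looks for the two markers that appear in the dumps above ("mold " and "Linker: LLD"); anything else is reported as "unknown".

```python
import re
import subprocess


def linker_from_comment(dump: str) -> str:
    """Guess the linker from the text printed by `readelf -p .comment`."""
    # readelf prints each string as "[ offset] text"; strip that prefix.
    strings = [
        re.sub(r"^\s*\[\s*[0-9a-fA-F]+\]\s*", "", line)
        for line in dump.splitlines()
        if line.lstrip().startswith("[")
    ]
    for s in strings:
        if s.startswith("mold "):
            return "mold"
        if s.startswith("Linker: LLD"):
            return "lld"
    # No marker found: some other linker, or the section was stripped.
    return "unknown"


def linker_of(path: str) -> str:
    """Run readelf on an executable and classify its linker (hypothetical helper)."""
    out = subprocess.run(
        ["readelf", "-p", ".comment", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return linker_from_comment(out)
```

Calling `linker_of("bazel-out/k8-fastbuild/bin/services/glutton/src/bin/bin")` after each build would confirm which linker Bazel actually used.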
tried it. the time was exactly the same with and without mold - 17 seconds. when i make the small change I see only 2 cores running at 100% during the subsequent bazel build whether i used mold or not.
In my (limited) testing, I also noticed that lld and mold run at roughly the same speed when using 2 threads (see this message for benchmark results). Only when using more threads does mold start to be significantly faster than lld. This is also roughly what one of the lld developers found.
Now that you have confirmed that your bazel invocation indeed links using mold, I would try to find out how exactly Bazel invokes the linker. It might somehow limit the linker's thread count to 2. Maybe you could run Bazel in verbose mode (assuming it supports that). Or you could use strace -f (I've successfully done that with ninja).
@pdeva did you ever find out what the issue was?
My company has a very large repository running Bazel. We currently use LLD within LLVM 10.0.0. Here are my benchmarks comparing to Mold 2.31.0. In the best cases, mold was only slightly faster than LLD, but not until I forced it to parallelize with --threads. Since we'd hoped to gain speedup in linking our thousands of unit test binaries, but didn't find that here, we decided against the transition.
Bazel link all C unity tests (965 binaries)
- Collect list of UT: bazel query 'kind(cc_unity_test, //...) except attr(tags, "manual", //...)' > unity_targets.txt
- Build once to populate cache: bazel build $(cat unity_targets.txt) --no-remote-exec
- Modify linkopt to break link cache and rebuild, log this time. Repeat 3x.
Built with --jobs=1
MOLD
- Elapsed: 108.797, Critical Path: 0.297
- Elapsed: 84.788, Critical Path: 0.247 (-Wl,--threads=8)
LLD
- Elapsed: 101.467, Critical Path: 0.293
Built parallelized
MOLD
- Elapsed: 15.948, Critical Path: 1.230
- Elapsed: 10.715, Critical Path: 0.893 (-Wl,--threads=8)
LLD
- Elapsed: 10.774, Critical Path: 0.837
Bazel link all C++ google tests (592 binaries)
- Collect list of UT: bazel query 'kind(cc_google_test, //...) except attr(tags, "manual", //...)' > google_targets.txt
- Build once to populate cache: bazel build $(cat google_targets.txt) --no-remote-exec
- Modify linkopt to break link cache and rebuild, log this time. Repeat 3x, average.
Built with --jobs=1
MOLD
- Elapsed: 81.363, Critical Path: 0.337
- Elapsed: 70.159, Critical Path: 0.307 (-Wl,--threads=8)
LLD
- Elapsed: 104.635, Critical Path: 0.613
Built parallelized
MOLD
- Elapsed: 23.717, Critical Path: 3.047
- Elapsed: 14.082, Critical Path: 1.773 (-Wl,--threads=8)
LLD
- Elapsed: 10.808, Critical Path: 1.316
Bazel link mixed C/C++ product binaries (68 binaries)
- Collect list of binaries: bazel query 'kind(cc_configured_binary, //...) except attr(tags, "manual", //...)' | grep x86 > binary_targets.txt
- Build once to populate cache: bazel build $(cat binary_targets.txt) --no-remote-exec
- Modify linkopt to break link cache and rebuild, log this time. Repeat 3x, average.
Built with --jobs=1
MOLD
- Elapsed: 23.605, Critical Path: 0.377
- Elapsed: 22.902, Critical Path: 0.340 (-Wl,--threads=8)
LLD
- Elapsed: 28.161, Critical Path: 0.607
Built parallelized
MOLD
- Elapsed: 6.270, Critical Path: 2.180
- Elapsed: 4.464, Critical Path: 1.187 (-Wl,--threads=8)
LLD
- Elapsed: 4.533, Critical Path: 1.320
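To make the three benchmark sets easier to compare, the elapsed times can be turned into speedup ratios relative to LLD. The sketch below is illustrative only: the numbers are copied from the lists above, while the dictionary layout and the name `speedup_over_lld` are my own. A ratio above 1 means the mold configuration finished faster than LLD.

```python
# Elapsed seconds copied from the benchmark lists above.
# Key: (suite, build mode); value: elapsed time per linker configuration.
RESULTS = {
    ("unity tests", "--jobs=1"): {"mold": 108.797, "mold --threads=8": 84.788, "lld": 101.467},
    ("unity tests", "parallel"): {"mold": 15.948, "mold --threads=8": 10.715, "lld": 10.774},
    ("google tests", "--jobs=1"): {"mold": 81.363, "mold --threads=8": 70.159, "lld": 104.635},
    ("google tests", "parallel"): {"mold": 23.717, "mold --threads=8": 14.082, "lld": 10.808},
    ("product binaries", "--jobs=1"): {"mold": 23.605, "mold --threads=8": 22.902, "lld": 28.161},
    ("product binaries", "parallel"): {"mold": 6.270, "mold --threads=8": 4.464, "lld": 4.533},
}


def speedup_over_lld(times):
    """Return lld_elapsed / elapsed for every non-LLD configuration."""
    return {name: times["lld"] / t for name, t in times.items() if name != "lld"}


for (suite, mode), times in RESULTS.items():
    ratios = speedup_over_lld(times)
    cells = ", ".join(f"{name}: {r:.2f}x" for name, r in ratios.items())
    print(f"{suite:17s} {mode:9s} {cells}")
```

Read this way, mold's clearest wins are in the --jobs=1 runs of the google tests and product binaries, while in the parallelized runs it lands at or below parity with LLD.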
@rdeushane Thank you for sharing the benchmark result! It's unfortunate that mold didn't make a significant difference. There are a few random observations:
- mold by default spawns as many threads as the number of cores, so it is odd that passing --threads=8 makes a difference. Maybe bazel by default passes --no-threads to the linker?
- In general, mold makes a big difference when creating a large binary. If you are creating hundreds of small binaries, other overhead such as process startup, etc. becomes dominant.
@rui314
In general, mold makes a big difference when creating a large binary. If you are creating hundreds of small binaries, other overhead such as process startup, etc. becomes dominant.
That makes sense, the tests being linked are all unit tests, hence pretty small executable size. Our product binaries tend to be on the small side as well, all well below 50-100MB.
mold by default spawns as many threads as the number of cores, so it is odd that passing --threads=8 makes a difference. Maybe bazel by default passes --no-threads to the linker?
I confirmed that Bazel doesn't tamper with any of the threading requests in our links. I think what might be telling here is that in most of the cases where I'm building with --jobs=1, as in no parallel link actions happening simultaneously, mold begins to pull ahead even before I tamper with --threads.
I think in situations where we're linking single massive binaries, i.e. google chrome, there would be significant benefits as you mentioned. But when we're linking thousands of small binaries, and all of those binary link actions are being parallelized by the build system already, the benefits of a linker with more efficient threading abilities are reduced since we're near saturating the system as-is, so linker invocations may benefit from threading less than they would otherwise.
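The saturation argument can be put into rough numbers. This is a back-of-the-envelope sketch, not a measurement: it assumes a hypothetical 16-core machine, that Bazel runs up to one link action per core, and that each linker process spawns one thread per core unless told otherwise. The function name `oversubscription` is made up for illustration.

```python
def oversubscription(parallel_links: int, threads_per_link: int, cores: int) -> float:
    """Runnable linker threads per core; values above 1.0 mean threads compete for cores."""
    return (parallel_links * threads_per_link) / cores


CORES = 16  # assumption: not stated for the machine used in the benchmarks above

# One link action at a time (--jobs=1): the linker's own threads just fill the machine.
print(oversubscription(parallel_links=1, threads_per_link=CORES, cores=CORES))

# One link action per core, each linker still using one thread per core.
print(oversubscription(parallel_links=CORES, threads_per_link=CORES, cores=CORES))

# Same build-level parallelism, but each linker capped at 8 threads.
print(oversubscription(parallel_links=CORES, threads_per_link=8, cores=CORES))
```

Under these assumptions the build system alone already keeps every core busy, so extra threads inside each linker mostly add contention. That would be consistent with mold pulling ahead under --jobs=1 but not in the parallelized runs.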
@rdeushane unless you have a custom rule, tests are linked dynamically in Bazel AFAIK
@dieortin We do have all custom test rules, here's a sample link line:
-o
bazel-out/k8-fastbuild/bin/components/command_manager/posix/app/uds/uds_client/test/test
-Wl,-S
bazel-out/k8-fastbuild/bin/components/command_manager/posix/app/uds/uds_client/test/_objs/test/test_uds_client.pic.o
(about 200 "*.a" files here)
-lc++
-lc++abi
-lm
-Wl,--build-id=md5
-Wl,--hash-style=gnu
-Wl,-z,relro
-Wl,-z,now
-Wl,--enable-new-dtags
-pthread
-lpthread
-Wl,--gc-sections
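One way to double-check that a link line like the one above does not restrict linker threading is to scan it for thread-related options. The sketch below is a hedged illustration: it recognizes the --threads, --no-threads, --threads=N and --thread-count=N spellings, either bare or wrapped in -Wl, and the function name `linker_thread_flags` is hypothetical. Note that -pthread and -lpthread select the pthread library and are unrelated to how many threads the linker itself uses.

```python
import re

# Linker threading options: --threads, --no-threads, --threads=N, --thread-count=N.
THREAD_FLAG = re.compile(r"^--(no-threads|threads(=\d+)?|thread-count=\d+)$")


def linker_thread_flags(argv):
    """Return the linker threading options found in a compiler-driver link line."""
    found = []
    for arg in argv:
        if arg.startswith("-Wl,"):
            # "-Wl,a,b" forwards "a" and "b" to the linker as separate options.
            candidates = arg[len("-Wl,"):].split(",")
        else:
            candidates = [arg]
        found.extend(c for c in candidates if THREAD_FLAG.match(c))
    return found


# Flags taken from the sample link line above (object and archive paths omitted).
sample = [
    "-Wl,-S", "-lc++", "-lc++abi", "-lm",
    "-Wl,--build-id=md5", "-Wl,--hash-style=gnu",
    "-Wl,-z,relro", "-Wl,-z,now", "-Wl,--enable-new-dtags",
    "-pthread", "-lpthread", "-Wl,--gc-sections",
]
print(linker_thread_flags(sample))                        # no threading options here
print(linker_thread_flags(sample + ["-Wl,--threads=8"]))  # the benchmark's extra linkopt
```

For the flags shown in the sample, the scan finds nothing, which matches the statement that Bazel is not adding any threading options to these links.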