mold: No perf benefit with Bazel

We are not seeing any real perf benefit from Mold over lld when using Bazel. Our builds are done on 16 core boxes and take about 15 minutes to complete with Bazel.

Running outside of Bazel, we were indeed seeing massive perf benefits. Is it that the parallel nature of Bazel itself does away with the parallelism benefits of Mold? Or is there a special way to use Mold in Bazel that could speed up perf?

Dec 27 '22 03:12 pdeva

By 15 minutes, you mean you did a fresh build right? What is the perf when you build the project, edit a single source file and rebuild it?

Dec 27 '22 03:12 rui314

> By 15 minutes, you mean you did a fresh build right?

yes that was for a fresh build.

> What is the perf when you build the project, edit a single source file and rebuild it?

tried it. the time was exactly the same with and without mold: 17 seconds. when i make the small change, I see only 2 cores running at 100% during the subsequent bazel build, whether i use mold or not.

what should i be expecting to be different in this case?

Dec 27 '22 06:12 pdeva

First I'd verify that I'm really using mold for sure. Please run readelf -p .comment <your-executable-file> and see the output. Is there a string "mold" there?

Dec 27 '22 06:12 rui314

can verify output is indeed mold. here is the output using mold linker:

pdeva@code-pdeva:~/code/monorepo$ readelf -p .comment bazel-out/k8-fastbuild/bin/services/glutton/src/bin/bin

String dump of section '.comment':
  [     0]  mold 1.8.0 (a49a201695edd294ed4d97231c9dc5a994275dd2; compatible with GNU ld)
  [    4f]  clang version 14.0.0

Here it is with LLD:

pdeva@code-pdeva:~/code/monorepo$ readelf -p .comment bazel-out/k8-fastbuild/bin/services/glutton/src/bin/bin

String dump of section '.comment':
  [     0]  clang version 14.0.0
  [    15]  Linker: LLD 14.0.0

Dec 31 '22 19:12 pdeva
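The manual check above can also be scripted, which helps when many binaries need checking. The following is a minimal sketch, not part of mold or Bazel: the function names are made up for illustration, and the parsing assumes the `.comment` strings look like the two dumps shown in this thread (a `mold <version>` entry for mold, a `Linker: LLD <version>` entry for lld).

```python
import re
import subprocess


def read_comment_section(binary_path):
    """Run `readelf -p .comment` on a binary and return its text output."""
    result = subprocess.run(
        ["readelf", "-p", ".comment", binary_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def linkers_in_comment(readelf_output):
    """Return the linkers named in a `.comment` string dump.

    Assumption: entries look like the dumps quoted in this thread,
    e.g. "  [     0]  mold 1.8.0 (...)" or "  [    15]  Linker: LLD 14.0.0".
    """
    found = []
    for line in readelf_output.splitlines():
        # Each string entry is "[<hex offset>]  <text>".
        match = re.match(r"\s*\[\s*[0-9a-fA-F]+\]\s+(.*)", line)
        if not match:
            continue
        text = match.group(1)
        if text.startswith("mold "):
            found.append("mold")
        elif text.startswith("Linker: LLD"):
            found.append("lld")
    return found
```

Calling `linkers_in_comment(read_comment_section(path))` on each Bazel output binary would flag any target that is still being linked with the old linker.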

> tried it. the time was exactly the same with and without mold: 17 seconds. when i make the small change, I see only 2 cores running at 100% during the subsequent bazel build, whether i use mold or not.

In my (limited) testing, I also noticed that lld and mold run at roughly the same speed when using 2 threads (see this message for benchmark results). Only when using more threads does mold start to be significantly faster than lld. This is also roughly what one of the lld developers found.

Now that you have confirmed that your bazel invocation indeed links using mold, I would try to find out how exactly Bazel invokes the linker. It might somehow limit the linker's thread count to 2. Maybe you could run Bazel in verbose mode (assuming it supports that). Or you could use strace -f (I've successfully done that with ninja).

Jan 13 '23 18:01 moncefmechri
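Bazel can print the full command line of each action with `bazel build --subcommands` (short form `-s`), which is one way to capture the link invocation. Once the arguments are in hand, a small script can check whether anything on the line restricts linker threading. This is a sketch only: `thread_options` is a hypothetical helper, and the set of option names it looks for (`--threads`, `--no-threads`, `--thread-count`) is an assumption about what the linker in use accepts.

```python
# Option names assumed to control linker threading (an assumption, see above).
THREAD_OPTION_NAMES = {"threads", "no-threads", "thread-count"}


def thread_options(link_args):
    """Collect thread-related linker options from a compiler-driver argv."""
    hits = []
    for arg in link_args:
        # "-Wl,a,b" forwards "a" and "b" to the linker as separate options.
        if arg.startswith("-Wl,"):
            parts = arg[len("-Wl,"):].split(",")
        else:
            parts = [arg]
        for part in parts:
            if not part.startswith("-"):
                continue
            # "--threads=8" -> "threads"; "-pthread" -> "pthread" (ignored).
            name = part.lstrip("-").split("=", 1)[0]
            if name in THREAD_OPTION_NAMES:
                hits.append(part)
    return hits
```

An empty result means the build is not passing any explicit threading option, so the linker's own default applies.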

@pdeva did you ever find out what the issue was?

Jul 11 '23 23:07 dieortin

My company has a very large repository built with Bazel. We currently use the LLD that ships with LLVM 10.0.0. Here are my benchmarks comparing it to mold 2.31.0. In the best cases, mold was only slightly faster than LLD, and only after I forced it to parallelize with --threads. Since we'd hoped for a speedup when linking our thousands of unit test binaries, but didn't find one here, we decided against the transition.

Bazel link all C unity tests (965 binaries)

Collect list of UT: bazel query 'kind(cc_unity_test, //...) except attr(tags, "manual", //...)' > unity_targets.txt
Build once to populate cache: bazel build $(cat unity_targets.txt) --no-remote-exec
Modify linkopt to break link cache and rebuild, log this time. Repeat 3x.

Built with --jobs=1

MOLD

Elapsed: 108.797, Critical Path: 0.297
Elapsed: 84.788, Critical Path: 0.247 (-Wl,--threads=8)

LLD

Elapsed: 101.467, Critical Path: 0.293

Built parallelized

MOLD

Elapsed: 15.948, Critical Path: 1.230
Elapsed: 10.715, Critical Path: 0.893 (-Wl,--threads=8)

LLD

Elapsed: 10.774, Critical Path: 0.837

Bazel link all C++ google tests (592 binaries)

Collect list of UT: bazel query 'kind(cc_google_test, //...) except attr(tags, "manual", //...)' > google_targets.txt
Build once to populate cache: bazel build $(cat google_targets.txt) --no-remote-exec
Modify linkopt to break link cache and rebuild, log this time. Repeat 3x, average.

Built with --jobs=1

MOLD

Elapsed: 81.363, Critical Path: 0.337
Elapsed: 70.159, Critical Path: 0.307 (-Wl,--threads=8)

LLD

Elapsed: 104.635, Critical Path: 0.613

Built parallelized

MOLD

Elapsed: 23.717, Critical Path: 3.047
Elapsed: 14.082, Critical Path: 1.773 (-Wl,--threads=8)

LLD

Elapsed: 10.808, Critical Path: 1.316

Bazel link mixed C/C++ product binaries (68 binaries)

Collect list of binaries: bazel query 'kind(cc_configured_binary, //...) except attr(tags, "manual", //...)' | grep x86 > binary_targets.txt
Build once to populate cache: bazel build $(cat binary_targets.txt) --no-remote-exec
Modify linkopt to break link cache and rebuild, log this time. Repeat 3x, average.

Built with --jobs=1

MOLD

Elapsed: 23.605, Critical Path: 0.377
Elapsed: 22.902, Critical Path: 0.340 (-Wl,--threads=8)

LLD

Elapsed: 28.161, Critical Path: 0.607

Built parallelized

MOLD

Elapsed: 6.270, Critical Path: 2.180
Elapsed: 4.464, Critical Path: 1.187 (-Wl,--threads=8)

LLD

Elapsed: 4.533, Critical Path: 1.320

May 21 '24 01:05 rdeushane
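To make the numbers above easier to compare, the elapsed times can be expressed as a percent change relative to LLD. The sketch below only copies the figures from the benchmark; the `pct_vs_lld` helper and the table layout are illustrative choices, not part of the original measurements.

```python
# Elapsed seconds copied from the benchmark above:
# suite -> mode -> (mold default, mold with -Wl,--threads=8, LLD)
ELAPSED = {
    "C unity tests (965)": {
        "jobs=1": (108.797, 84.788, 101.467),
        "parallel": (15.948, 10.715, 10.774),
    },
    "C++ google tests (592)": {
        "jobs=1": (81.363, 70.159, 104.635),
        "parallel": (23.717, 14.082, 10.808),
    },
    "product binaries (68)": {
        "jobs=1": (23.605, 22.902, 28.161),
        "parallel": (6.270, 4.464, 4.533),
    },
}


def pct_vs_lld(mold_seconds, lld_seconds):
    """Percent change in elapsed time relative to LLD (negative = mold faster)."""
    return (mold_seconds - lld_seconds) / lld_seconds * 100.0


for suite, modes in ELAPSED.items():
    for mode, (mold, mold_t8, lld) in modes.items():
        print(
            f"{suite:24s} {mode:8s} "
            f"mold {pct_vs_lld(mold, lld):+7.1f}%  "
            f"mold --threads=8 {pct_vs_lld(mold_t8, lld):+7.1f}%"
        )
```

Reading the output by mode rather than by suite makes the pattern in the thread visible: the sign of the difference is not the same for the `--jobs=1` runs and the parallelized runs.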

@rdeushane Thank you for sharing the benchmark result! It's unfortunate that mold didn't make a significant difference. There are a few random observations:

mold by default spawns as many threads as the number of cores, so it is odd that passing --threads=8 makes a difference. Maybe bazel by default passes --no-threads to the linker?
In general, mold makes a big difference when creating a large binary. If you are creating hundreds of small binaries, other overhead such as process startup, etc. becomes dominant.

May 21 '24 02:05 rui314

@rui314

> In general, mold makes a big difference when creating a large binary. If you are creating hundreds of small binaries, other overhead such as process startup, etc. becomes dominant.

That makes sense, the tests being linked are all unit tests, hence pretty small executable size. Our product binaries tend to be on the small side as well, all well below 50-100MB.

> mold by default spawns as many threads as the number of cores, so it is odd that passing --threads=8 makes a difference. Maybe bazel by default passes --no-threads to the linker?

I confirmed that Bazel doesn't tamper with any of the threading requests in our links. I think what might be telling here is that in most of the cases where I'm building with --jobs=1, as in no parallel link actions happening simultaneously, mold begins to pull ahead even before I tamper with --threads.

I think in situations where we're linking a single massive binary, e.g. Google Chrome, there would be significant benefits, as you mentioned. But when we're linking thousands of small binaries, and the build system is already running those link actions in parallel, the benefit of a linker with more efficient threading is reduced: the system is already close to saturated, so each linker invocation gains less from threading than it would otherwise.

May 21 '24 17:05 rdeushane
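The saturation argument above can be put into rough numbers. This is a toy model, not a measurement: it assumes every concurrent link action runs its full thread count at the same time, and the core and job counts used below are illustrative, not taken from the benchmark machine.

```python
def oversubscription(concurrent_links, threads_per_link, cores):
    """Ratio of runnable linker threads to cores.

    Values above 1.0 mean there are more linker threads than cores, so the
    extra threads mostly wait for a core instead of shortening the link.
    """
    return concurrent_links * threads_per_link / cores


# Illustrative 16-core machine (hypothetical numbers).
single_link = oversubscription(concurrent_links=1, threads_per_link=16, cores=16)
parallel_links = oversubscription(concurrent_links=16, threads_per_link=16, cores=16)
print(single_link, parallel_links)
```

Under these assumptions a single link with one thread per core just fills the machine, while one link action per core with the same thread count asks for many times more threads than there are cores, which is consistent with mold's threading helping more in the `--jobs=1` runs.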

@rdeushane unless you have a custom rule, tests are linked dynamically in Bazel AFAIK

May 21 '24 18:05 dieortin

@dieortin We do have all custom test rules, here's a sample link line:

-o
bazel-out/k8-fastbuild/bin/components/command_manager/posix/app/uds/uds_client/test/test
-Wl,-S
bazel-out/k8-fastbuild/bin/components/command_manager/posix/app/uds/uds_client/test/_objs/test/test_uds_client.pic.o
   (about 200 "*.a" files here)
-lc++
-lc++abi
-lm
-Wl,--build-id=md5
-Wl,--hash-style=gnu
-Wl,-z,relro
-Wl,-z,now
-Wl,--enable-new-dtags
-pthread
-lpthread
-Wl,--gc-sections

May 21 '24 18:05 rdeushane
