Comparing `lld` 16.0 and `mold` 1.11.0
The recently-released lld 16.0 has seen some performance optimization work, so I wanted to see how it compares to mold, and share the numbers I got.
The application linked is a debug build of a large, statically-linked C++20 Linux x86_64 application built with GCC 12. The build machine is:
OS: Ubuntu 20.04 with kernel 5.4. Governor set to "performance", Turbo Boost disabled
CPU: Dual-socket Intel Xeon Gold 5218 (2x 16 cores, hyperthreading enabled)
RAM: 128 GB
Disk: NVMe SSD
I also added numbers for lld 12, since it's what I've been using so far. I've only tried 4 and 8 threads. Each benchmark was run 10 times, plus 3 warmup runs.
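The timing output below matches hyperfine's format, so the runs were presumably driven by something along these lines (an assumption; the exact benchmark command isn't shown in the thread):
# Illustrative benchmark driver: 3 warmup runs, 10 timed runs, each link pinned
# to NUMA node 0 via numactl (assumed hyperfine invocation).
hyperfine --warmup 3 --runs 10 \
  'numactl --cpunodebind=0 --membind=0 ./link_4_threads' \
  'numactl --cpunodebind=0 --membind=0 ./link_8_threads'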
LLD 12:
Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_4_threads
Time (mean ± σ): 11.636 s ± 0.033 s
Range (min … max): 11.572 s … 11.673 s
Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_8_threads
Time (mean ± σ): 8.785 s ± 0.043 s
Range (min … max): 8.708 s … 8.861 s
LLD 16:
Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_4_threads
Time (mean ± σ): 9.159 s ± 0.036 s
Range (min … max): 9.108 s … 9.226 s
Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_8_threads
Time (mean ± σ): 6.069 s ± 0.015 s
Range (min … max): 6.053 s … 6.104 s
mold v1.11.0:
Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_4_threads_mold
Time (mean ± σ): 8.509 s ± 0.023 s
Range (min … max): 8.474 s … 8.556 s
Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_8_threads_mold
Time (mean ± σ): 4.458 s ± 0.023 s
Range (min … max): 4.432 s … 4.504 s
mold is still faster, but it looks like the gap is now smaller.
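For context, the ./link_N_threads wrappers above differ only in which linker is used and in the thread count; a purely hypothetical sketch (the actual link line, object list, and remaining flags are not part of this thread):
#!/bin/sh
# Hypothetical contents of a ./link_4_threads-style wrapper; the object list
# (@objects.rsp) and output name are placeholders.
exec g++ -fuse-ld=lld @objects.rsp -o app -Wl,--threads=4
# mold variant (./link_4_threads_mold):
# exec g++ -fuse-ld=mold @objects.rsp -o app -Wl,--thread-count=4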
Thank you for sharing the numbers! I'm happy to see the improvements to the lld linker. That's healthy competition that benefits all users. That being said, I expected a little more from mold. On my machine, mold generally takes 1 second per 1 GiB of output. How large is your output file?
If you pin the linker process to a single core, does that make any difference in speed?
Can you run the linker with -Wl,--perf and copy-n-paste the output?
Thank you for sharing the numbers! I'm happy to see the improvements to the lld linker. That's healthy competition that benefits all users. That being said, I expected a little more from mold. On my machine, mold generally takes 1 second per 1 GiB of output. How large is your output file?
The final executable is 1.6 GB. If I strip the debug symbols, it is 158 MB.
If you pin the linker process to a single core, does that make any difference in speed?
Definitely. Here are the results with 4 threads all pinned to 1 core:
Benchmark 1: numactl --physcpubind=0 --membind=0 ./link_4_threads_mold
Time (mean ± σ): 33.384 s ± 0.102 s
Range (min … max): 33.312 s … 33.501 s
That correlates well with the numbers shared in the issue description, where linking with 4 threads took ~8.5 seconds: going from one core to four gives roughly a 4x speedup.
Can you run the linker with -Wl,--perf and copy-n-paste the output?
$ ./link_4_threads_mold
User System Real Name
31.921 6.338 9.744 all
4.040 0.675 1.188 read_input_files
27.889 5.666 8.555 total
14.885 3.968 4.832 before_copy
0.000 0.000 0.000 apply_exclude_libs
0.454 0.339 0.200 resolve_symbols
0.241 0.174 0.106 extract_archive_members
0.141 0.164 0.076 eliminate_comdats
0.109 0.003 0.028 kill_eh_frame_sections
8.304 2.557 2.771 resolve_section_pieces
0.000 0.000 0.000 convert_common_symbols
0.000 0.000 0.000 apply_version_script
0.043 0.100 0.036 compute_import_export
1.270 0.208 0.376 compute_merged_section_sizes
0.039 0.000 0.010 check_duplicate_symbols
0.045 0.000 0.012 check_symbol_types
0.200 0.048 0.094 create_output_sections
0.017 0.000 0.005 claim_unresolved_symbols
0.006 0.000 0.002 sort_init_fini
0.000 0.000 0.000 sort_ctor_dtor
0.288 0.246 0.134 scan_relocations
0.034 0.001 0.009 compute_section_sizes
0.000 0.000 0.000 DynsymSection::finalize
0.002 0.000 0.000 fill_verneed
0.070 0.039 0.028 compute_symtab_size
0.019 0.033 0.013 eh_frame
3.979 0.392 1.111 GdbIndexSection::construct
0.000 0.000 0.000 set_osec_offsets
0.000 0.000 0.000 open_file
13.003 1.698 3.723 copy
6.831 1.392 2.092 copy_chunks
0.000 0.000 0.000 EHDR
0.000 0.000 0.000 PHDR
0.000 0.000 0.000 .interp
0.000 0.000 0.000 .note.gnu.build-id
0.000 0.000 0.000 .note.ABI-tag
0.001 0.000 0.001 .hash
0.000 0.000 0.000 .data.rel.ro
0.293 0.094 0.099 .eh_frame
0.000 0.000 0.000 .fini_array
0.004 0.000 0.003 .init_array
1.171 0.430 0.401 .symtab
0.000 0.000 0.000 .gnu.hash
0.000 0.000 0.000 .dynsym
0.000 0.000 0.000 .dynstr
0.000 0.000 0.000 .gnu.version
0.000 0.000 0.000 .gnu.version_r
0.000 0.000 0.000 .got.plt
0.006 0.000 0.003 .data
0.000 0.000 0.000 .got
0.000 0.000 0.000 .copyrel.rel.ro
0.000 0.000 0.000 .relro_padding
0.000 0.000 0.000 .plt
0.000 0.000 0.000 .fini
0.000 0.000 0.000 .init
1.635 0.538 0.542 .text
0.000 0.000 0.000 .tm_clone_table
0.000 0.000 0.000 .copyrel
0.000 0.000 0.000 .bss
0.000 0.000 0.000 .strtab
6.824 1.392 2.088 .debug_info
0.000 0.000 0.000 .eh_frame_hdr
0.039 0.024 0.015 .gcc_except_table
0.206 0.127 0.082 .rodata
0.000 0.000 0.000 .rodata.cst
0.000 0.000 0.000 .tbss
0.000 0.000 0.000 .dynamic
0.000 0.000 0.000 .debug_loclists
0.918 0.231 0.285 .debug_rnglists
0.000 0.000 0.000 .comment
0.246 0.060 0.078 .debug_abbrev
0.591 0.144 0.182 .debug_aranges
0.000 0.000 0.000 .shstrtab
0.000 0.000 0.000 .gdb_index
1.966 0.404 0.594 .debug_line
2.833 0.592 0.857 .debug_str
0.001 0.000 0.001 .debug_line_str
0.000 0.000 0.000 SHDR
0.000 0.000 0.000 .rela.dyn
0.000 0.000 0.000 .rela.plt
0.043 0.159 0.051 GdbIndexSection::write_address_areas
0.000 0.000 0.000 sort_dynamic_relocs
0.000 0.000 0.000 clear_padding
6.128 0.148 1.580 build_id
0.000 0.000 0.000 close_file
I'm sorry, I meant to ask you to run it pinned to a single socket rather than a single core. I wondered whether inter-socket traffic is a bottleneck.
I think --cpunodebind=0 --membind=0 already binds it to the first socket. At least this was how it was tested in #937.
I think --cpunodebind=0 --membind=0 already binds it to the first socket. At least this was how it was tested in #937.
That's correct. All the numbers shared in this issue were collected by prepending numactl --cpunodebind=0 --membind=0, which binds to a single NUMA node. Without it, performance is roughly 20% worse and there is a lot more variance (as @ishitatsuyuki said, this was discovered in #937).
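For reference, the two numactl pinning modes used in this thread (<link command> is a placeholder for the wrappers above):
numactl --hardware                                   # list NUMA nodes and the CPUs/memory attached to each
numactl --cpunodebind=0 --membind=0 <link command>   # run on all cores of node 0, allocate memory on node 0 (one socket)
numactl --physcpubind=0 --membind=0 <link command>   # run on CPU 0 only (a single core), memory on node 0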
mold is faster, but I think a fair comparison needs to link mimalloc.a into lld, as mimalloc is responsible for 10+% performance improvement, e.g.
-DCMAKE_EXE_LINKER_FLAGS=$HOME/Dev/mimalloc/out/release/libmimalloc.a -DLLVM_ENABLE_PIC=off -DCMAKE_{C,CXX}_FLAGS=-fno-asynchronous-unwind-tables
mold is faster, but I think a fair comparison needs to link mimalloc.a into lld, as mimalloc is responsible for 10+% performance improvement, e.g.
-DCMAKE_EXE_LINKER_FLAGS=$HOME/Dev/mimalloc/out/release/libmimalloc.a -DLLVM_ENABLE_PIC=off -DCMAKE_{C,CXX}_FLAGS=-fno-asynchronous-unwind-tables
While I agree there would be value in benchmarking lld + mimalloc vs mold, I don't think the comparison as it's currently done is "unfair": I expect most lld users to get lld from their package manager (as we currently do), and to my knowledge (please correct me if I'm wrong), those packages won't be linked against mimalloc.
Defaults matter, as it's what most people will use.
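For anyone who does want to benchmark an lld linked against mimalloc, MaskRay's flags above might expand into a configure invocation roughly like this (a sketch; the llvm-project source layout, Ninja generator, build directory, and mimalloc path are assumptions):
# Hypothetical CMake configure for an lld statically linked against mimalloc.
cmake -G Ninja -S llvm -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS=lld \
  -DLLVM_ENABLE_PIC=off \
  -DCMAKE_C_FLAGS=-fno-asynchronous-unwind-tables \
  -DCMAKE_CXX_FLAGS=-fno-asynchronous-unwind-tables \
  -DCMAKE_EXE_LINKER_FLAGS=$HOME/Dev/mimalloc/out/release/libmimalloc.a
cmake --build build --target lld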