Speed comparison with LLD when linking the Clang program
The README page mentions the following benchmark: Clang 19 (1.56 GiB) 42.07s 33.13s 5.20s 1.35s, but I cannot reproduce it on my AMD machine. First, am I right that the binary size (1.56 GiB) was measured with debug info? If so, did you use -DCMAKE_BUILD_TYPE=RelWithDebInfo or something else? And did you use any --compress-debug-sections= option?
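For reference, here is roughly the configuration I assumed the README numbers were produced with (a sketch; the build type and project list are my guesses, not taken from the README):
cmake -GNinja -S llvm -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_PROJECTS=clang -DLLVM_TARGETS_TO_BUILD=host
ninja -C build clang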
My numbers for AMD Ryzen 9 7900X 12-Core Processor are:
❯ bloaty ../../../../bin/clang-20
FILE SIZE VM SIZE
-------------- --------------
74.3% 3.31Gi 0.0% 0 .debug_info
9.4% 427Mi 0.0% 0 .debug_loclists
5.1% 231Mi 0.0% 0 .debug_str
4.8% 220Mi 0.0% 0 .debug_line
2.1% 95.8Mi 59.8% 95.8Mi .text
1.7% 76.4Mi 0.0% 0 .debug_rnglists
0.9% 41.3Mi 25.8% 41.3Mi .rodata
0.5% 23.6Mi 0.0% 0 .debug_abbrev
0.5% 23.5Mi 0.0% 0 .strtab
0.2% 9.72Mi 6.1% 9.72Mi .eh_frame
0.1% 5.44Mi 0.0% 0 .symtab
0.1% 4.49Mi 2.8% 4.49Mi .dynstr
0.1% 4.22Mi 2.6% 4.22Mi .data.rel.ro
0.1% 3.64Mi 0.0% 0 .debug_aranges
0.0% 1.30Mi 0.8% 1.30Mi .dynsym
0.0% 1.24Mi 0.8% 1.24Mi .eh_frame_hdr
0.0% 0 0.4% 715Ki .bss
0.0% 505Ki 0.3% 505Ki [24 Others]
0.0% 444Ki 0.3% 444Ki .hash
0.0% 404Ki 0.2% 404Ki .gnu.hash
0.0% 365Ki 0.0% 0 .debug_line_str
100.0% 4.45Gi 100.0% 160Mi TOTAL
❯ hyperfine ... -fuse-ld=mold
Time (mean ± σ): 2.802 s ± 0.110 s [User: 0.009 s, System: 0.003 s]
Range (min … max): 2.658 s … 2.999 s 10 runs
❯ hyperfine ... -fuse-ld=lld
Time (mean ± σ): 4.160 s ± 0.225 s [User: 40.475 s, System: 13.092 s]
Range (min … max): 3.604 s … 4.428 s 10 runs
❯ ld.lld --version
LLD 19.1.0 (compatible with GNU linkers)
❯ mold --version
mold 2.33.0 (compatible with GNU ld)
Both LLD and mold come from openSUSE packages (built with LTO). On my machine, LLD is 1.48x slower than mold, while your numbers claim it is 3.85x slower. Can you please remeasure it?
I forgot to mention that the object files use ZSTD compression for the debug info sections.
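For context, the compression was applied on the compiler side, roughly like this (a sketch; the exact flags in my build may differ):
# compress the .debug_* sections in the object files with zstd (clang/GCC -gz flag)
cmake -GNinja -S llvm -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_FLAGS=-gz=zstd -DCMAKE_CXX_FLAGS=-gz=zstd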
And my perf report looks as follows:
25.61% ld.mold mold [.] mold::elf::MergeableSection<mold::elf::X86_64>::get_fragment(long)
10.70% ld.mold mold [.] mold::elf::InputSection<mold::elf::X86_64>::apply_reloc_nonalloc(mold::elf::Context<mold::elf::X86_64>&, unsigned char*)
9.97% ld.mold libc.so.6 [.] __memcmp_evex_movbe
8.35% ld.mold mold [.] mold::elf::MergedSection<mold::elf::X86_64>::insert(mold::elf::Context<mold::elf::X86_64>&, std::basic_string_view<char, std::char_traits<char> >, unsigned long, long) [clone .isra.0]
5.43% ld.mold mold [.] mold::elf::MergeableSection<mold::elf::X86_64>::split_contents(mold::elf::Context<mold::elf::X86_64>&)
4.89% ld.mold mold [.] blake3_hash_many_avx512
2.65% ld.mold mold [.] mold::elf::find_null(std::basic_string_view<char, std::char_traits<char> >, long, long) [clone .lto_priv.17] [clone .lto_priv.0]
2.20% ld.mold mold [.] mold::elf::InputSection<mold::elf::X86_64>::get_fragment(mold::elf::Context<mold::elf::X86_64>&, mold::elf::ElfRel<mold::elf::X86_64> const&) [clone .isra.0]
2.05% ld.mold libzstd.so.1.5.6 [.] 0x000000000006c393
1.95% ld.mold mold [.] mold::elf::InputSection<mold::elf::X86_64>::record_undef_error(mold::elf::Context<mold::elf::X86_64>&, mold::elf::ElfRel<mold::elf::X86_64> const&)
1.04% ld.mold mold [.] mold::elf::InputSection<mold::elf::X86_64>::get_tombstone(mold::elf::Symbol<mold::elf::X86_64>&, mold::elf::SectionFragment<mold::elf::X86_64>*)
0.90% ld.mold mold [.] mold::elf::MergeableSection<mold::elf::X86_64>::get_contents(long)
0.71% ld.mold libc.so.6 [.] __memchr_evex
0.71% ld.mold libc.so.6 [.] __memmove_avx512_unaligned_erms
0.67% ld.mold mold [.] mold::Integer<long, (std::endian)1234, 8>::operator long() const [clone .isra.0]
0.55% ld.mold mold [.] mold::elf::Symbol<mold::elf::X86_64>::get_addr(mold::elf::Context<mold::elf::X86_64>&, long) const
0.51% ld.mold mold [.] mold::elf::MergedSection<mold::elf::X86_64>::compute_section_size(mold::elf::Context<mold::elf::X86_64>&)::{lambda(long)#1}::operator()(long) const
I ran a quick benchmark again and the numbers seem consistent.
ruiu@odyssey:~/llvm-project/b$ taskset -c 0-31 hyperfine 'mold @rsp' 'ld.lld @rsp'
Benchmark 1: mold @rsp
Time (mean ± σ): 1.636 s ± 0.018 s [User: 0.005 s, System: 0.003 s]
Range (min … max): 1.609 s … 1.663 s 10 runs
Benchmark 2: ld.lld @rsp
Time (mean ± σ): 5.985 s ± 0.025 s [User: 27.125 s, System: 16.205 s]
Range (min … max): 5.946 s … 6.018 s 10 runs
Summary
mold @rsp ran
3.66 ± 0.04 times faster than ld.lld @rsp
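(For anyone reproducing this: the @rsp file is just the captured link command line for the clang binary. One way to get it from a Ninja build, though the exact target name may differ, is sketched below:)
ninja -t commands bin/clang-19 | tail -n1   # print the final link command
# save the arguments after the linker driver into a file named rsp, then:
taskset -c 0-31 hyperfine 'mold @rsp' 'ld.lld @rsp'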
Interesting! Can you please provide the output of bloaty for the linked binary? And a perf report for the mold linker ;)
$ ~/bloaty/build/bloaty bin/clang-19
FILE SIZE VM SIZE
-------------- --------------
70.2% 2.85Gi 0.0% 0 .debug_info
6.0% 249Mi 0.0% 0 .debug_str
5.8% 240Mi 0.0% 0 .strtab
4.7% 196Mi 0.0% 0 .debug_line
4.6% 190Mi 49.5% 190Mi .text
2.7% 111Mi 29.1% 111Mi .rodata
1.4% 58.8Mi 0.0% 0 .symtab
1.3% 55.6Mi 0.0% 0 .debug_aranges
0.9% 38.6Mi 10.0% 38.6Mi .eh_frame
0.8% 35.3Mi 0.0% 0 .debug_rnglists
0.6% 23.8Mi 0.0% 0 .debug_abbrev
0.3% 12.1Mi 3.2% 12.1Mi .rela.dyn
0.2% 9.44Mi 2.5% 9.44Mi .eh_frame_hdr
0.2% 8.31Mi 2.2% 8.31Mi .data.rel.ro
0.2% 7.75Mi 2.0% 7.75Mi .dynstr
0.0% 0 0.7% 2.87Mi .bss
0.0% 1.92Mi 0.5% 1.92Mi .dynsym
0.0% 495Ki 0.1% 495Ki .gnu.hash
0.0% 429Ki 0.1% 429Ki .data
0.0% 315Ki 0.0% 0 .debug_line_str
0.0% 252Ki 0.1% 252Ki [27 Others]
100.0% 4.06Gi 100.0% 383Mi TOTAL
And here is my perf report.
+ 29.04% 0.19% mold [kernel.kallsyms] [k] asm_exc_page_fault
+ 28.50% 0.15% mold [kernel.kallsyms] [k] exc_page_fault
+ 28.03% 0.15% mold [kernel.kallsyms] [k] do_user_addr_fault
+ 27.46% 0.18% mold [kernel.kallsyms] [k] handle_mm_fault
+ 26.93% 0.29% mold [kernel.kallsyms] [k] __handle_mm_fault
+ 26.56% 0.07% mold [kernel.kallsyms] [k] handle_pte_fault
+ 21.83% 0.10% mold [kernel.kallsyms] [k] do_fault
+ 21.72% 0.88% mold libc.so.6 [.] __memmove_avx512_unaligned_erms
+ 17.52% 0.02% mold [kernel.kallsyms] [k] do_page_mkwrite
+ 17.48% 0.43% mold [kernel.kallsyms] [k] ext4_page_mkwrite
+ 16.63% 0.05% mold [kernel.kallsyms] [k] block_page_mkwrite
+ 16.08% 7.61% mold mold [.] mold::MergeableSection<mold::X86_64>::get_fragment(long)
+ 15.95% 0.04% mold [kernel.kallsyms] [k] mark_buffer_dirty
+ 15.91% 15.73% mold [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
+ 15.81% 0.06% mold [kernel.kallsyms] [k] __block_commit_write
+ 15.70% 0.04% mold [kernel.kallsyms] [k] __folio_mark_dirty
+ 15.54% 0.00% mold [unknown] [k] 0000000000000000
+ 15.03% 0.03% mold [kernel.kallsyms] [k] __raw_spin_lock_irqsave
+ 15.03% 0.01% mold [kernel.kallsyms] [k] _raw_spin_lock_irqsave
+ 13.13% 7.72% mold mold [.] mold::MergedSection<mold::X86_64>::insert(mold::Context<mold::X86_64>&, std::basic_string_view<char, std::char_traits<char> >, unsigned long, long)
+ 11.07% 7.42% mold mold [.] mold::InputSection<mold::X86_64>::apply_reloc_nonalloc(mold::Context<mold::X86_64>&, unsigned char*)
+ 9.87% 0.19% mold libc.so.6 [.] __sched_yield
+ 9.53% 0.20% mold [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 9.40% 0.32% mold [kernel.kallsyms] [k] do_syscall_64
+ 8.65% 0.16% mold [kernel.kallsyms] [k] x64_sys_call
+ 8.29% 1.23% mold libc.so.6 [.] __memcmp_evex_movbe
+ 7.96% 0.10% mold [kernel.kallsyms] [k] __x64_sys_sched_yield
+ 7.54% 0.92% mold [kernel.kallsyms] [k] do_sched_yield
+ 6.16% 0.14% mold [kernel.kallsyms] [k] schedule
+ 5.85% 0.71% mold [kernel.kallsyms] [k] __schedule
+ 5.62% 4.66% mold mold [.] mold::MergeableSection<mold::X86_64>::resolve_contents(mold::Context<mold::X86_64>&)
+ 4.54% 0.16% mold [kernel.kallsyms] [k] pick_next_task
+ 4.41% 4.35% mold mold [.] blake3_hash_many_avx512
+ 4.41% 1.19% mold [kernel.kallsyms] [k] pick_next_task_fair
+ 4.36% 0.00% mold [unknown] [k] 0x0000000000004000
+ 4.13% 2.51% mold libc.so.6 [.] __memchr_evex
+ 4.01% 2.08% mold mold [.] mold::MergeableSection<mold::X86_64>::split_contents(mold::Context<mold::X86_64>&)
+ 3.38% 2.60% mold mold [.] mold::InputSection<mold::X86_64>::record_undef_error(mold::Context<mold::X86_64>&, mold::ElfRel<mold::X86_64> const&)
+ 2.91% 1.37% mold [kernel.kallsyms] [k] update_curr
+ 2.48% 0.15% mold [kernel.kallsyms] [k] do_anonymous_page
+ 2.28% 0.00% mold [unknown] [.] 0x49544100ee86e305
+ 2.28% 0.00% mold mold [.] mold::ObjectFile<mold::X86_64>::~ObjectFile()
+ 2.17% 0.01% mold [kernel.kallsyms] [k] do_read_fault
+ 2.12% 0.20% mold [kernel.kallsyms] [k] filemap_map_pages
+ 1.98% 0.58% mold [kernel.kallsyms] [k] set_pte_range
+ 1.97% 1.00% mold mold [.] mold::ObjectFile<mold::X86_64>::initialize_sections(mold::Context<mold::X86_64>&)
+ 1.83% 1.07% mold mold [.] mold::Symbol<mold::X86_64>* mold::get_symbol<mold::X86_64>(mold::Context<mold::X86_64>&, std::basic_string_view<char, std::char_traits<char> >, std::basic_s
+ 1.79% 0.00% mold [kernel.kallsyms] [k] do_wp_page
+ 1.78% 0.01% mold [kernel.kallsyms] [k] wp_page_copy
+ 1.64% 1.36% mold mold [.] mold::ObjectFile<mold::X86_64>::resolve_symbols(mold::Context<mold::X86_64>&)
+ 1.52% 0.87% mold [kernel.kallsyms] [k] srso_alias_safe_ret
+ 1.47% 0.00% mold [kernel.kallsyms] [k] flush_tlb_mm_range
+ 1.45% 0.00% mold [kernel.kallsyms] [k] on_each_cpu_cond_mask
+ 1.45% 0.00% mold [kernel.kallsyms] [k] native_flush_tlb_multi
+ 1.44% 0.00% mold [kernel.kallsyms] [k] ptep_clear_flush
+ 1.41% 0.30% mold [kernel.kallsyms] [k] _raw_spin_lock
+ 1.41% 0.77% mold libc.so.6 [.] __strlen_evex
1.33% 1.27% mold mold [.] XXH_INLINE_XXH3_64bits
+ 1.23% 0.60% mold mold [.] tbb::detail::d2::concurrent_hash_map<std::basic_string_view<char, std::char_traits<char> >, mold::Symbol<mold::X86_64>, HashCmp, tbb::detail::d1::tbb_alloca
+ 1.18% 0.19% mold mold [.] tbb::detail::d2::concurrent_hash_map<std::basic_string_view<char, std::char_traits<char> >, mold::ComdatGroup, HashCmp, tbb::detail::d1::tbb_allocator<std::
+ 1.16% 0.02% mold [kernel.kallsyms] [k] vma_alloc_folio
+ 1.16% 0.02% mold [kernel.kallsyms] [k] alloc_anon_folio
+ 1.09% 0.86% mold mold [.] mold::ObjectFile<mold::X86_64>::reattach_section_pieces(mold::Context<mold::X86_64>&)
+ 1.06% 0.02% mold [kernel.kallsyms] [k] alloc_pages_mpol
+ 1.01% 0.92% mold [kernel.kallsyms] [k] __mod_node_page_state
+ 1.01% 0.00% mold [unknown] [.] 0xcccccccccccccccc
+ 0.92% 0.09% mold [kernel.kallsyms] [k] asm_sysvec_call_function
+ 0.91% 0.82% mold mold [.] mold::ObjectFile<mold::X86_64>::compute_symtab_size(mold::Context<mold::X86_64>&)
+ 0.91% 0.06% mold [kernel.kallsyms] [k] __alloc_pages
+ 0.90% 0.55% mold [kernel.kallsyms] [k] smp_call_function_many_cond
+ 0.89% 0.02% mold [kernel.kallsyms] [k] __do_fault
0.88% 0.76% mold [kernel.kallsyms] [k] srso_alias_return_thunk
+ 0.87% 0.80% mold mold [.] tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data
+ 0.85% 0.08% mold [kernel.kallsyms] [k] folio_add_file_rmap_ptes
0.85% 0.42% mold [kernel.kallsyms] [k] ext4_dirty_folio
+ 0.83% 0.23% mold [kernel.kallsyms] [k] __lruvec_stat_mod_folio
+ 0.83% 0.04% mold [kernel.kallsyms] [k] filemap_fault
Note that I built it not with RelWithDebInfo but with Debug.
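In other words, the two setups differ roughly like this (a sketch; everything besides the build type is assumed to be left at its default):
# my build: unoptimized, assertions default to on
cmake -GNinja -S llvm -B b -DCMAKE_BUILD_TYPE=Debug
# your build (assumed): -O2 with debug info, assertions default to off
cmake -GNinja -S llvm -B b -DCMAKE_BUILD_TYPE=RelWithDebInfo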
Thanks. If I filter out only the mold functions, the profile is very comparable to what I see. Interestingly, your CPU (which is faster) behaves quite differently: LLD is much slower than on my machine, while mold is faster.
Anyway, please update README.md, which currently lists the Clang binary size as Clang 19 (1.56 GiB). I think it should be 4.06 GiB, right?
bloaty ../../../../bin/clang-20
That's clang 20, not clang 19.
Yeah, but these two are very similar in size.
Yes, but the size doesn't really matter. It's like saying that if I statically link 5,000 libraries my executable is 5 MB, but if I dynamically link those same libraries the linker performance is the same and the executable is only 3 MB. Statically linking something requires copying it into the binary, so the linker has more work to do with static libraries. With dynamic linking, the shared object/DLL contents don't need to be copied at link time; the linker only has to link the program itself and the static libraries it depends on. Also, there are changes between clang 19 and clang 20: for example, it might require an additional library to be statically linked, replace a statically linked library with a dynamically linked one, or need more static libraries.
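To make that concrete, LLVM itself can be built either way; for example (a sketch using the standard dylib options, not necessarily what either benchmark above used):
# default: clang links the LLVM/Clang libraries as static archives (bigger binary, more linker work)
cmake -GNinja -S llvm -B out/static -DLLVM_ENABLE_PROJECTS=clang
# alternative: link against shared libLLVM/libclang-cpp (smaller binary, less linker work)
cmake -GNinja -S llvm -B out/shared -DLLVM_ENABLE_PROJECTS=clang -DLLVM_LINK_LLVM_DYLIB=ON -DCLANG_LINK_CLANG_DYLIB=ON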
mimalloc improves lld's performance by more than 10% compared with glibc malloc. The following cmake command for llvm-project mostly matches mold's default build settings.
# -fvisibility-inlines-hidden -O3 -DNDEBUG -std=c++20 -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti
cmake -GNinja -Sllvm -Bout/release -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=$HOME/Stable/bin/clang++ -DCMAKE_C_COMPILER=$HOME/Stable/bin/clang -DCMAKE_POSITION_INDEPENDENT_CODE=off -DCMAKE_CXX_STANDARD=20 -DLLVM_ENABLE_UNWIND_TABLES=off -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_ENABLE_PIC=off -DCMAKE_EXE_LINKER_FLAGS=$HOME/Dev/mimalloc/out/release/libmimalloc.a
-DCMAKE_POSITION_INDEPENDENT_CODE=off -DLLVM_ENABLE_PIC=off is needed to downgrade -fPIC to -fPIE.
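A quick sanity check that mimalloc actually ended up statically linked into lld (a sketch; mi_malloc is part of mimalloc's public API):
nm --defined-only out/release/bin/ld.lld | grep -c ' mi_malloc'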
Configure mold like this:
cmake -GNinja -S. -Bout/release -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=~/Stable/bin/clang++ -DCMAKE_C_COMPILER=~/Stable/bin/clang -DCMAKE_CXX_FLAGS='-fvisibility-inlines-hidden'
You can find fast (PGO+bolt) prebuilt clang executables at https://mirrors.edge.kernel.org/pub/tools/llvm/ and https://raw.githubusercontent.com/chromium/chromium/main/tools/clang/scripts/update.py
My local lld with mimalloc is faster than Arch Linux's lld (which uses libLLVM.so and liblld*.so).
% hyperfine --warmup 1 --min-runs 10 -s 'rm -f a.out' 'numactl -C 0-7 '{/usr/bin/ld.lld,~/llvm/out/release/bin/ld.lld}' --threads=8 @response.txt -o a.out'
Benchmark 1: numactl -C 0-7 /usr/bin/ld.lld --threads=8 @response.txt -o a.out
Time (mean ± σ): 628.4 ms ± 11.1 ms [User: 977.1 ms, System: 366.4 ms]
Range (min … max): 609.0 ms … 647.7 ms 10 runs
Benchmark 2: numactl -C 0-7 ~/llvm/out/release/bin/ld.lld --threads=8 @response.txt -o a.out
Time (mean ± σ): 554.2 ms ± 12.1 ms [User: 1021.7 ms, System: 227.7 ms]
Range (min … max): 539.2 ms … 576.3 ms 10 runs
Summary
numactl -C 0-7 ~/llvm/out/release/bin/ld.lld --threads=8 @response.txt -o a.out ran
1.13 ± 0.03 times faster than numactl -C 0-7 /usr/bin/ld.lld --threads=8 @response.txt -o a.out
rm -f a.out makes the comparison even fairer for lld vs mold.
Perhaps lld should have mold's --no-fork as another optimization?
-DLLVM_ENABLE_ASSERTIONS={on,off} and -DLLVM_TARGETS_TO_BUILD=host vs -DLLVM_TARGETS_TO_BUILD=all can affect Clang sizes a lot.
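For an apples-to-apples size comparison it helps to pin both down explicitly, e.g. (a sketch; build directories and remaining options are placeholders):
cmake -GNinja -S llvm -B b1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=off -DLLVM_TARGETS_TO_BUILD=host   # smaller clang
cmake -GNinja -S llvm -B b2 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=on -DLLVM_TARGETS_TO_BUILD=all     # much larger clang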