Comparing `lld` 16.0 and `mold` 1.11.0

moncefmechri opened this issue 2 years ago

The recently-released lld 16.0 has seen some performance optimization work, so I wanted to see how it compares to mold, and share the numbers I got.

The application linked is a debug build of a large, statically-linked C++20 Linux x86_64 application built with GCC 12. The build machine is:

- OS: Ubuntu 20.04 with kernel 5.4. Governor set to "performance", Turbo Boost disabled
- CPU: Dual-socket Intel Xeon Gold 5218 (2x 16 cores, hyperthreading enabled)
- RAM: 128 GB
- Disk: NVMe SSD

I also added numbers for lld 12, since it's what I've been using so far. I've only tried 4 and 8 threads. Each benchmark is run 10 times, plus 3 warmup runs.
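(The timings below match hyperfine's output format; as a sketch, an equivalent run could look like the following, with the `link_*` scripts standing in for the actual link commands:)

$ hyperfine --warmup 3 --runs 10 'numactl --cpunodebind=0 --membind=0 ./link_4_threads'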

LLD 12:

Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_4_threads
  Time (mean ± σ):     11.636 s ±  0.033 s
  Range (min … max):   11.572 s … 11.673 s


Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_8_threads
  Time (mean ± σ):      8.785 s ±  0.043 s
  Range (min … max):    8.708 s …  8.861 s

LLD 16:

Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_4_threads
  Time (mean ± σ):      9.159 s ±  0.036 s
  Range (min … max):    9.108 s …  9.226 s

Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_8_threads
  Time (mean ± σ):      6.069 s ±  0.015 s
  Range (min … max):    6.053 s …  6.104 s

mold 1.11.0:

Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_4_threads_mold
  Time (mean ± σ):      8.509 s ±  0.023 s
  Range (min … max):    8.474 s …  8.556 s
  
Benchmark 1: numactl --cpunodebind=0 --membind=0 ./link_8_threads_mold
  Time (mean ± σ):      4.458 s ±  0.023 s
  Range (min … max):    4.432 s …  4.504 s

mold is still faster (about 1.1x at 4 threads and 1.4x at 8 threads versus lld 16, compared with roughly 1.4x and 2x versus lld 12), but the gap has clearly narrowed.

moncefmechri avatar Mar 20 '23 12:03 moncefmechri

Thank you for sharing the numbers! I'm happy to see the improvements to the lld linker; that's healthy competition that benefits all users. That being said, I expected a little more from mold. On my machine, mold generally takes 1 second per 1 GiB of output. How large is your output file?

If you pin the linker process to a single core, does that make any difference in speed?

Can you run the linker with -Wl,--perf and copy-n-paste the output?
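(As a sketch, one way to pass that flag through the compiler driver, assuming GCC 12.1+ selecting mold via `-fuse-ld=mold`; the actual link command in the `link_*` scripts may differ:)

$ g++ -fuse-ld=mold -Wl,--perf main.o ... -o app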

rui314 avatar Mar 20 '23 12:03 rui314

Thank you for sharing the numbers! I'm happy to see the improvements to the lld linker; that's healthy competition that benefits all users. That being said, I expected a little more from mold. On my machine, mold generally takes 1 second per 1 GiB of output. How large is your output file?

The final executable is 1.6 GB, so by that rule of thumb roughly 1.6 s would be expected. If I strip the debug symbols, the final executable is 158 MB.

If you pin the linker process to a single core, does that make any difference in speed?

Definitely. Here are the results with 4 threads all pinned to 1 core:

Benchmark 1: numactl --physcpubind=0  --membind=0 ./link_4_threads_mold
  Time (mean ± σ):     33.384 s ±  0.102 s
  Range (min … max):   33.312 s … 33.501 s

This correlates well with the numbers shared in the issue description, where linking with 4 threads took ~8.5 seconds, roughly a 3.9x difference.

Can you run the linker with -Wl,--perf and copy-n-paste the output?

$ ./link_4_threads_mold
     User   System     Real  Name
   31.921    6.338    9.744  all
    4.040    0.675    1.188    read_input_files
   27.889    5.666    8.555    total
   14.885    3.968    4.832      before_copy
    0.000    0.000    0.000        apply_exclude_libs
    0.454    0.339    0.200        resolve_symbols
    0.241    0.174    0.106          extract_archive_members
    0.141    0.164    0.076          eliminate_comdats
    0.109    0.003    0.028        kill_eh_frame_sections
    8.304    2.557    2.771        resolve_section_pieces
    0.000    0.000    0.000        convert_common_symbols
    0.000    0.000    0.000        apply_version_script
    0.043    0.100    0.036        compute_import_export
    1.270    0.208    0.376        compute_merged_section_sizes
    0.039    0.000    0.010        check_duplicate_symbols
    0.045    0.000    0.012        check_symbol_types
    0.200    0.048    0.094        create_output_sections
    0.017    0.000    0.005        claim_unresolved_symbols
    0.006    0.000    0.002        sort_init_fini
    0.000    0.000    0.000        sort_ctor_dtor
    0.288    0.246    0.134        scan_relocations
    0.034    0.001    0.009        compute_section_sizes
    0.000    0.000    0.000        DynsymSection::finalize
    0.002    0.000    0.000        fill_verneed
    0.070    0.039    0.028        compute_symtab_size
    0.019    0.033    0.013        eh_frame
    3.979    0.392    1.111        GdbIndexSection::construct
    0.000    0.000    0.000        set_osec_offsets
    0.000    0.000    0.000      open_file
   13.003    1.698    3.723      copy
    6.831    1.392    2.092        copy_chunks
    0.000    0.000    0.000          EHDR
    0.000    0.000    0.000          PHDR
    0.000    0.000    0.000          .interp
    0.000    0.000    0.000          .note.gnu.build-id
    0.000    0.000    0.000          .note.ABI-tag
    0.001    0.000    0.001          .hash
    0.000    0.000    0.000          .data.rel.ro
    0.293    0.094    0.099          .eh_frame
    0.000    0.000    0.000          .fini_array
    0.004    0.000    0.003          .init_array
    1.171    0.430    0.401          .symtab
    0.000    0.000    0.000          .gnu.hash
    0.000    0.000    0.000          .dynsym
    0.000    0.000    0.000          .dynstr
    0.000    0.000    0.000          .gnu.version
    0.000    0.000    0.000          .gnu.version_r
    0.000    0.000    0.000          .got.plt
    0.006    0.000    0.003          .data
    0.000    0.000    0.000          .got
    0.000    0.000    0.000          .copyrel.rel.ro
    0.000    0.000    0.000          .relro_padding
    0.000    0.000    0.000          .plt
    0.000    0.000    0.000          .fini
    0.000    0.000    0.000          .init
    1.635    0.538    0.542          .text
    0.000    0.000    0.000          .tm_clone_table
    0.000    0.000    0.000          .copyrel
    0.000    0.000    0.000          .bss
    0.000    0.000    0.000          .strtab
    6.824    1.392    2.088          .debug_info
    0.000    0.000    0.000          .eh_frame_hdr
    0.039    0.024    0.015          .gcc_except_table
    0.206    0.127    0.082          .rodata
    0.000    0.000    0.000          .rodata.cst
    0.000    0.000    0.000          .tbss
    0.000    0.000    0.000          .dynamic
    0.000    0.000    0.000          .debug_loclists
    0.918    0.231    0.285          .debug_rnglists
    0.000    0.000    0.000          .comment
    0.246    0.060    0.078          .debug_abbrev
    0.591    0.144    0.182          .debug_aranges
    0.000    0.000    0.000          .shstrtab
    0.000    0.000    0.000          .gdb_index
    1.966    0.404    0.594          .debug_line
    2.833    0.592    0.857          .debug_str
    0.001    0.000    0.001          .debug_line_str
    0.000    0.000    0.000          SHDR
    0.000    0.000    0.000          .rela.dyn
    0.000    0.000    0.000          .rela.plt
    0.043    0.159    0.051        GdbIndexSection::write_address_areas
    0.000    0.000    0.000        sort_dynamic_relocs
    0.000    0.000    0.000        clear_padding
    6.128    0.148    1.580        build_id
    0.000    0.000    0.000      close_file

moncefmechri avatar Mar 20 '23 13:03 moncefmechri

I'm sorry, I meant to ask you to run it pinned to a single socket rather than a single core. I was wondering whether inter-socket traffic is a bottleneck.

rui314 avatar Mar 21 '23 01:03 rui314

I think --cpunodebind=0 --membind=0 already binds it to the first socket. At least this was how it was tested in #937.

ishitatsuyuki avatar Mar 21 '23 02:03 ishitatsuyuki

I think --cpunodebind=0 --membind=0 already binds it to the first socket. At least this was how it was tested in #937.

That's correct. All the numbers shared in this issue were collected by prepending numactl --cpunodebind=0 --membind=0, which does bind to a single NUMA node. Without it, performance is roughly 20% worse and there is a lot more variance (as @ishitatsuyuki said, this was discovered in #937).
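(For anyone reproducing this setup, the CPU and memory layout of the NUMA nodes can be inspected before picking a node to bind to:)

$ numactl --hardware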

moncefmechri avatar Mar 21 '23 08:03 moncefmechri

mold is faster, but I think a fair comparison needs to link mimalloc.a into lld, as mimalloc alone accounts for a 10+% performance improvement, e.g.

-DCMAKE_EXE_LINKER_FLAGS=$HOME/Dev/mimalloc/out/release/libmimalloc.a -DLLVM_ENABLE_PIC=off -DCMAKE_{C,CXX}_FLAGS=-fno-asynchronous-unwind-tables
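(For context, a sketch of a full configure line using those options, assuming an LLVM monorepo checkout, Ninja, and a prebuilt static libmimalloc.a at that path:)

$ cmake -S llvm -B build -G Ninja -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS=lld -DLLVM_ENABLE_PIC=off \
    -DCMAKE_C_FLAGS=-fno-asynchronous-unwind-tables \
    -DCMAKE_CXX_FLAGS=-fno-asynchronous-unwind-tables \
    -DCMAKE_EXE_LINKER_FLAGS=$HOME/Dev/mimalloc/out/release/libmimalloc.a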

MaskRay avatar Jul 27 '23 00:07 MaskRay

mold is faster, but I think a fair comparison needs to link mimalloc.a into lld, as mimalloc alone accounts for a 10+% performance improvement, e.g.

-DCMAKE_EXE_LINKER_FLAGS=$HOME/Dev/mimalloc/out/release/libmimalloc.a -DLLVM_ENABLE_PIC=off -DCMAKE_{C,CXX}_FLAGS=-fno-asynchronous-unwind-tables

While I agree there would be value in benchmarking lld + mimalloc vs mold, I don't think the comparison as it's currently done is "unfair": I expect most lld users to get lld from their package manager (as we currently do), and to my knowledge (please correct me if I'm wrong), those packages won't be linked against mimalloc.

Defaults matter, as they are what most people will use.

moncefmechri avatar Jul 31 '23 11:07 moncefmechri