mold [macOS] performance issue

I tried the latest version of mold (694f973dc50d2d7f207849cf64f6cc1ddd00a987) on macOS 12.5 and compared it with Apple's built-in ld. Its performance is not as good as I expected: in my non-scientific benchmark, it is even slower than ld. Is there something I need to configure?

This is probably not the correct way to benchmark a linker, but here is how I did it:

My project is CMake-based. I compiled and linked all the programs and generated the binaries using cmake.
I then deleted all the generated binaries and ran the cmake build again. Since all the other build artifacts remained, the only work left was to relink every program and regenerate the binaries.
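
The two steps above can be sketched as a small helper. This is only an illustration of the procedure as described, not the script that was actually used: the function name `relink_and_time` is made up, and the binary paths and build command have to be supplied by the caller.

```python
import subprocess
import time
from pathlib import Path


def relink_and_time(binaries, build_cmd):
    """Delete the given linked binaries, rerun the build, return wall-clock seconds.

    Object files and other build artifacts are left untouched, so the
    rebuild should only have to relink the deleted binaries.
    """
    for path in binaries:
        Path(path).unlink(missing_ok=True)  # a binary that is already gone is fine
    start = time.perf_counter()
    subprocess.run(build_cmd, check=True)  # raises CalledProcessError if the build fails
    return time.perf_counter() - start
```

A call might look like `relink_and_time(["build/bin/app1"], ["cmake", "--build", "build", "-j", "10"])`, where `build/bin/app1` is a hypothetical path. The result is a single wall-clock measurement, so it is only as reliable as the machine is idle.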

In my current setup, each build generates 43 binaries, with sizes like these:

-rwxr-xr-x  1 ss  staff   114M Aug 26 06:10 
-rwxr-xr-x  1 ss  staff   159M Aug 26 06:11 
-rwxr-xr-x  1 ss  staff    27M Aug 26 06:10 
-rwxr-xr-x  1 ss  staff   123M Aug 26 06:10 
-rwxr-xr-x  1 ss  staff   133M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff   143M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff    12M Aug 26 06:08 
-rwxr-xr-x  1 ss  staff    13M Aug 26 06:08 
-rwxr-xr-x  1 ss  staff    67M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff   245M Aug 26 06:10 
-rwxr-xr-x  1 ss  staff    50M Aug 26 06:12 
-rwxr-xr-x  1 ss  staff   266M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff    49M Aug 26 06:13 
-rwxr-xr-x  1 ss  staff   115M Aug 26 06:15 
-rwxr-xr-x  1 ss  staff    41M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff    10M Aug 26 06:08 
-rwxr-xr-x  1 ss  staff    13M Aug 26 06:12 
-rwxr-xr-x  1 ss  staff   248M Aug 26 06:15 
-rwxr-xr-x  1 ss  staff    35M Aug 26 06:11 
-rwxr-xr-x  1 ss  staff    57M Aug 26 06:12 
-rwxr-xr-x  1 ss  staff    12M Aug 26 06:13 
-rwxr-xr-x  1 ss  staff    11M Aug 26 06:10 
-rwxr-xr-x  1 ss  staff   651M Aug 26 06:11 
-rwxr-xr-x  1 ss  staff   141M Aug 26 06:15 
-rwxr-xr-x  1 ss  staff    11M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff   763M Aug 26 06:11 
-rwxr-xr-x  1 ss  staff    87M Aug 26 06:12 
-rwxr-xr-x  1 ss  staff   643M Aug 26 06:15 
-rwxr-xr-x  1 ss  staff    11M Aug 26 06:13 
-rwxr-xr-x  1 ss  staff    94M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff    70M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff   166M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff   249M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff    11M Aug 26 06:15 
-rwxr-xr-x  1 ss  staff    80M Aug 26 06:12 
-rwxr-xr-x  1 ss  staff    16M Aug 26 06:08 
-rwxr-xr-x  1 ss  staff    18M Aug 26 06:08 
-rwxr-xr-x  1 ss  staff    20M Aug 26 06:08 
-rwxr-xr-x  1 ss  staff    16M Aug 26 06:12 
-rwxr-xr-x  1 ss  staff   2.3M Aug 26 06:12 
-rwxr-xr-x  1 ss  staff   9.6M Aug 26 06:14 
-rwxr-xr-x  1 ss  staff   7.3M Aug 26 06:08 
-rwxr-xr-x  1 ss  staff   7.2M Aug 26 06:12

With mold, it took 532.151s to generate all the binaries.
With ld, it took 424.549s to generate all the binaries.
With zld (https://github.com/michaeleisel/zld), it took 349.012s to generate all the binaries.
I only repeated the experiment twice, locally on my MacBook with some other applications running, so the numbers may vary.
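
To put the three totals side by side, here is a short calculation using only the numbers reported above. It shows mold taking about 25% longer than ld, and zld taking about 18% less time than ld. The helper name `relative_to` is just for illustration.

```python
# Wall-clock seconds reported above for relinking all 43 binaries.
timings = {"mold": 532.151, "ld": 424.549, "zld": 349.012}


def relative_to(baseline, timings):
    """Return each linker's total time divided by the baseline linker's time."""
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


ratios = relative_to("ld", timings)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}x the time of ld")
# zld: 0.82x the time of ld
# ld: 1.00x the time of ld
# mold: 1.25x the time of ld
```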

I used the following setup for the build:

cmake 3.24.1
ninja 1.11.0
-j 10 for the cmake build. In Activity Monitor, I can see at most 10 mold processes during the build, each consuming roughly 50% to 350% CPU.

Could you help me figure out what I may be doing wrong here? Let me know if more information is needed. Thanks.

Aug 25 '22 22:08 niyue

Could you add the -Wl,-perf flag to the linker command line? It should print out detailed performance numbers. I think one of our internal passes is slow.

Aug 25 '22 23:08 rui314

I added the -Wl,-perf flag and did some experiments. Here is what I found:

Previously, in my test, I built the entire project with many binaries (cmake targets; there are 43, as I said above). With the perf flag turned on, it prints numbers like the ones below for one of the binaries that linked more slowly than with ld (let us call this target T1):

  44.921    1.137   37.195  all
    0.088    0.019    0.859    read_input_files
   43.448    0.695   34.436    parse_object_files
    0.064    0.126    0.153    resolve_symbols
    0.019    0.025    0.030    remove_unreferenced_subsections
    0.000    0.000    0.000    handle_exported_symbols_list
    0.000    0.000    0.000    handle_unexported_symbols_list
    0.001    0.000    0.001    claim_unresolved_symbols
    0.011    0.004    0.094    create_synthetic_chunks
    0.036    0.052    0.087    merge_cstring_sections
    0.007    0.000    0.004      uniquify_cstrings
    0.003    0.000    0.004    export_symbols
    0.293    0.036    0.426    assign_offsets
    0.087    0.024    0.370      __TEXT
    0.000    0.000    0.000        __mach_header
    0.003    0.000    0.018        __text
    0.000    0.000    0.000        __stubs
    0.000    0.000    0.000        __stub_helper
    0.000    0.000    0.000        __gcc_except_tab
    0.000    0.000    0.000        __cstring
    0.083    0.023    0.351        __unwind_info
    0.000    0.000    0.000        __const
    0.000    0.000    0.000        __eh_frame
    0.000    0.000    0.000        __literal16
    0.000    0.000    0.000        __literal4
    0.000    0.000    0.000        __literal8
    0.000    0.000    0.000      __DATA_CONST
    0.000    0.000    0.000        __got
    0.000    0.000    0.000        __const
    0.000    0.000    0.000        __mod_init_func
    0.000    0.000    0.000      __DATA
    0.000    0.000    0.000        __la_symbol_ptr
    0.000    0.000    0.000        __thread_ptrs
    0.000    0.000    0.000        __data
    0.000    0.000    0.000        __thread_data
    0.000    0.000    0.000        __thread_vars
    0.206    0.012    0.055      __LINKEDIT
    0.196    0.011    0.044        __rebase
    0.184    0.009    0.029        __func_starts
    0.002    0.000    0.000        __lazy_binding
    0.144    0.004    0.017        __binding
    0.206    0.012    0.053        __export
    0.155    0.004    0.019        __symbol_table
    0.000    0.000    0.000        __data_in_code
    0.000    0.000    0.000        __string_table
    0.027    0.072    0.389    open_file
    0.430    0.056    0.523    copy_sections_to_output_file
    0.430    0.056    0.523      __TEXT
    0.000    0.000    0.000        __mach_header
    0.124    0.034    0.056        __text
    0.184    0.049    0.441        __unwind_info
    0.000    0.000    0.000        __stub_helper
    0.010    0.004    0.006        __gcc_except_tab
    0.003    0.002    0.002        __literal16
    0.000    0.000    0.000        __stubs
    0.003    0.001    0.001        __cstring
    0.000    0.000    0.000        __literal4
    0.000    0.000    0.000        __literal8
    0.009    0.001    0.002        __const
    0.066    0.024    0.026        __eh_frame
    0.018    0.001    0.003      __DATA
    0.000    0.000    0.000        __la_symbol_ptr
    0.000    0.000    0.000        __thread_ptrs
    0.018    0.001    0.003        __data
    0.000    0.000    0.000        __thread_vars
    0.001    0.000    0.000        __thread_bss
    0.000    0.000    0.000        __thread_data
    0.013    0.001    0.002      __DATA_CONST
    0.002    0.000    0.001        __got
    0.007    0.001    0.001        __const
    0.004    0.000    0.001        __mod_init_func
    0.356    0.046    0.132      __LINKEDIT
    0.000    0.000    0.000        __rebase
    0.001    0.000    0.000        __binding
    0.002    0.000    0.001        __func_starts
    0.000    0.000    0.000        __lazy_binding
    0.187    0.002    0.017        __export
    0.353    0.045    0.125        __symbol_table
    0.000    0.000    0.000        __string_table
    0.000    0.000    0.000        __data_in_code
    0.455    0.002    0.046    copy_sections_to_output_file
    0.001    0.047    0.077    close_file

To make the comparison more straightforward, instead of building the entire project, I passed the single target T1 to cmake so that it links only this target. The perf stats printed when I build this single target show a faster link than the stats for the same target when building the entire project (the stats above). Here are the stats when I build only T1:

 User   System     Real  Name
   23.687    0.470    3.485  all
    0.046    0.016    0.062    read_input_files
   22.398    0.355    3.114    parse_object_files
    0.108    0.001    0.013    resolve_symbols
    0.032    0.000    0.003    remove_unreferenced_subsections
    0.000    0.000    0.000    handle_exported_symbols_list
    0.000    0.000    0.000    handle_unexported_symbols_list
    0.001    0.000    0.000    claim_unresolved_symbols
    0.011    0.001    0.006    create_synthetic_chunks
    0.040    0.001    0.007    merge_cstring_sections
    0.005    0.001    0.001      uniquify_cstrings
    0.022    0.000    0.002    export_symbols
    0.212    0.022    0.071    assign_offsets
    0.046    0.011    0.056      __TEXT
    0.000    0.000    0.000        __mach_header
    0.002    0.000    0.002        __text
    0.000    0.000    0.000        __stubs
    0.000    0.000    0.000        __stub_helper
    0.000    0.000    0.000        __gcc_except_tab
    0.000    0.000    0.000        __cstring
    0.043    0.011    0.054        __unwind_info
    0.000    0.000    0.000        __const
    0.000    0.000    0.000        __eh_frame
    0.000    0.000    0.000        __literal16
    0.000    0.000    0.000        __literal4
    0.000    0.000    0.000        __literal8
    0.000    0.000    0.000      __DATA_CONST
    0.000    0.000    0.000        __got
    0.000    0.000    0.000        __const
    0.000    0.000    0.000        __mod_init_func
    0.000    0.000    0.000      __DATA
    0.000    0.000    0.000        __la_symbol_ptr
    0.000    0.000    0.000        __thread_ptrs
    0.000    0.000    0.000        __data
    0.000    0.000    0.000        __thread_data
    0.000    0.000    0.000        __thread_vars
    0.167    0.011    0.015      __LINKEDIT
    0.126    0.006    0.011        __rebase
    0.139    0.008    0.011        __func_starts
    0.001    0.000    0.000        __lazy_binding
    0.127    0.006    0.011        __binding
    0.077    0.002    0.006        __symbol_table
    0.000    0.000    0.000        __data_in_code
    0.167    0.011    0.014        __export
    0.000    0.000    0.000        __string_table
    0.021    0.026    0.047    open_file
    0.350    0.030    0.083    copy_sections_to_output_file
    0.350    0.029    0.083      __TEXT
    0.000    0.000    0.000        __mach_header
    0.120    0.013    0.010        __text
    0.200    0.028    0.069        __unwind_info
    0.002    0.000    0.000        __literal16
    0.000    0.000    0.000        __stub_helper
    0.005    0.001    0.001        __const
    0.025    0.003    0.002        __gcc_except_tab
    0.009    0.001    0.001        __cstring
    0.118    0.012    0.009        __eh_frame
    0.001    0.000    0.000        __literal4
    0.000    0.000    0.000        __stubs
    0.000    0.000    0.000        __literal8
    0.002    0.000    0.001      __DATA
    0.000    0.000    0.000        __la_symbol_ptr
    0.000    0.000    0.000        __thread_ptrs
    0.002    0.000    0.000        __data
    0.000    0.000    0.000        __thread_vars
    0.000    0.000    0.000        __thread_bss
    0.000    0.000    0.000        __thread_data
    0.040    0.001    0.007      __DATA_CONST
    0.004    0.000    0.001        __got
    0.006    0.000    0.001        __const
    0.004    0.000    0.001        __mod_init_func
    0.316    0.025    0.044      __LINKEDIT
    0.000    0.000    0.000        __rebase
    0.001    0.000    0.000        __func_starts
    0.000    0.000    0.000        __binding
    0.316    0.025    0.044        __symbol_table
    0.000    0.000    0.000        __lazy_binding
    0.144    0.001    0.011        __export
    0.000    0.000    0.000        __data_in_code
    0.000    0.000    0.000        __string_table
    0.393    0.001    0.033    copy_sections_to_output_file
    0.000    0.017    0.027    close_file

I did some more tests with single-target linking. Depending on the target, mold is 1x to 3x as fast as ld.

So mold's performance seems to differ between the following two cases for some reason, even when linking the same binary:

cmake building multiple targets
cmake building only one target
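
A rough way to compare the two cases is to look at the `all` and `parse_object_files` rows of the two tables above. The sketch below assumes the first table uses the same User / System / Real / Name column order as the second one (the first table was pasted without a header). The helper name `parse_perf_line` is made up for this example.

```python
def parse_perf_line(line):
    """Split one row of the -perf table into (user, system, real, name)."""
    user, system, real, name = line.split()
    return float(user), float(system), float(real), name


# "all" and "parse_object_files" rows copied from the two tables above.
runs = {
    "full build": ("44.921 1.137 37.195 all",
                   "43.448 0.695 34.436 parse_object_files"),
    "T1 only": ("23.687 0.470 3.485 all",
                "22.398 0.355 3.114 parse_object_files"),
}

summary = {}
for label, (all_row, parse_row) in runs.items():
    user, _, real, _ = parse_perf_line(all_row)
    _, _, parse_real, _ = parse_perf_line(parse_row)
    summary[label] = {
        "parse_share": parse_real / real,  # fraction of wall time spent in parsing
        "user_per_real": user / real,      # user CPU seconds per wall-clock second
    }
    print(f"{label}: parse_object_files = {parse_real / real:.0%} of real time, "
          f"user/real = {user / real:.1f}")
# full build: parse_object_files = 93% of real time, user/real = 1.2
# T1 only: parse_object_files = 89% of real time, user/real = 6.8
```

In both cases parse_object_files accounts for roughly 90% of the wall-clock time. The user/real ratio drops from about 6.8 to about 1.2 in the full build, which would be consistent with the concurrent link jobs competing for CPU cores, although that is only one possible reading and it does not explain why the user time itself nearly doubles.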

I will do some more tests to see if I can find something. In the meantime, do you have any idea what could be going wrong here?

Aug 27 '22 11:08 niyue

Thanks! So it looks like parse_object_files dominates, which is good news because it means we are clearly doing something weird in that pass. My wild guess is that the command line contains duplicate files (such as the same .so or .a) and we are parsing the same file again and again.
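
One way to check that guess is to take the linker command line from a verbose build log and count how often each input file appears. The following is a minimal sketch, not part of mold: the function name `duplicate_inputs`, the suffix list, and the example command line are all made up for illustration, and libraries passed as `-lfoo` are not covered.

```python
import shlex
from collections import Counter

# File suffixes treated as linker inputs in this sketch (an assumption).
INPUT_SUFFIXES = (".o", ".a", ".so", ".dylib")


def duplicate_inputs(command_line):
    """Return the input files that appear more than once on a linker command line."""
    args = shlex.split(command_line)
    counts = Counter(arg for arg in args if arg.endswith(INPUT_SUFFIXES))
    return {path: n for path, n in counts.items() if n > 1}


# Made-up command line, for illustration only.
example = "c++ main.o libfoo.a libbar.a libfoo.a -o app"
print(duplicate_inputs(example))  # {'libfoo.a': 2}
```

An empty result would suggest that repeated inputs are not the cause, at least for files passed by path.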

Aug 28 '22 07:08 rui314
