[macOS] performance issue - slower than ld
I used the latest version of mold (694f973dc50d2d7f207849cf64f6cc1ddd00a987) on macOS 12.5 and compared its performance with Apple's built-in ld. Its performance is not as good as I expected: in my non-scientific benchmark, it is even slower than ld. Is there something I need to configure?
This is probably not the correct way to benchmark a linker, but here's how I did it:
- My project is cmake based. I compiled and linked all the programs and generated the binaries using cmake.
- I deleted all the generated binaries and ran the cmake build again. Since all the other build artifacts remained, the only things that needed to be regenerated were the binaries, so every program was relinked.
In my current setup, each build generates 43 binaries, with sizes like this:
-rwxr-xr-x 1 ss staff 114M Aug 26 06:10
-rwxr-xr-x 1 ss staff 159M Aug 26 06:11
-rwxr-xr-x 1 ss staff 27M Aug 26 06:10
-rwxr-xr-x 1 ss staff 123M Aug 26 06:10
-rwxr-xr-x 1 ss staff 133M Aug 26 06:14
-rwxr-xr-x 1 ss staff 143M Aug 26 06:14
-rwxr-xr-x 1 ss staff 12M Aug 26 06:08
-rwxr-xr-x 1 ss staff 13M Aug 26 06:08
-rwxr-xr-x 1 ss staff 67M Aug 26 06:14
-rwxr-xr-x 1 ss staff 245M Aug 26 06:10
-rwxr-xr-x 1 ss staff 50M Aug 26 06:12
-rwxr-xr-x 1 ss staff 266M Aug 26 06:14
-rwxr-xr-x 1 ss staff 49M Aug 26 06:13
-rwxr-xr-x 1 ss staff 115M Aug 26 06:15
-rwxr-xr-x 1 ss staff 41M Aug 26 06:14
-rwxr-xr-x 1 ss staff 10M Aug 26 06:08
-rwxr-xr-x 1 ss staff 13M Aug 26 06:12
-rwxr-xr-x 1 ss staff 248M Aug 26 06:15
-rwxr-xr-x 1 ss staff 35M Aug 26 06:11
-rwxr-xr-x 1 ss staff 57M Aug 26 06:12
-rwxr-xr-x 1 ss staff 12M Aug 26 06:13
-rwxr-xr-x 1 ss staff 11M Aug 26 06:10
-rwxr-xr-x 1 ss staff 651M Aug 26 06:11
-rwxr-xr-x 1 ss staff 141M Aug 26 06:15
-rwxr-xr-x 1 ss staff 11M Aug 26 06:14
-rwxr-xr-x 1 ss staff 763M Aug 26 06:11
-rwxr-xr-x 1 ss staff 87M Aug 26 06:12
-rwxr-xr-x 1 ss staff 643M Aug 26 06:15
-rwxr-xr-x 1 ss staff 11M Aug 26 06:13
-rwxr-xr-x 1 ss staff 94M Aug 26 06:14
-rwxr-xr-x 1 ss staff 70M Aug 26 06:14
-rwxr-xr-x 1 ss staff 166M Aug 26 06:14
-rwxr-xr-x 1 ss staff 249M Aug 26 06:14
-rwxr-xr-x 1 ss staff 11M Aug 26 06:15
-rwxr-xr-x 1 ss staff 80M Aug 26 06:12
-rwxr-xr-x 1 ss staff 16M Aug 26 06:08
-rwxr-xr-x 1 ss staff 18M Aug 26 06:08
-rwxr-xr-x 1 ss staff 20M Aug 26 06:08
-rwxr-xr-x 1 ss staff 16M Aug 26 06:12
-rwxr-xr-x 1 ss staff 2.3M Aug 26 06:12
-rwxr-xr-x 1 ss staff 9.6M Aug 26 06:14
-rwxr-xr-x 1 ss staff 7.3M Aug 26 06:08
-rwxr-xr-x 1 ss staff 7.2M Aug 26 06:12
- For mold, it took 532.151s to generate all the binaries
- For ld, it took 424.549s to generate all the binaries
- For zld (https://github.com/michaeleisel/zld), it took 349.012s to generate all the binaries
- I only repeated the experiment twice, locally on my MacBook with some other applications running, so the stats here may vary
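To put the three totals side by side, the relative slowdown can be computed directly from the numbers above. This is only a quick sketch in Python: the names link_seconds and relative_to are made up for illustration, and the zld figure is assumed to be in seconds like the other two.

```python
# Total wall-clock seconds to relink all 43 binaries, as reported above.
# Assumption: the zld value is in seconds like the other two.
link_seconds = {"mold": 532.151, "ld": 424.549, "zld": 349.012}


def relative_to(baseline: str, timings: dict[str, float]) -> dict[str, float]:
    """Return each linker's total time as a multiple of the baseline's time."""
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


ratios = relative_to("ld", link_seconds)

# Print fastest first so the ordering is easy to read.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}x the time of ld")
```

With these inputs, mold comes out at roughly 1.25x the time of ld and zld at roughly 0.82x. Since the experiment was only run twice on a busy machine, these ratios are indicative at best.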
I used the following setup for the build:
- cmake 3.24.1
- ninja 1.11.0
- -j 10 when doing the cmake build. In Activity Monitor, I can see at most 10 mold processes while the build is running, each consuming roughly 50% to 350% CPU.
Could you please help me figure out what I may be doing wrong here? Let me know if more information is needed. Thanks.
Could you add the -Wl,-perf flag to the linker command line? It should print out detailed performance numbers. I think one of our internal passes is slow.
I added the -Wl,-perf flag and did some experiments. Here is what I found:
- Previously in my test, I built the entire project with many binaries (cmake targets; there are 43 binaries, as I said above). With the perf flag turned on, it shows numbers like the ones below for one of the binaries that linked slower than with ld (let us call this target T1):
44.921 1.137 37.195 all
0.088 0.019 0.859 read_input_files
43.448 0.695 34.436 parse_object_files
0.064 0.126 0.153 resolve_symbols
0.019 0.025 0.030 remove_unreferenced_subsections
0.000 0.000 0.000 handle_exported_symbols_list
0.000 0.000 0.000 handle_unexported_symbols_list
0.001 0.000 0.001 claim_unresolved_symbols
0.011 0.004 0.094 create_synthetic_chunks
0.036 0.052 0.087 merge_cstring_sections
0.007 0.000 0.004 uniquify_cstrings
0.003 0.000 0.004 export_symbols
0.293 0.036 0.426 assign_offsets
0.087 0.024 0.370 __TEXT
0.000 0.000 0.000 __mach_header
0.003 0.000 0.018 __text
0.000 0.000 0.000 __stubs
0.000 0.000 0.000 __stub_helper
0.000 0.000 0.000 __gcc_except_tab
0.000 0.000 0.000 __cstring
0.083 0.023 0.351 __unwind_info
0.000 0.000 0.000 __const
0.000 0.000 0.000 __eh_frame
0.000 0.000 0.000 __literal16
0.000 0.000 0.000 __literal4
0.000 0.000 0.000 __literal8
0.000 0.000 0.000 __DATA_CONST
0.000 0.000 0.000 __got
0.000 0.000 0.000 __const
0.000 0.000 0.000 __mod_init_func
0.000 0.000 0.000 __DATA
0.000 0.000 0.000 __la_symbol_ptr
0.000 0.000 0.000 __thread_ptrs
0.000 0.000 0.000 __data
0.000 0.000 0.000 __thread_data
0.000 0.000 0.000 __thread_vars
0.206 0.012 0.055 __LINKEDIT
0.196 0.011 0.044 __rebase
0.184 0.009 0.029 __func_starts
0.002 0.000 0.000 __lazy_binding
0.144 0.004 0.017 __binding
0.206 0.012 0.053 __export
0.155 0.004 0.019 __symbol_table
0.000 0.000 0.000 __data_in_code
0.000 0.000 0.000 __string_table
0.027 0.072 0.389 open_file
0.430 0.056 0.523 copy_sections_to_output_file
0.430 0.056 0.523 __TEXT
0.000 0.000 0.000 __mach_header
0.124 0.034 0.056 __text
0.184 0.049 0.441 __unwind_info
0.000 0.000 0.000 __stub_helper
0.010 0.004 0.006 __gcc_except_tab
0.003 0.002 0.002 __literal16
0.000 0.000 0.000 __stubs
0.003 0.001 0.001 __cstring
0.000 0.000 0.000 __literal4
0.000 0.000 0.000 __literal8
0.009 0.001 0.002 __const
0.066 0.024 0.026 __eh_frame
0.018 0.001 0.003 __DATA
0.000 0.000 0.000 __la_symbol_ptr
0.000 0.000 0.000 __thread_ptrs
0.018 0.001 0.003 __data
0.000 0.000 0.000 __thread_vars
0.001 0.000 0.000 __thread_bss
0.000 0.000 0.000 __thread_data
0.013 0.001 0.002 __DATA_CONST
0.002 0.000 0.001 __got
0.007 0.001 0.001 __const
0.004 0.000 0.001 __mod_init_func
0.356 0.046 0.132 __LINKEDIT
0.000 0.000 0.000 __rebase
0.001 0.000 0.000 __binding
0.002 0.000 0.001 __func_starts
0.000 0.000 0.000 __lazy_binding
0.187 0.002 0.017 __export
0.353 0.045 0.125 __symbol_table
0.000 0.000 0.000 __string_table
0.000 0.000 0.000 __data_in_code
0.455 0.002 0.046 copy_sections_to_output_file
0.001 0.047 0.077 close_file
- To make the comparison more straightforward, instead of building the entire project I specified the single target T1 to cmake and asked it to link only this target. The perf stats printed when I build the single target are faster than the same target's stats when building the entire project (the stats above). Here are the stats when I build only T1:
User System Real Name
23.687 0.470 3.485 all
0.046 0.016 0.062 read_input_files
22.398 0.355 3.114 parse_object_files
0.108 0.001 0.013 resolve_symbols
0.032 0.000 0.003 remove_unreferenced_subsections
0.000 0.000 0.000 handle_exported_symbols_list
0.000 0.000 0.000 handle_unexported_symbols_list
0.001 0.000 0.000 claim_unresolved_symbols
0.011 0.001 0.006 create_synthetic_chunks
0.040 0.001 0.007 merge_cstring_sections
0.005 0.001 0.001 uniquify_cstrings
0.022 0.000 0.002 export_symbols
0.212 0.022 0.071 assign_offsets
0.046 0.011 0.056 __TEXT
0.000 0.000 0.000 __mach_header
0.002 0.000 0.002 __text
0.000 0.000 0.000 __stubs
0.000 0.000 0.000 __stub_helper
0.000 0.000 0.000 __gcc_except_tab
0.000 0.000 0.000 __cstring
0.043 0.011 0.054 __unwind_info
0.000 0.000 0.000 __const
0.000 0.000 0.000 __eh_frame
0.000 0.000 0.000 __literal16
0.000 0.000 0.000 __literal4
0.000 0.000 0.000 __literal8
0.000 0.000 0.000 __DATA_CONST
0.000 0.000 0.000 __got
0.000 0.000 0.000 __const
0.000 0.000 0.000 __mod_init_func
0.000 0.000 0.000 __DATA
0.000 0.000 0.000 __la_symbol_ptr
0.000 0.000 0.000 __thread_ptrs
0.000 0.000 0.000 __data
0.000 0.000 0.000 __thread_data
0.000 0.000 0.000 __thread_vars
0.167 0.011 0.015 __LINKEDIT
0.126 0.006 0.011 __rebase
0.139 0.008 0.011 __func_starts
0.001 0.000 0.000 __lazy_binding
0.127 0.006 0.011 __binding
0.077 0.002 0.006 __symbol_table
0.000 0.000 0.000 __data_in_code
0.167 0.011 0.014 __export
0.000 0.000 0.000 __string_table
0.021 0.026 0.047 open_file
0.350 0.030 0.083 copy_sections_to_output_file
0.350 0.029 0.083 __TEXT
0.000 0.000 0.000 __mach_header
0.120 0.013 0.010 __text
0.200 0.028 0.069 __unwind_info
0.002 0.000 0.000 __literal16
0.000 0.000 0.000 __stub_helper
0.005 0.001 0.001 __const
0.025 0.003 0.002 __gcc_except_tab
0.009 0.001 0.001 __cstring
0.118 0.012 0.009 __eh_frame
0.001 0.000 0.000 __literal4
0.000 0.000 0.000 __stubs
0.000 0.000 0.000 __literal8
0.002 0.000 0.001 __DATA
0.000 0.000 0.000 __la_symbol_ptr
0.000 0.000 0.000 __thread_ptrs
0.002 0.000 0.000 __data
0.000 0.000 0.000 __thread_vars
0.000 0.000 0.000 __thread_bss
0.000 0.000 0.000 __thread_data
0.040 0.001 0.007 __DATA_CONST
0.004 0.000 0.001 __got
0.006 0.000 0.001 __const
0.004 0.000 0.001 __mod_init_func
0.316 0.025 0.044 __LINKEDIT
0.000 0.000 0.000 __rebase
0.001 0.000 0.000 __func_starts
0.000 0.000 0.000 __binding
0.316 0.025 0.044 __symbol_table
0.000 0.000 0.000 __lazy_binding
0.144 0.001 0.011 __export
0.000 0.000 0.000 __data_in_code
0.000 0.000 0.000 __string_table
0.393 0.001 0.033 copy_sections_to_output_file
0.000 0.017 0.027 close_file
- I did some more tests with single-target linking, and depending on the target, mold is 1x~3x the speed of ld.
So it seems that mold's performance varies between the following two cases (even when linking the same binary) for some reason:
- cmake building multiple targets
- cmake building only one target
I will do some more tests to see if I can find something. At the same time, do you have any idea what could go wrong here?
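To compare the two runs without eyeballing the tables, the rows can be parsed and the share of wall-clock time per pass computed. A rough sketch in Python, assuming the -perf rows follow the "User System Real Name" layout shown above. parse_perf and real_share are illustrative helpers, not part of mold, and the sample rows are copied from the two outputs in this comment.

```python
# First rows of the T1 stats from the full build and the single-target build.
FULL_BUILD = """\
44.921 1.137 37.195 all
0.088 0.019 0.859 read_input_files
43.448 0.695 34.436 parse_object_files
"""

SINGLE_TARGET = """\
User System Real Name
23.687 0.470 3.485 all
0.046 0.016 0.062 read_input_files
22.398 0.355 3.114 parse_object_files
"""


def parse_perf(text: str) -> dict[str, tuple[float, float, float]]:
    """Map pass name -> (user, system, real) seconds.

    Names such as __TEXT repeat further down the output, so only the
    first row seen for each name is kept.
    """
    rows: dict[str, tuple[float, float, float]] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue
        try:
            user, system, real = (float(p) for p in parts[:3])
        except ValueError:
            continue  # skip the "User System Real Name" header
        rows.setdefault(parts[3], (user, system, real))
    return rows


def real_share(rows: dict[str, tuple[float, float, float]], name: str) -> float:
    """Fraction of the total real time ("all") spent in the named pass."""
    return rows[name][2] / rows["all"][2]


for label, text in (("full build", FULL_BUILD), ("single target", SINGLE_TARGET)):
    share = real_share(parse_perf(text), "parse_object_files")
    print(f"{label}: parse_object_files is {share:.1%} of real time")
```

For the rows above, parse_object_files accounts for roughly 93% of the real time in the full build and roughly 89% in the single-target build, so the same pass is the bulk of the time in both cases.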
Thanks! So it looks like parse_object_files dominates, which is good news because it's obvious that we are doing something weird in that pass. My wild guess is that the command line contains duplicate files (such as the same .so or .a) and we are parsing the same file again and again.
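One way to check this guess is to look for inputs that appear more than once in the link command (for example, as printed by ninja -v). A small sketch in Python: the command line below is invented for illustration, find_duplicate_inputs is a hypothetical helper rather than anything in mold, and the list of file suffixes is an assumption.

```python
import shlex
from collections import Counter

# Assumption for this sketch: arguments with these suffixes are linker inputs.
INPUT_SUFFIXES = (".o", ".a", ".dylib", ".so", ".tbd")


def find_duplicate_inputs(command: str) -> dict[str, int]:
    """Return {path: count} for input files listed more than once."""
    args = shlex.split(command)
    inputs = [arg for arg in args if arg.endswith(INPUT_SUFFIXES)]
    counts = Counter(inputs)
    return {path: n for path, n in counts.items() if n > 1}


# Made-up link command, only to show the shape of the check.
example = "c++ -o T1 main.o libcore.a libutil.a libcore.a libcore.a"
print(find_duplicate_inputs(example))
```

The comparison is purely textual, so the same file spelled two different ways (say, a relative and an absolute path) would not be reported. Normalizing with os.path.realpath before counting would tighten that up.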