rustc_codegen_cranelift
                                
Use multiple threads for codegen
This will need parallel rustc and the cranelift module interface to be Sync.
It should be possible to do this without rustc and cranelift changes by batching all cranelift ir in a cgu together and handing the batch and the Module to a background thread for compilation. The main thread can then process the next cgu.
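A minimal sketch of that batching idea, using only the standard library. All types and functions here (`FuncIr`, `CguBatch`, `CompiledCgu`, `lower_cgu`, `compile_batch`, `codegen_all`) are hypothetical stand-ins for illustration, not the actual cg_clif or cranelift API. The main thread produces one owned batch per cgu and sends it over a channel; a single background thread compiles batches while the main thread moves on to the next cgu.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the IR of one function.
struct FuncIr {
    name: String,
    body: String,
}

/// Hypothetical stand-in for all IR batched up from one cgu.
struct CguBatch {
    cgu_name: String,
    funcs: Vec<FuncIr>,
}

/// Hypothetical stand-in for the compiled output of one cgu.
#[derive(Debug, PartialEq)]
struct CompiledCgu {
    cgu_name: String,
    code_size: usize,
}

/// Main-thread step. In a real backend this is the part that needs compiler
/// context, so it cannot move to another thread.
fn lower_cgu(cgu_name: &str, func_names: &[&str]) -> CguBatch {
    let funcs = func_names
        .iter()
        .map(|n| FuncIr {
            name: n.to_string(),
            body: format!("fn {}() {{ return }}", n),
        })
        .collect();
    CguBatch { cgu_name: cgu_name.to_string(), funcs }
}

/// Background step. It only touches the owned batch, so it is free to run on
/// any thread. The "compilation" here is a placeholder computation.
fn compile_batch(batch: CguBatch) -> CompiledCgu {
    let code_size = batch.funcs.iter().map(|f| f.name.len() + f.body.len()).sum();
    CompiledCgu { cgu_name: batch.cgu_name, code_size }
}

/// Lower each cgu on the calling thread and compile it on a worker thread.
/// Results come back in the order the batches were sent.
fn codegen_all(cgus: &[(&str, Vec<&str>)]) -> Vec<CompiledCgu> {
    let (tx, rx) = mpsc::channel::<CguBatch>();
    let worker = thread::spawn(move || rx.into_iter().map(compile_batch).collect::<Vec<_>>());

    for (name, funcs) in cgus {
        // While the worker compiles the previous batch, this thread lowers the next cgu.
        tx.send(lower_cgu(name, funcs)).expect("worker thread exited early");
    }
    drop(tx); // Closing the channel lets the worker's loop end.

    worker.join().expect("worker thread panicked")
}
```

Calling `codegen_all(&[("cgu0", vec!["a", "b"]), ("cgu1", vec!["c"])])` returns one `CompiledCgu` per input cgu. A single worker keeps the sketch simple; a pool of workers would need the output order to be restored explicitly.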
Working on this over at https://github.com/bjorn3/rustc_codegen_cranelift/tree/parallel_comp_refactor
Some preliminary results to show the benefits of parallelization. Note that debuginfo emission is currently disabled, as it requires TyCtxt and removing this dependency requires extensive refactoring.
dev-desktop.infra.rust-lang.org (AMD EPYC 7R32, virtualized, HT, 32 threads out of 96 available to the VM)
This is a beefy machine provided by the Rust Foundation. Lack of parallelization disadvantages cg_clif on this machine so much that it is slower than cg_llvm.
Before (523f0db7dbfdc9d8d5644accaf536902cbf62a4a)
Benchmark 1: RUSTFLAGS='' cargo build
  Time (mean ± σ):      6.843 s ±  0.037 s    [User: 24.153 s, System: 4.386 s]
  Range (min … max):    6.791 s …  6.894 s    10 runs
 
Benchmark 2: "/home/gh-bjorn3/cg_clif/build/cargo-clif" "build"
  Time (mean ± σ):      8.179 s ±  0.239 s    [User: 16.630 s, System: 3.347 s]
  Range (min … max):    8.032 s …  8.785 s    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  'RUSTFLAGS='' cargo build' ran
    1.20 ± 0.04 times faster than '"/home/gh-bjorn3/cg_clif/build/cargo-clif" "build"'
After (575f4b81ee49ecf3c92d06d5568172df2fb5aa6d)
Benchmark 1: RUSTFLAGS='' cargo build
  Time (mean ± σ):      6.870 s ±  0.067 s    [User: 24.296 s, System: 4.286 s]
  Range (min … max):    6.807 s …  6.989 s    10 runs
 
Benchmark 2: "/home/gh-bjorn3/cg_clif/build/cargo-clif" "build"
  Time (mean ± σ):      6.258 s ±  0.036 s    [User: 16.206 s, System: 3.406 s]
  Range (min … max):    6.224 s …  6.343 s    10 runs
 
Summary
  '"/home/gh-bjorn3/cg_clif/build/cargo-clif" "build"' ran
    1.10 ± 0.01 times faster than 'RUSTFLAGS='' cargo build'
personal laptop (Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz, HT, 4 threads total)
This is the system on which I developed cg_clif almost exclusively before I got access to dev-desktop.infra.rust-lang.org. It doesn't have a lot of cores, so even without parallelization the cpu is kept busy when compiling simple-raytracer, because enough rustc instances get spawned.
Before (523f0db7dbfdc9d8d5644accaf536902cbf62a4a)
Benchmark 1: RUSTFLAGS='' cargo build
  Time (mean ± σ):     17.986 s ±  1.295 s    [User: 50.186 s, System: 5.098 s]
  Range (min … max):   17.277 s … 21.542 s    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: "/home/bjorn/Projects/cg_clif2/build/cargo-clif" "build"
  Time (mean ± σ):     14.986 s ±  0.323 s    [User: 29.853 s, System: 4.491 s]
  Range (min … max):   14.465 s … 15.550 s    10 runs
 
Summary
  '"/home/bjorn/Projects/cg_clif2/build/cargo-clif" "build"' ran
    1.20 ± 0.09 times faster than 'RUSTFLAGS='' cargo build'
After (9750231076b2f2a40497876ac5e6f6eb36bcbde2)
These results are with a patch (9750231076b2f2a40497876ac5e6f6eb36bcbde2) applied that works around a deadlock that occurs when there are as many rustc instances as cpu cores, because the implicit jobserver token is not accounted for. As a side effect, the fix causes more threads to be spawned than there are cores, which causes contention. This may have negatively affected the results a bit.
Benchmark 1: RUSTFLAGS='' cargo build
  Time (mean ± σ):     17.491 s ±  0.200 s    [User: 49.201 s, System: 5.063 s]
  Range (min … max):   17.255 s … 17.840 s    10 runs
 
Benchmark 2: "/home/bjorn/Projects/cg_clif2/build/cargo-clif" "build"
  Time (mean ± σ):     12.362 s ±  0.067 s    [User: 30.262 s, System: 4.344 s]
  Range (min … max):   12.270 s … 12.486 s    10 runs
 
Summary
  '"/home/bjorn/Projects/cg_clif2/build/cargo-clif" "build"' ran
    1.41 ± 0.02 times faster than 'RUSTFLAGS='' cargo build'
Wall time results for rustc-perf as of 08f5c99170712d639b15d9ba7d75ec90881b6964 (which is part of a wip branch with parallel comp enabled). Almost entirely positive now with parallel compilation:

The main thing I'm trying to figure out right now is how to make jobserver behave. If I ignore the implicit jobserver token, I get deadlocks when as many rustc instances are spawned as there are cpu cores. If I give up the implicit token, too many rustc processes get spawned at the same time. I have to find some way to not give up the implicit token but still use it to get work done.
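To make the token accounting concrete, here is a toy model of the three options. It rests on an assumption that is not confirmed by anything above: the jobserver has `total_jobs` slots and every running rustc process occupies one of them as its implicit token. The names `Jobserver` and `ImplicitTokenPolicy` are hypothetical and this is not the real jobserver protocol or the `jobserver` crate API, only a sketch of the counting.

```rust
/// What a process does with the token it implicitly holds by running at all.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ImplicitTokenPolicy {
    /// Never do codegen work on the implicit token; always wait for a pool token.
    Ignore,
    /// Hand the implicit token back to the pool while waiting for work.
    GiveUp,
    /// Do codegen work on the implicit token; take pool tokens only for extra threads.
    Use,
}

/// Toy jobserver state (assumption: one slot is consumed per running process).
#[derive(Debug, Clone, Copy)]
struct Jobserver {
    total_jobs: usize,
    running_processes: usize,
}

impl Jobserver {
    /// Tokens left in the pool when every process keeps its implicit token.
    fn free_tokens(&self) -> usize {
        self.total_jobs.saturating_sub(self.running_processes)
    }

    /// Tokens that anyone, including the build driver that spawns new
    /// processes, can currently take from the pool.
    fn visible_tokens(&self, policy: ImplicitTokenPolicy) -> usize {
        match policy {
            // Every process returned its own token, so the pool looks fuller
            // than the machine really is and more processes can be spawned.
            ImplicitTokenPolicy::GiveUp => self.free_tokens() + self.running_processes,
            ImplicitTokenPolicy::Ignore | ImplicitTokenPolicy::Use => self.free_tokens(),
        }
    }

    /// Whether at least one running process can do codegen work right now.
    fn can_progress(&self, policy: ImplicitTokenPolicy) -> bool {
        match policy {
            // Work runs on the implicit token, so no pool token is required.
            ImplicitTokenPolicy::Use => self.running_processes > 0,
            // Work requires a pool token; with none visible, everyone waits.
            ImplicitTokenPolicy::Ignore | ImplicitTokenPolicy::GiveUp => {
                self.visible_tokens(policy) > 0
            }
        }
    }
}
```

With `total_jobs` and `running_processes` both equal to the core count, `Ignore` cannot make progress (the deadlock), `GiveUp` makes every slot look free again (the over-spawning), and only `Use` makes progress without exposing extra tokens.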