rustc_codegen_cranelift

Use multiple threads for codegen

Open bjorn3 opened this issue 6 years ago • 2 comments

This will need parallel rustc, and the cranelift Module interface will need to be Sync.

bjorn3 avatar Aug 01 '19 11:08 bjorn3

It should be possible to do this without rustc and cranelift changes by batching all cranelift ir in a cgu together and handing it, along with the Module, to a background thread for compilation. The main thread can then process the next cgu.
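A minimal sketch of that batching idea, using only the standard library. It is not the cg_clif implementation: `FuncIr`, `CguBatch`, `CompiledCgu`, `lower_cgu` and `compile_batch` are hypothetical stand-ins for the Cranelift IR, the per-cgu Module state and the real compile step. It also spawns one thread per cgu without any limit, which a real implementation would have to bound.

```rust
use std::thread;

/// Hypothetical stand-in for the Cranelift IR of one function.
struct FuncIr {
    name: String,
    inst_count: usize,
}

/// Hypothetical stand-in for one cgu: all of its IR batched together,
/// owned by value so it can be moved to another thread.
struct CguBatch {
    cgu_name: String,
    funcs: Vec<FuncIr>,
}

/// Hypothetical result of compiling one cgu.
#[derive(Debug, PartialEq)]
struct CompiledCgu {
    cgu_name: String,
    object_size: usize,
}

/// Placeholder for the part that has to stay on the main thread
/// (in the real compiler, the lowering that needs compiler state).
fn lower_cgu(index: usize) -> CguBatch {
    CguBatch {
        cgu_name: format!("cgu{}", index),
        funcs: (0..3)
            .map(|i| FuncIr { name: format!("f{}", i), inst_count: 10 * (index + 1) })
            .collect(),
    }
}

/// Placeholder for the expensive step that no longer needs the main thread.
fn compile_batch(batch: CguBatch) -> CompiledCgu {
    let object_size: usize = batch.funcs.iter().map(|f| f.name.len() + f.inst_count).sum();
    CompiledCgu { cgu_name: batch.cgu_name, object_size }
}

/// The main thread lowers one cgu, hands the whole batch to a background
/// thread, and immediately moves on to the next cgu.
fn compile_all(cgu_count: usize) -> Vec<CompiledCgu> {
    let mut handles = Vec::new();
    for index in 0..cgu_count {
        let batch = lower_cgu(index);
        handles.push(thread::spawn(move || compile_batch(batch)));
    }
    // Joining in spawn order keeps the output order deterministic.
    handles.into_iter().map(|h| h.join().expect("worker panicked")).collect()
}

fn main() {
    for cgu in compile_all(4) {
        println!("{}: {} bytes", cgu.cgu_name, cgu.object_size);
    }
}
```

The point of the design is that each batch is moved into its worker by value, so nothing shared between threads has to be Sync.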

bjorn3 avatar Dec 07 '21 23:12 bjorn3

Working on this over at https://github.com/bjorn3/rustc_codegen_cranelift/tree/parallel_comp_refactor

bjorn3 avatar Aug 10 '22 18:08 bjorn3

Some preliminary results to show the benefits of parallelization. Note that debuginfo emission is currently disabled, as it requires TyCtxt and removing this dependency needs extensive refactoring.

dev-desktop.infra.rust-lang.org (AMD EPYC 7R32, virtualized, HT, 32 threads out of 96 available to the VM)

This is a beefy machine provided by the Rust Foundation. The lack of parallelization disadvantages cg_clif on this machine so much that it is slower than cg_llvm.

Before (523f0db7dbfdc9d8d5644accaf536902cbf62a4a)

Benchmark 1: RUSTFLAGS='' cargo build
  Time (mean ± σ):      6.843 s ±  0.037 s    [User: 24.153 s, System: 4.386 s]
  Range (min … max):    6.791 s …  6.894 s    10 runs
 
Benchmark 2: "/home/gh-bjorn3/cg_clif/build/cargo-clif" "build"
  Time (mean ± σ):      8.179 s ±  0.239 s    [User: 16.630 s, System: 3.347 s]
  Range (min … max):    8.032 s …  8.785 s    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  'RUSTFLAGS='' cargo build' ran
    1.20 ± 0.04 times faster than '"/home/gh-bjorn3/cg_clif/build/cargo-clif" "build"'

After (575f4b81ee49ecf3c92d06d5568172df2fb5aa6d)

Benchmark 1: RUSTFLAGS='' cargo build
  Time (mean ± σ):      6.870 s ±  0.067 s    [User: 24.296 s, System: 4.286 s]
  Range (min … max):    6.807 s …  6.989 s    10 runs
 
Benchmark 2: "/home/gh-bjorn3/cg_clif/build/cargo-clif" "build"
  Time (mean ± σ):      6.258 s ±  0.036 s    [User: 16.206 s, System: 3.406 s]
  Range (min … max):    6.224 s …  6.343 s    10 runs
 
Summary
  '"/home/gh-bjorn3/cg_clif/build/cargo-clif" "build"' ran
    1.10 ± 0.01 times faster than 'RUSTFLAGS='' cargo build'

personal laptop (Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz, HT, 4 threads total)

This is the system on which I developed cg_clif almost exclusively before I got access to dev-desktop.infra.rust-lang.org. It doesn't have a lot of cores, so even without parallelization the cpu is kept busy when compiling simple-raytracer, because enough rustc instances are spawned.

Before (523f0db7dbfdc9d8d5644accaf536902cbf62a4a)

Benchmark 1: RUSTFLAGS='' cargo build
  Time (mean ± σ):     17.986 s ±  1.295 s    [User: 50.186 s, System: 5.098 s]
  Range (min … max):   17.277 s … 21.542 s    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: "/home/bjorn/Projects/cg_clif2/build/cargo-clif" "build"
  Time (mean ± σ):     14.986 s ±  0.323 s    [User: 29.853 s, System: 4.491 s]
  Range (min … max):   14.465 s … 15.550 s    10 runs
 
Summary
  '"/home/bjorn/Projects/cg_clif2/build/cargo-clif" "build"' ran
    1.20 ± 0.09 times faster than 'RUSTFLAGS='' cargo build'

After (9750231076b2f2a40497876ac5e6f6eb36bcbde2)

These results are with a patch (9750231076b2f2a40497876ac5e6f6eb36bcbde2) applied that works around a deadlock which occurs when there are as many rustc instances as cpu cores, due to the jobserver implicit token not being accounted for. As a side effect, the fix causes more threads to be spawned than there are cores, which causes contention. This may have negatively affected the results a bit.

Benchmark 1: RUSTFLAGS='' cargo build
  Time (mean ± σ):     17.491 s ±  0.200 s    [User: 49.201 s, System: 5.063 s]
  Range (min … max):   17.255 s … 17.840 s    10 runs
 
Benchmark 2: "/home/bjorn/Projects/cg_clif2/build/cargo-clif" "build"
  Time (mean ± σ):     12.362 s ±  0.067 s    [User: 30.262 s, System: 4.344 s]
  Range (min … max):   12.270 s … 12.486 s    10 runs
 
Summary
  '"/home/bjorn/Projects/cg_clif2/build/cargo-clif" "build"' ran
    1.41 ± 0.02 times faster than 'RUSTFLAGS='' cargo build'
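The summaries above compare cg_clif against cg_llvm within a single run. To see how much cg_clif itself gained, the before and after mean wall times can be divided directly. The small sketch below does that arithmetic with the cargo-clif means copied from the output above; the `speedup` helper is only an illustration. The before and after runs are separate hyperfine invocations on different commits, so the ratios are approximate.

```rust
/// Ratio of two mean wall times; values above 1.0 mean "after" is faster.
fn speedup(before_s: f64, after_s: f64) -> f64 {
    before_s / after_s
}

fn main() {
    // Mean wall times of the cargo-clif benchmark, in seconds.
    let dev_desktop = speedup(8.179, 6.258);
    let laptop = speedup(14.986, 12.362);

    println!("dev-desktop: cg_clif is {:.2}x faster after the change", dev_desktop);
    println!("laptop:      cg_clif is {:.2}x faster after the change", laptop);
}
```

This gives roughly 1.31x on dev-desktop and 1.21x on the laptop.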

bjorn3 avatar Aug 13 '22 19:08 bjorn3

Wall time results for rustc-perf as of 08f5c99170712d639b15d9ba7d75ec90881b6964 (which is part of a wip branch with parallel comp enabled). The results are almost entirely positive now with parallel compilation:

[image: rustc-perf wall time results]

bjorn3 avatar Aug 19 '22 16:08 bjorn3

The main thing I'm trying to figure out right now is how to make jobserver behave. If I ignore the implicit jobserver token, I get deadlocks when as many rustc instances are spawned as there are cpu cores. If I give up on the implicit token, too many rustc processes get spawned at the same time. I have to find some way to not give up on the implicit token but still use it to get work done.
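To make the trade-off concrete, here is a simplified in-process model of the token accounting. It is an illustration only, not the jobserver protocol itself and not what cg_clif does: `TokenPool` and `run_jobs` are hypothetical names, and the "work" is a trivial squaring. The idea shown is that a process always owns one implicit token, so when no extra token can be acquired it can still make progress by doing the work on the thread that holds the implicit token, instead of blocking (deadlock) or handing the token back (oversubscription).

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Simplified model of the extra tokens shared between processes.
/// The implicit token of the current process is NOT stored here.
struct TokenPool {
    available: Mutex<usize>,
}

impl TokenPool {
    fn new(tokens: usize) -> Self {
        TokenPool { available: Mutex::new(tokens) }
    }

    /// Non-blocking acquire: returns false instead of waiting.
    fn try_acquire(&self) -> bool {
        let mut n = self.available.lock().unwrap();
        if *n > 0 {
            *n -= 1;
            true
        } else {
            false
        }
    }

    fn release(&self) {
        *self.available.lock().unwrap() += 1;
    }
}

/// Runs every job, returning (sum of results, number of jobs run inline).
/// Extra threads are only spawned while extra tokens are available.
fn run_jobs(pool: &Arc<TokenPool>, jobs: Vec<u64>) -> (u64, usize) {
    let mut handles = Vec::new();
    let mut total = 0;
    let mut ran_inline = 0;
    for job in jobs {
        if pool.try_acquire() {
            let pool = Arc::clone(pool);
            handles.push(thread::spawn(move || {
                let result = job * job;
                pool.release(); // hand the extra token back when done
                result
            }));
        } else {
            // No extra token: use the implicit token by working right here.
            total += job * job;
            ran_inline += 1;
        }
    }
    for handle in handles {
        total += handle.join().expect("worker panicked");
    }
    (total, ran_inline)
}

fn main() {
    // All extra tokens are taken by other processes: still no deadlock.
    let busy = Arc::new(TokenPool::new(0));
    println!("{:?}", run_jobs(&busy, vec![1, 2, 3]));
}
```

In the real compiler the main thread also has to keep producing work for the background threads, which is what makes this harder than the model suggests.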

bjorn3 avatar Aug 19 '22 16:08 bjorn3