rustc_codegen_cranelift

Improve stack optimization pass

Open bjorn3 opened this issue 5 years ago • 6 comments

  • [x] Fold stack_addr into load/store.
  • [x] Remove stack_{addr,load} with unused return value.
  • [ ] Perform store to load forwarding when stack_addr is not used on a stack slot.
    • [x] Single ebb store to load forwarding
    • [x] Cross ebb store to load forwarding
    • [ ] Store to load forwarding when there are multiple stores, but one of them is always after the others and before the load
    • [ ] Store to load forwarding with phis
  • [ ] Remove redundant stack_store. (no stack_load between current and next stack_store and no stack_addr before current stack_store)
  • [ ] Fold stack_load into bitcast when stack_load is only used by that bitcast.
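
To make two of the checklist items concrete, here is a minimal sketch of single-block store to load forwarding and of removing a stack_store that is overwritten before any load. It runs on a toy IR, not on Cranelift: the `Inst` enum and the functions `address_taken`, `forward_stores` and `remove_overwritten_stores` are hypothetical names made up for illustration. It assumes every access covers the whole slot with the same type, and it treats a slot as untouchable if a stack_addr for it appears anywhere in the block, which is more conservative than the conditions in the list above.

```rust
use std::collections::{HashMap, HashSet};

/// Toy IR. Slots and values are plain integers (hypothetical, not Cranelift types).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Inst {
    /// Write value `val` to stack slot `slot`.
    StackStore { slot: u32, val: u32 },
    /// Read stack slot `slot` into the new value `dst`.
    StackLoad { dst: u32, slot: u32 },
    /// Take the address of `slot`; it may now be accessed indirectly.
    StackAddr { dst: u32, slot: u32 },
    /// Any other instruction that uses value `arg`.
    Use { arg: u32 },
}

/// Slots whose address is taken anywhere in the block.
fn address_taken(block: &[Inst]) -> HashSet<u32> {
    block
        .iter()
        .filter_map(|inst| match inst {
            Inst::StackAddr { slot, .. } => Some(*slot),
            _ => None,
        })
        .collect()
}

/// Replace loads from a slot with the value most recently stored to it.
fn forward_stores(block: &[Inst]) -> Vec<Inst> {
    let escaped = address_taken(block);
    let mut last_store: HashMap<u32, u32> = HashMap::new(); // slot -> stored value
    let mut alias: HashMap<u32, u32> = HashMap::new(); // removed load result -> value
    let mut out = Vec::new();
    for inst in block {
        match *inst {
            Inst::StackStore { slot, val } => {
                // Resolve the stored value first, in case it came from a removed load.
                let val = *alias.get(&val).unwrap_or(&val);
                last_store.insert(slot, val);
                out.push(Inst::StackStore { slot, val });
            }
            Inst::StackLoad { dst, slot } => match last_store.get(&slot) {
                // Known value and no address taken: drop the load, remember the alias.
                Some(&val) if !escaped.contains(&slot) => {
                    alias.insert(dst, val);
                }
                _ => out.push(*inst),
            },
            Inst::Use { arg } => out.push(Inst::Use {
                arg: *alias.get(&arg).unwrap_or(&arg),
            }),
            Inst::StackAddr { .. } => out.push(*inst),
        }
    }
    out
}

/// Drop a store when a later store to the same slot follows with no load in between.
fn remove_overwritten_stores(block: &[Inst]) -> Vec<Inst> {
    let escaped = address_taken(block);
    let mut overwritten: HashSet<u32> = HashSet::new();
    let mut out = Vec::new();
    // Scan backwards so that later stores are seen first.
    for inst in block.iter().rev() {
        match *inst {
            Inst::StackStore { slot, .. } if !escaped.contains(&slot) => {
                // `insert` returns false if a later store to this slot is already recorded.
                if overwritten.insert(slot) {
                    out.push(*inst);
                }
            }
            Inst::StackLoad { slot, .. } => {
                // A load observes the slot, so earlier stores are live again.
                overwritten.remove(&slot);
                out.push(*inst);
            }
            _ => out.push(*inst),
        }
    }
    out.reverse();
    out
}
```

Running `forward_stores` first and `remove_overwritten_stores` second matters: forwarding removes loads, which in turn exposes stores that nothing reads before they are overwritten. The last store to a slot in a block is always kept, because this sketch knows nothing about later blocks.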

WIP implementation at https://github.com/bjorn3/rustc_codegen_cranelift/tree/opt_stack2reg (Edit: merged)

bjorn3 • Dec 26 '19 13:12

Perf as of bc1db13a0216d41c1e9ce12d6d64e2c9bb979f66:

Benchmark #1: simple-raytracer/raytracer_cg_clif
  Time (mean ± σ):      8.744 s ±  0.240 s    [User: 8.625 s, System: 0.050 s]
  Range (min … max):    8.354 s …  9.172 s    20 runs
 
Benchmark #2: simple-raytracer/raytracer_cg_clif_no_opt
  Time (mean ± σ):      9.148 s ±  0.181 s    [User: 9.020 s, System: 0.052 s]
  Range (min … max):    8.823 s …  9.420 s    20 runs
 
Benchmark #3: simple-raytracer/raytracer_cg_llvm
  Time (mean ± σ):      6.501 s ±  0.043 s    [User: 6.463 s, System: 0.014 s]
  Range (min … max):    6.470 s …  6.639 s    20 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  'simple-raytracer/raytracer_cg_llvm' ran
    1.34 ± 0.04 times faster than 'simple-raytracer/raytracer_cg_clif'
    1.41 ± 0.03 times faster than 'simple-raytracer/raytracer_cg_clif_no_opt'
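
The "times faster" figures in the summary are consistent with dividing the two means and combining the relative standard deviations. The sketch below assumes first-order error propagation for a quotient; the function name `speed_ratio` is made up for illustration, and the formula is inferred from the printed numbers rather than taken from the benchmark tool's source.

```rust
/// Ratio of two mean times with a propagated uncertainty.
/// Each argument is a `(mean, standard deviation)` pair in seconds.
/// Assumption: relative errors combine in quadrature, as for a quotient
/// of two independent measurements.
fn speed_ratio(slow: (f64, f64), fast: (f64, f64)) -> (f64, f64) {
    let ratio = slow.0 / fast.0;
    let relative_error = ((slow.1 / slow.0).powi(2) + (fast.1 / fast.0).powi(2)).sqrt();
    (ratio, ratio * relative_error)
}
```

Calling `speed_ratio((9.148, 0.181), (6.501, 0.043))` with the raytracer_cg_clif_no_opt and raytracer_cg_llvm rows gives values that round to the 1.41 ± 0.03 shown in the summary. The inputs here are the already rounded figures from the output, so the last digit can differ from what the tool prints.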

bjorn3 • Dec 28 '19 12:12

Perf as of 5cb24bca75d2f7b0a00553a7cda45315c8d6502a (with all println! and the format! in optimize/mod.rs removed):

Benchmark #1: ../cargo.sh build
  Time (mean ± σ):     16.020 s ±  0.597 s    [User: 27.250 s, System: 4.139 s]
  Range (min … max):   15.318 s … 17.193 s    10 runs
 
Benchmark #2: RUSTFLAGS='' cargo build --target x86_64-apple-darwin
  Time (mean ± σ):     14.638 s ±  0.807 s    [User: 44.183 s, System: 3.487 s]
  Range (min … max):   13.691 s … 15.964 s    10 runs
 
Summary
  'RUSTFLAGS='' cargo build --target x86_64-apple-darwin' ran
    1.09 ± 0.07 times faster than '../cargo.sh build'
[BENCH RUN] ebobby/simple-raytracer
Benchmark #1: ./raytracer_cg_clif_no_opt
  Time (mean ± σ):      8.866 s ±  0.213 s    [User: 8.787 s, System: 0.032 s]
  Range (min … max):    8.538 s …  9.277 s    20 runs
 
Benchmark #2: ./raytracer_cg_clif_stack_opt
  Time (mean ± σ):      7.337 s ±  0.277 s    [User: 7.292 s, System: 0.020 s]
  Range (min … max):    7.251 s …  8.514 s    20 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark #3: ./raytracer_cg_clif_stack_opt_scalar_pair_copy_split
  Time (mean ± σ):      6.725 s ±  0.188 s    [User: 6.623 s, System: 0.037 s]
  Range (min … max):    6.438 s …  6.993 s    20 runs
 
Benchmark #4: ./raytracer_cg_llvm
  Time (mean ± σ):      6.638 s ±  0.140 s    [User: 6.568 s, System: 0.030 s]
  Range (min … max):    6.424 s …  6.949 s    20 runs
 
Summary
  './raytracer_cg_llvm' ran
    1.01 ± 0.04 times faster than './raytracer_cg_clif_stack_opt_scalar_pair_copy_split'
    1.11 ± 0.05 times faster than './raytracer_cg_clif_stack_opt'
    1.34 ± 0.04 times faster than './raytracer_cg_clif_no_opt'

This is almost as fast as LLVM in debug mode, even though cg_clif does not perform any inlining.

bjorn3 • Dec 31 '19 12:12

#853 has been merged. I will leave this issue open to track improvements.

bjorn3 • Jan 04 '20 11:01

Note to self: watch https://youtu.be/9OIA7DTFQWU again. One of the things @sunfishcode talks about is how memory optimizations could be done.

bjorn3 • Jan 11 '20 21:01

Copying from #856:

https://llvm.org/docs/MemorySSA.html
https://www.airs.com/dnovillo/Papers/mem-ssa.pdf

bjorn3 • Mar 31 '21 10:03

a793be8ee8895538e99acc2a855d9c4ae145fc78 removed the old broken (#1142) stack optimization pass. It will need to be written from scratch at some point.

bjorn3 • Mar 31 '21 10:03