rustc_codegen_cranelift
Improve stack optimization pass
- [x] Fold `stack_addr` into `load`/`store`.
- [x] Remove `stack_{addr,load}` with unused return value.
- [ ] Perform store to load forwarding when `stack_addr` is not used on a stack slot.
  - [x] Single ebb store to load forwarding
  - [x] Cross ebb store to load forwarding
  - [ ] Store to load forwarding when there are multiple stores, but one is always after the others and before the load
  - [ ] Store to load forwarding with phis
- [ ] Remove redundant `stack_store`. (no `stack_load` between current and next `stack_store` and no `stack_addr` before current `stack_store`)
- [ ] Fold `stack_load` into `bitcast` when `stack_load` is only used by that `bitcast`.
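For illustration, the single-block store to load forwarding item could look roughly like the sketch below. It is not the cg_clif implementation: the `Inst` enum, the `Copy` pseudo-instruction and `forward_stores_in_block` are hypothetical names for a toy IR, values are assumed to be in SSA form, and every store and load is assumed to cover the whole slot with the same type (no offsets).

```rust
use std::collections::{HashMap, HashSet};

/// Toy IR loosely named after the instructions in the checklist.
/// Hypothetical: this is not Cranelift's real data structure.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    /// dst = iconst imm
    Const { dst: u32, imm: i64 },
    /// stack_store src, slot
    StackStore { slot: u32, src: u32 },
    /// dst = stack_load slot
    StackLoad { dst: u32, slot: u32 },
    /// dst = stack_addr slot (the address of the slot escapes)
    StackAddr { dst: u32, slot: u32 },
    /// dst = src; stands in for "use the forwarded value directly"
    Copy { dst: u32, src: u32 },
}

/// Forward stored values to later loads of the same slot within one block.
/// Slots that have a `stack_addr` anywhere in the block are skipped, because
/// the slot could then be read or written through the pointer.
fn forward_stores_in_block(block: &[Inst]) -> Vec<Inst> {
    let escaped: HashSet<u32> = block
        .iter()
        .filter_map(|inst| match inst {
            Inst::StackAddr { slot, .. } => Some(*slot),
            _ => None,
        })
        .collect();

    // Maps a slot to the value most recently stored into it.
    let mut last_store: HashMap<u32, u32> = HashMap::new();
    let mut out = Vec::with_capacity(block.len());

    for inst in block {
        match inst {
            Inst::StackStore { slot, src } => {
                if !escaped.contains(slot) {
                    last_store.insert(*slot, *src);
                }
                // The store itself is kept; removing redundant stores is a separate item.
                out.push(inst.clone());
            }
            Inst::StackLoad { dst, slot } => match last_store.get(slot) {
                // A store to this slot was seen earlier in the block: reuse its value.
                Some(src) => out.push(Inst::Copy { dst: *dst, src: *src }),
                // No known store in this block: keep the load.
                None => out.push(inst.clone()),
            },
            _ => out.push(inst.clone()),
        }
    }
    out
}

fn main() {
    let block = vec![
        Inst::Const { dst: 0, imm: 7 },
        Inst::StackStore { slot: 0, src: 0 },
        Inst::StackLoad { dst: 1, slot: 0 }, // forwarded
        Inst::StackAddr { dst: 2, slot: 1 },
        Inst::StackStore { slot: 1, src: 0 },
        Inst::StackLoad { dst: 3, slot: 1 }, // kept: slot 1 has a stack_addr
    ];
    for inst in forward_stores_in_block(&block) {
        println!("{:?}", inst);
    }
}
```

The sketch treats a `stack_addr` anywhere in the block as disqualifying the slot, which is more conservative than strictly necessary. Cross-block forwarding and phis would need dominance information on top of this.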
WIP implementation at https://github.com/bjorn3/rustc_codegen_cranelift/tree/opt_stack2reg
Edit: merged
Perf as of bc1db13a0216d41c1e9ce12d6d64e2c9bb979f66:
Benchmark #1: simple-raytracer/raytracer_cg_clif
Time (mean ± σ): 8.744 s ± 0.240 s [User: 8.625 s, System: 0.050 s]
Range (min … max): 8.354 s … 9.172 s 20 runs
Benchmark #2: simple-raytracer/raytracer_cg_clif_no_opt
Time (mean ± σ): 9.148 s ± 0.181 s [User: 9.020 s, System: 0.052 s]
Range (min … max): 8.823 s … 9.420 s 20 runs
Benchmark #3: simple-raytracer/raytracer_cg_llvm
Time (mean ± σ): 6.501 s ± 0.043 s [User: 6.463 s, System: 0.014 s]
Range (min … max): 6.470 s … 6.639 s 20 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
'simple-raytracer/raytracer_cg_llvm' ran
1.34 ± 0.04 times faster than 'simple-raytracer/raytracer_cg_clif'
1.41 ± 0.03 times faster than 'simple-raytracer/raytracer_cg_clif_no_opt'
Perf as of 5cb24bca75d2f7b0a00553a7cda45315c8d6502a (with all println! and the format! in optimize/mod.rs removed):
Benchmark #1: ../cargo.sh build
Time (mean ± σ): 16.020 s ± 0.597 s [User: 27.250 s, System: 4.139 s]
Range (min … max): 15.318 s … 17.193 s 10 runs
Benchmark #2: RUSTFLAGS='' cargo build --target x86_64-apple-darwin
Time (mean ± σ): 14.638 s ± 0.807 s [User: 44.183 s, System: 3.487 s]
Range (min … max): 13.691 s … 15.964 s 10 runs
Summary
'RUSTFLAGS='' cargo build --target x86_64-apple-darwin' ran
1.09 ± 0.07 times faster than '../cargo.sh build'
[BENCH RUN] ebobby/simple-raytracer
Benchmark #1: ./raytracer_cg_clif_no_opt
Time (mean ± σ): 8.866 s ± 0.213 s [User: 8.787 s, System: 0.032 s]
Range (min … max): 8.538 s … 9.277 s 20 runs
Benchmark #2: ./raytracer_cg_clif_stack_opt
Time (mean ± σ): 7.337 s ± 0.277 s [User: 7.292 s, System: 0.020 s]
Range (min … max): 7.251 s … 8.514 s 20 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark #3: ./raytracer_cg_clif_stack_opt_scalar_pair_copy_split
Time (mean ± σ): 6.725 s ± 0.188 s [User: 6.623 s, System: 0.037 s]
Range (min … max): 6.438 s … 6.993 s 20 runs
Benchmark #4: ./raytracer_cg_llvm
Time (mean ± σ): 6.638 s ± 0.140 s [User: 6.568 s, System: 0.030 s]
Range (min … max): 6.424 s … 6.949 s 20 runs
Summary
'./raytracer_cg_llvm' ran
1.01 ± 0.04 times faster than './raytracer_cg_clif_stack_opt_scalar_pair_copy_split'
1.11 ± 0.05 times faster than './raytracer_cg_clif_stack_opt'
1.34 ± 0.04 times faster than './raytracer_cg_clif_no_opt'
With the scalar pair copy split, cg_clif is almost as fast as LLVM in debug mode, even though cg_clif performs no inlining.
#853 has been merged. I will leave this issue open to track improvements.
Note to self: watch https://youtube.be/9OIA7DTFQWU again. One of the things @sunfishcode talks about is how memory optimizations could be done.
Copying from #856:
https://llvm.org/docs/MemorySSA.html https://www.airs.com/dnovillo/Papers/mem-ssa.pdf
a793be8ee8895538e99acc2a855d9c4ae145fc78 removed the old, broken (#1142) stack optimization pass. It will need to be rewritten from scratch at some point.