rust-gpu
Inliner->`mem2reg` can quadratically amplify already-exponential inlining (turning seconds into minutes).
We've known for a while that the existing inliner+`mem2reg` setup can lead to long compile times (e.g. #851), but we've never had a simple stress test to measure that behavior (and e.g. validate potential solutions against) - only full projects.
Thankfully, @schell published a `crabslab` crate with a `#[derive(...)]` proc macro which generates what is effectively a "shallow deserialization from GPU buffer" method, and I was able to turn that into a benchmark:
| pass (timings in seconds) | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(3.3ⁿ) | 0.664 | 0.965 | 2.135 | 9.124 | 46.472 | 247.335 |
| post-inline mem2reg | O(5.4ⁿ) | 0.054 | 0.245 | 1.173 | 7.584 | 43.371 | 239.904 |
| spirv-opt | O(2.2ⁿ) | 0.081 | 0.169 | 0.351 | 0.767 | 1.783 | 4.397 |
| inline | O(3.4ⁿ) | 0 | 0.007 | 0.020 | 0.067 | 0.234 | 0.959 |
| SPIR-V -> SPIR-T | O(2ⁿ) | 0.005 | 0.008 | 0.014 | 0.032 | 0.071 | 0.167 |
If you'll forgive the very rudimentary curve-fitting (and the row names, which I've simplified from their `-Z time-passes` originals), you should be able to notice two trends:
- ~2ⁿ: the amount of SPIR-V generated (as observed by `SPIR-V -> SPIR-T` and `spirv-opt`)
  - this is intended for this test: there should be 2ⁿ leaf calls generated and inlined
  - the inliner itself should also fit here but it's not bottom-up so presumably has extra inefficiencies
    - while working on the fix, I saw the amount of debuginfo generated, that is likely a lot of the cost
- \>4ⁿ: post-inline `mem2reg` is at least (2ⁿ)², i.e. quadratic (or worse) in the amount of SPIR-V
  - we more or less knew this, but this test is simple enough that it shouldn't have any `mem2reg` work left!
What happened? We forgot to switch the inliner over to `OpPhi`s for its return value dataflow, so to this day it generates `OpVariable`s (w/ `OpStore`s replacing callee returns, and `OpLoad`s at the call site):
https://github.com/EmbarkStudios/rust-gpu/blob/8678d58d61a78f01201ec854cb5e3835c014fa3b/crates/rustc_codegen_spirv/src/linker/inline.rs#L658-L664
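To make the difference concrete, here's a hedged sketch of the two return-value dataflow strategies, in hand-written SPIR-V-style pseudocode (not actual compiler output; all `%` names are made up):

```
; Today: inlining `%r = OpFunctionCall %f32 %callee` introduces a variable.
%ret_var = OpVariable %_ptr_Function_f32 Function
; ... inlined callee body ...
          OpStore %ret_var %result   ; replaces the callee's OpReturnValue
          OpBranch %after_call
%after_call = OpLabel
%r =      OpLoad %f32 %ret_var       ; call site reads the "returned" value

; With the OpPhi fix: no variable, so no mem2reg work left to do.
%after_call = OpLabel
%r =      OpPhi %f32 %result %callee_exit
```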
A quick hacky test (using `OpUndef`), on two known projects, ended up making `mem2reg`:
- 15x faster (~`730s` -> ~`50s`) for @hatoo's `rene` - see also #851
  - EDIT: initially this said "3x" but that was an invalid comparison and Rust-GPU `main` is really slow
- 30x faster (~`150s` -> ~`5s`) on @schell's more recent `renderling` (at https://github.com/schell/renderling/commit/d9f4d6fa485b238bc00ba01a6e1c7119319b5cbd)
(that is, if we fix this bug, it could bring some projects from minutes to seconds - for them, `mem2reg` was spinning its wheels that entire time, due to those `OpVariable`s generated by the inliner, instead of actually helping)
Since this is caused by the inliner itself, and we have to force-inline calls taking pointers into buffers (due to SPIR-V not allowing them to be passed to calls), I repro'd with just `#[derive(Clone)]` too:
| pass (timings in seconds) | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(1.7ⁿ) | 0.543 | 0.567 | 0.625 | 0.875 | 1.952 | 7.683 |
| post-inline mem2reg | O(4.8ⁿ) | 0 | 0.013 | 0.059 | 0.264 | 1.225 | 6.695 |
| spirv-opt | O(1.9ⁿ) | 0.009 | 0.012 | 0.022 | 0.046 | 0.096 | 0.204 |
| inline | O(3ⁿ) | 0 | 0 | 0 | 0.009 | 0.024 | 0.080 |
| SPIR-V -> SPIR-T | O(1.7ⁿ) | 0.003 | 0.004 | 0.007 | 0.010 | 0.019 | 0.047 |
That one is fast enough that it deserved more columns, but I'm not messing with `jq`/sorting any further.
There is, however, a very compact testcase that can be generated from it:
```rust
#[derive(Clone)]
pub struct D<T>(T, T);
type D4<T> = D<D<D<D<T>>>>;
type D12<T> = D4<D4<D4<T>>>;

#[spirv(fragment)]
pub fn fs_main(
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] buf: &D12<f32>,
    out: &mut D12<f32>,
) {
    *out = buf.clone();
}
```
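The testcase is compact because each `D<T>` level doubles the payload, so `D12<f32>` holds 2¹² = 4096 floats (and the derived `clone` has to touch every one of them, post-inlining). A quick plain-Rust check of that doubling, outside rust-gpu (no `#[spirv(...)]` attributes needed):

```rust
// Same type nesting as the shader testcase, minus the spirv attributes:
// each `D<T>` level doubles the payload size.
#[derive(Clone)]
pub struct D<T>(T, T);
type D4<T> = D<D<D<D<T>>>>;
type D12<T> = D4<D4<D4<T>>>;

fn main() {
    // 2^12 f32 leaves at 4 bytes each, with no padding.
    assert_eq!(std::mem::size_of::<D12<f32>>(), 4096 * 4);
    println!("D12<f32> is {} bytes", std::mem::size_of::<D12<f32>>());
    // prints "D12<f32> is 16384 bytes"
}
```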
- on `main`, it takes `692s` (~11.5min) in `mem2reg`, and ~`11.7s` everywhere else
- with the local hacky workaround, it's down to ~`6.2s` in total
  - alright, that should be impossible, even the inlining is faster, the hack is doing too much
  - then again, variables do require weirder handling, and the inliner isn't bottom-up, so maybe?
  - either way, anywhere between 6 and 12 seconds should be possible with the true `OpPhi` fix
And if a 100x speedup isn't impressive enough (or 11-12 minutes not slow enough for a CI timeout), you can always bump it further: a `type D13<T> = D<D12<T>>;` should still take less than a minute once fixed, but anywhere from 45 minutes to a whole hour on `main` (I am not further delaying this issue just to prove that, though).
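That "45 minutes to an hour" estimate follows from the quadratic behavior measured above - a back-of-the-envelope sketch (my extrapolation, not a measurement):

```rust
// One more `D` level doubles the amount of SPIR-V, and a mem2reg that's
// quadratic in SPIR-V size then costs ~2² = 4x as much.
fn main() {
    let mem2reg_d12_secs = 692.0_f64; // D12 time in mem2reg on `main`, from above
    let predicted_d13_secs = mem2reg_d12_secs * 4.0;
    println!("predicted D13 mem2reg: ~{:.0} min", predicted_d13_secs / 60.0);
    // ~46 min - the low end of the "45 minutes to a whole hour" range;
    // the upper end covers the "or worse" part of "quadratic (or worse)".
}
```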