rust-gpu
Inliner->`mem2reg` can quadratically amplify already-exponential inlining (turning seconds into minutes).
We've known for a while that the existing inliner+`mem2reg` setup can lead to long compile times (e.g. #851), but we've never had a simple stress test to measure that behavior (and e.g. validate potential solutions against) - only full projects.
Thankfully, @schell published a `crabslab` crate with a `#[derive(...)]` proc macro which generates what is effectively a "shallow deserialization from GPU buffer" method, and I was able to turn that into a benchmark:
| pass (timings in seconds) | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(3.3ⁿ) | 0.664 | 0.965 | 2.135 | 9.124 | 46.472 | 247.335 |
| post-inline mem2reg | O(5.4ⁿ) | 0.054 | 0.245 | 1.173 | 7.584 | 43.371 | 239.904 |
| spirv-opt | O(2.2ⁿ) | 0.081 | 0.169 | 0.351 | 0.767 | 1.783 | 4.397 |
| inline | O(3.4ⁿ) | 0 | 0.007 | 0.020 | 0.067 | 0.234 | 0.959 |
| SPIR-V -> SPIR-T | O(2ⁿ) | 0.005 | 0.008 | 0.014 | 0.032 | 0.071 | 0.167 |
If you'll forgive the very rudimentary curve-fitting (and the row names, which I've simplified from their `-Z time-passes` originals), you should be able to notice two trends:
- ~2ⁿ: the amount of SPIR-V generated (as observed by `SPIR-V -> SPIR-T` and `spirv-opt`)
  - this is intended for this test: there should be 2ⁿ leaf calls generated and inlined
  - the inliner itself should also fit here but it's not bottom-up so presumably has extra inefficiencies
    - while working on the fix, I saw the amount of debuginfo generated, that is likely a lot of the cost
- \>4ⁿ: post-inline `mem2reg` is at least (2ⁿ)², i.e. quadratic (or worse) in the amount of SPIR-V
  - we more or less knew this, but this test is simple enough that it shouldn't have any `mem2reg` work left!
What happened? We forgot to switch the inliner over to `OpPhi`s for its return value dataflow, so to this day it generates `OpVariable`s (w/ `OpStore`s replacing callee returns, and `OpLoad`s at the call site):
https://github.com/EmbarkStudios/rust-gpu/blob/8678d58d61a78f01201ec854cb5e3835c014fa3b/crates/rustc_codegen_spirv/src/linker/inline.rs#L658-L664
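To make the difference concrete, here's a hedged sketch of the two return-value dataflow strategies, in hand-written SPIR-V-style pseudocode (not actual compiler output; all `%` names are made up):

```
; Today: inlining `%r = OpFunctionCall %f32 %callee` introduces a variable.
%ret_var = OpVariable %_ptr_Function_f32 Function
; ... inlined callee body ...
          OpStore %ret_var %result   ; replaces the callee's OpReturnValue
          OpBranch %after_call
%after_call = OpLabel
%r =      OpLoad %f32 %ret_var       ; call site reads the "returned" value

; With the OpPhi fix: no variable, so no mem2reg work left to do.
%after_call = OpLabel
%r =      OpPhi %f32 %result %callee_exit
```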
A quick hacky test (using `OpUndef`), on two known projects, ended up making `mem2reg`:
- 15x faster (~`730s` -> ~`50s`) for @hatoo's `rene` - see also #851
  - EDIT: initially this said "3x" but that was an invalid comparison and Rust-GPU `main` is really slow
- 30x faster (~`150s` -> ~`5s`) on @schell's more recent `renderling` (at https://github.com/schell/renderling/commit/d9f4d6fa485b238bc00ba01a6e1c7119319b5cbd)
(that is, if we fix this bug, it could bring some projects from minutes to seconds - for them, `mem2reg` was spinning its wheels that entire time, due to those `OpVariable`s generated by the inliner, instead of actually helping)
Since this is caused by the inliner itself, and we have to force-inline calls taking pointers into buffers (due to SPIR-V not allowing them to be passed to calls), I repro'd with just `#[derive(Clone)]` too:
| pass (timings in seconds) | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(1.7ⁿ) | 0.543 | 0.567 | 0.625 | 0.875 | 1.952 | 7.683 |
| post-inline mem2reg | O(4.8ⁿ) | 0 | 0.013 | 0.059 | 0.264 | 1.225 | 6.695 |
| spirv-opt | O(1.9ⁿ) | 0.009 | 0.012 | 0.022 | 0.046 | 0.096 | 0.204 |
| inline | O(3ⁿ) | 0 | 0 | 0 | 0.009 | 0.024 | 0.080 |
| SPIR-V -> SPIR-T | O(1.7ⁿ) | 0.003 | 0.004 | 0.007 | 0.010 | 0.019 | 0.047 |
That one is fast enough that it deserved more columns, but I'm not messing with `jq`/sorting any further.
There is, however, a very compact testcase that can be generated from it:
```rust
#[derive(Clone)]
pub struct D<T>(T, T);
type D4<T> = D<D<D<D<T>>>>;
type D12<T> = D4<D4<D4<T>>>;

#[spirv(fragment)]
pub fn fs_main(
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] buf: &D12<f32>,
    out: &mut D12<f32>,
) {
    *out = buf.clone();
}
```
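The testcase is compact because each `D<T>` level doubles the payload, so `D12<f32>` holds 2¹² = 4096 floats (and the derived `clone` has to touch every one of them, post-inlining). A quick plain-Rust check of that doubling, outside rust-gpu (no `#[spirv(...)]` attributes needed):

```rust
// Same type nesting as the shader testcase, minus the spirv attributes:
// each `D<T>` level doubles the payload size.
#[derive(Clone)]
pub struct D<T>(T, T);
type D4<T> = D<D<D<D<T>>>>;
type D12<T> = D4<D4<D4<T>>>;

fn main() {
    // 2^12 f32 leaves at 4 bytes each, with no padding.
    assert_eq!(std::mem::size_of::<D12<f32>>(), 4096 * 4);
    println!("D12<f32> is {} bytes", std::mem::size_of::<D12<f32>>());
    // prints "D12<f32> is 16384 bytes"
}
```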
- on `main`, it takes `692s` (~11.5min) in `mem2reg`, and ~`11.7s` everywhere else
- with the local hacky workaround, it's down to ~`6.2s` in total
  - alright, that should be impossible, even the inlining is faster, the hack is doing too much
  - then again, variables do require weirder handling, and the inliner isn't bottom-up, so maybe?
  - either way, anywhere between 6 and 12 seconds should be possible with the true `OpPhi` fix
And if a 100x speedup isn't impressive enough (or 11-12 minutes not slow enough for a CI timeout), you can always bump it further: a `type D13<T> = D<D12<T>>;` should still take less than a minute once fixed, but anywhere from 45 minutes to a whole hour on `main` (I am not further delaying this issue just to prove that, though).
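That "45 minutes to an hour" estimate follows from the quadratic behavior measured above - a back-of-the-envelope sketch (my extrapolation, not a measurement):

```rust
// One more `D` level doubles the amount of SPIR-V, and a mem2reg that's
// quadratic in SPIR-V size then costs ~2² = 4x as much.
fn main() {
    let mem2reg_d12_secs = 692.0_f64; // D12 time in mem2reg on `main`, from above
    let predicted_d13_secs = mem2reg_d12_secs * 4.0;
    println!("predicted D13 mem2reg: ~{:.0} min", predicted_d13_secs / 60.0);
    // ~46 min - the low end of the "45 minutes to a whole hour" range;
    // the upper end covers the "or worse" part of "quadratic (or worse)".
}
```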