multiversion
Dispatch to native vector width
This is a continuation of the "dynamic dispatch" discussion in https://github.com/rust-lang/portable-simd/issues/218, which we agreed belongs here instead.
Basically, I think the current multiversion API has two unfortunate limitations:
- The user needs to provide an explicit list of targeted hardware features, which means the code will only deliver portable performance on hardware that the author thought about. It would be desirable to have a shortcut that targets all features of all known hardware, for "high-level" scenarios where the compile-time cost of doing so is bearable.
- No access is provided to the native vector width (which I would define as the width at which the hardware performs best on basic vertical ALU operations like ADD or MUL), which is a problem when targeting explicit SIMD APIs like core::simd.
The second issue is closely related to #28, and should perhaps be merged there by adding a few points to the existing discussion:
- The native vector width can depend on the element data type. For example, x86 CPUs with AVX but not AVX2 (10% of the Steam survey as of today) have 256-bit floating-point operations but 128-bit integer operations. This calls for an API that is parametrized by the desired SIMD element type, e.g. fn native_width<T>() -> usize.
- The native vector width may not be the widest supported vector width. It is relatively common for hardware to emulate wide vector operations using narrow ALUs (think low-end Intel AVX-512 CPUs, pre-Zen 2 AMD AVX implementations, 128-bit NEON on ARM hardware with 64-bit ALUs...), and such hardware performs best when the narrow ALUs are programmed directly. Rigorously speaking, this can only be probed at runtime through microbenchmarks or a hardware database indexed by CPU ID, but some heuristics might provide a "good enough" approximation.
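To make the type dependence concrete, here is a rough sketch of what such a query could look like on x86-64, using runtime feature detection. The name native_width and the feature-to-width mapping are illustrative assumptions, not part of multiversion's API:

```rust
use std::any::TypeId;
use std::mem::size_of;

// Hypothetical per-element-type native width query (in lanes, not bytes).
// The feature-to-width heuristic below is illustrative only.
fn native_width<T: 'static>() -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        let is_float = TypeId::of::<T>() == TypeId::of::<f32>()
            || TypeId::of::<T>() == TypeId::of::<f64>();
        if is_x86_feature_detected!("avx2") {
            // 256-bit operations for both integers and floats
            return 32 / size_of::<T>();
        }
        if is_x86_feature_detected!("avx") && is_float {
            // AVX without AVX2: 256-bit floats, but integers stay 128-bit
            return 32 / size_of::<T>();
        }
        // SSE2 is part of the x86-64 baseline: 128-bit vectors
        return 16 / size_of::<T>();
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        1 // scalar fallback on architectures this sketch doesn't know about
    }
}

fn main() {
    println!("f32: {} lanes", native_width::<f32>());
    println!("i32: {} lanes", native_width::<i32>());
}
```

Note how an AVX-but-not-AVX2 machine reports 8 lanes for f32 but only 4 for i32, which is exactly the asymmetry described above.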
To address these issues, I would like multiversion to provide this sort of higher-level code generation:
// === INPUT CODE ===
#[multiversion::native_simd("f32 -> F32_WIDTH")]
fn f32_simd_alg<const F32_WIDTH: usize>(input: &[f32]) -> f32 { /* ... uses Simd<f32, F32_WIDTH> ... */ }
// === OUTPUT SEMANTICS ===
// User-facing API
//
// Note that this has two layers of dynamic dispatch, one via the match statement
// and one via the #[multiversion], that should ideally be merged into one. This
// is one motivation for proposing this as a multiversion feature rather than building
// it as a higher-level crate.
//
fn f32_simd_alg(input: &[f32]) -> f32 {
// native_simd_width is a query that returns the hardware's native vector width, if
// known. preferred_simd_width returns the compiler-configured "preferred
// vector width", if known. 4 is empirically a nearly universally supported f32 SIMD width.
match multiversion::native_simd_width::<f32>()
    .or(multiversion::preferred_simd_width::<f32>())
    .unwrap_or(4)
{
2 => f32_simd_alg_impl_dyn2(input),
4 => f32_simd_alg_impl_dyn4(input),
8 => f32_simd_alg_impl_dyn8(input),
16 => f32_simd_alg_impl_dyn16(input),
_ => f32_simd_alg_impl_dyn4(input),
}
}
// Concrete implementation for all possible vector widths
#[multiversion]
#[clone(/* ... all interesting hardware targets for Simd<f32, 2> ... */)]
fn f32_simd_alg_impl_dyn2(input: &[f32]) -> f32 { f32_simd_alg_impl::<2>(input) }
//
#[multiversion]
#[clone(/* ... all interesting hardware targets for Simd<f32, 4> ... */)]
fn f32_simd_alg_impl_dyn4(input: &[f32]) -> f32 { f32_simd_alg_impl::<4>(input) }
//
#[multiversion]
#[clone(/* ... all interesting hardware targets for Simd<f32, 8> ... */)]
fn f32_simd_alg_impl_dyn8(input: &[f32]) -> f32 { f32_simd_alg_impl::<8>(input) }
//
#[multiversion]
#[clone(/* ... all interesting hardware targets for Simd<f32, 16> ... */)]
fn f32_simd_alg_impl_dyn16(input: &[f32]) -> f32 { f32_simd_alg_impl::<16>(input) }
// Need inlining to make sure code is specialized for relevant hardware targets
//
// If inline(always) is still not enough, an alternative is to add a hidden and unused
// const generic tag that changes for each instantiation. This should preserve the
// benefit of minimizing code duplication, and thus compiler work, all the way down
// to generics monomorphization/LLVM IR.
//
#[inline(always)]
fn f32_simd_alg_impl<const F32_WIDTH: usize>(input: &[f32]) -> f32 { /* ... initially provided code ... */ }
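As a self-contained illustration of the intended expansion, here is a stable-Rust miniature of the dispatch-plus-monomorphization pattern above, with the width query stubbed out and plain slice chunking standing in for Simd<f32, W>. All names here are placeholders, not actual multiversion output:

```rust
// Stand-in for native_simd_width/preferred_simd_width: a real implementation
// would query the hardware or the compiler configuration.
fn query_f32_width() -> Option<usize> {
    None
}

// The width-generic kernel, corresponding to f32_simd_alg_impl above.
// A real version would use Simd<f32, W>; plain chunking stands in here.
#[inline(always)]
fn sum_impl<const W: usize>(input: &[f32]) -> f32 {
    input.chunks(W).map(|chunk| chunk.iter().sum::<f32>()).sum()
}

// User-facing entry point: one match arm per supported width,
// defaulting to the nearly universal width of 4.
fn sum(input: &[f32]) -> f32 {
    match query_f32_width().unwrap_or(4) {
        2 => sum_impl::<2>(input),
        8 => sum_impl::<8>(input),
        16 => sum_impl::<16>(input),
        _ => sum_impl::<4>(input),
    }
}

fn main() {
    println!("{}", sum(&[1.0, 2.0, 3.0, 4.0, 5.0]));
}
```

The real proposal additionally wraps each concrete-width function in #[multiversion]/#[clone(...)], which is the second dispatch layer the comment above suggests merging into the match.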
Since the preferred_simd_width part requires compiler support, I brought it up here: https://internals.rust-lang.org/t/querying-the-preferred-simd-width/15829 .
In the branch I'm working on, the following is possible:
#[multiversion(targets = "simd", selected_target = "TARGET")]
fn foo(x: &mut [f32]) {
    const WIDTH: usize = TARGET.suggested_simd_width::<f32>();
}
Thoughts?
This does look cleaner and more extensible in the future, at the cost of possibly delaying the time by which this is usable on stable (since this API requires the ability to subsequently instantiate Simd with an arbitrary const expression, which from my understanding is much further off in the stable future than stabilization of packed_simd). I'll let you decide how to balance that tradeoff.
This has been added to master via the target-features crate's suggested_simd_width function. I'm pretty sure this is usable with std::simd without any additional nightly features (as long as you assign the value to a const).
I guess I should note that the suggested_simd_width function isn't perfect: it can't recognize unusual cases like some AVX implementations being slower than others. However, in the context of multiversioning I'm not sure that's possible anyway, unless you're multiversioning on specific CPU models rather than features.