Struct target features RFC
As an aside, clang and gcc support multi-versioning for functions: https://maskray.me/blog/2023-02-05-function-multi-versioning
Compilers largely don't know the semantics of ifunc and are very conservative. Ifunc defeats most interprocedural optimizations. We can see that the target_clones function foo is not inlined into foo_plus_1. Fortunately, functions called by a target_clones function are still inlinable.
An advantage of the current proposal is that the indirection/specialization can be hoisted higher without an exclusive entry point, with inlining freely supported beneath.
I love the concept of integrating target features with the type system. It's a creative way to solve the problem of statically guaranteeing you have already detected a given feature.
How do you anticipate this scaling for code that has a half-dozen or more different optimized versions, for a few different targets, and wants to handle both compile-time and runtime detection? Could you give a sketch of how you think such code could look?
Using a generic for the feature type avoids duplication at the call site and the called function. What I'm wondering about is how the detection scales.
Would it potentially make sense for the standard library to have a single magic fn detect_features -> impl Simd that does all the detection and returns a static but opaque type for the program to pass around?
That wouldn't preclude us from also having individual types for code that wants to statically guarantee a particular feature.
I don't know if fn detect_features() -> impl Simd would be that great an option for the runtime-detection use case. It would have to return a type known at compile time, so it could only enable the features that were already enabled at compile time, not those detected at runtime.
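A small illustration of that limitation (the names Simd and detect_features are hypothetical here, not an existing std API): the concrete type behind impl Trait is fixed at compile time, so the function cannot return a different token depending on what runtime detection finds.

```rust
// Hypothetical sketch: `impl Trait` is one concrete type chosen at compile
// time, so a single detect_features() cannot vary its return type at runtime.
trait Simd {
    fn width(&self) -> usize;
}

struct Scalar;
impl Simd for Scalar {
    fn width(&self) -> usize { 1 }
}

#[allow(dead_code)]
struct Avx;
impl Simd for Avx {
    fn width(&self) -> usize { 4 }
}

fn detect_features(avx_available: bool) -> impl Simd {
    // The following would be a compile error: both arms must have the
    // same concrete type.
    // if avx_available { Avx } else { Scalar }
    let _ = avx_available;
    Scalar // only one concrete type can ever be returned
}
```

So whatever the runtime detection says, callers always get the same compile-time type back, which is exactly the objection above.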
The way I envision it scaling up in user code is something like this (simplified), based on the design I chose for my SIMD project: https://docs.rs/pulp/0.18.8/pulp/trait.Simd.html
With a bit more work, the code can be made generic over the data type as well. This scales up quite well and forms the basis of faer, a high-performance linear algebra library: https://docs.rs/faer-core/0.17.1/faer_core/group_helpers/struct.SimdFor.html
This is only an example, and alternative designs are possible; I'm just sharing what already works for me.
use core::arch::{x86_64, Scalar};
use core::mem::transmute;
use bytemuck::Pod;

trait F64Simd {
    type f64s: Copy + Pod; // plus any other traits a user might want
    fn add_f64s(self, a: Self::f64s, b: Self::f64s) -> Self::f64s;
}

impl F64Simd for Scalar {
    type f64s = f64;

    #[inline]
    fn add_f64s(self, a: Self::f64s, b: Self::f64s) -> Self::f64s {
        a + b
    }
}

impl F64Simd for x86_64::Avx {
    type f64s = [f64; 4];

    #[inline]
    fn add_f64s(self, a: Self::f64s, b: Self::f64s) -> Self::f64s {
        unsafe { transmute(x86_64::_mm256_add_pd(transmute(a), transmute(b))) }
    }
}

impl F64Simd for x86_64::Avx512f {
    type f64s = [f64; 8];

    #[inline]
    fn add_f64s(self, a: Self::f64s, b: Self::f64s) -> Self::f64s {
        unsafe { transmute(x86_64::_mm512_add_pd(transmute(a), transmute(b))) }
    }
}

pub fn add_comptime<S: F64Simd>(simd: S, dst: &mut [f64], a: &[f64], b: &[f64]) {
    // assume the slice length is a multiple of the register size for simplicity
    let dst = bytemuck::cast_slice_mut::<f64, S::f64s>(dst);
    let a = bytemuck::cast_slice::<f64, S::f64s>(a);
    let b = bytemuck::cast_slice::<f64, S::f64s>(b);
    for ((dst, &a), &b) in dst.iter_mut().zip(a).zip(b) {
        *dst = simd.add_f64s(a, b);
    }
}

pub fn add_runtime(dst: &mut [f64], a: &[f64], b: &[f64]) {
    if let Some(simd) = x86_64::Avx512f::try_new() {
        return add_comptime(simd, dst, a, b);
    }
    if let Some(simd) = x86_64::Avx::try_new() {
        return add_comptime(simd, dst, a, b);
    }
    add_comptime(Scalar, dst, a, b)
}
This RFC is based on the assumption that try_new_avx512f() would be simpler and faster than is_x86_feature_detected!("avx512f"), but would it be?
I am not making that assumption. In fact, Avx512f::try_new could just be implemented as if is_x86_feature_detected!("avx512f") { Some(Self) } else { None }
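To make that concrete, here is a hedged sketch of how such a try_new could wrap today's detection macro. The Avx512f type below is a local stand-in for illustration, not the RFC's actual core::arch type.

```rust
// Stand-in capability token; under the RFC this would be a core::arch type
// carrying the target-feature guarantee in the type system.
#[derive(Copy, Clone, Debug)]
pub struct Avx512f {
    _private: (),
}

impl Avx512f {
    /// Returns a token only when AVX-512F is detected at runtime.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    pub fn try_new() -> Option<Self> {
        if std::arch::is_x86_feature_detected!("avx512f") {
            Some(Self { _private: () })
        } else {
            None
        }
    }

    /// On non-x86 targets the feature can never be present.
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    pub fn try_new() -> Option<Self> {
        None
    }
}
```

The macro already caches its result in std, so wrapping it this way adds essentially nothing on top of the existing detection cost.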
Note that the dispatch happens outside the loop, which means we only check feature availability once before using a vectorized implementation on the whole slice.
If needed, the dispatch can be moved even further out from the inner loop when there are multiple layers and you want everything to be inlined.
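The multiple-layers point can be sketched with plain marker types (all names here are hypothetical and no real intrinsics are used): the branch runs once in a top-level entry point, and every layer below it is generic over the token, so the compiler can inline across layers freely.

```rust
// Capability tokens as zero-sized marker types (hypothetical names).
trait Simd: Copy {
    fn lanes(self) -> usize;
}

#[derive(Copy, Clone)]
struct Scalar;
#[derive(Copy, Clone)]
struct Wide4;

impl Simd for Scalar {
    fn lanes(self) -> usize { 1 }
}
impl Simd for Wide4 {
    fn lanes(self) -> usize { 4 }
}

// Two layers of kernels, both generic over the token. Because the dispatch
// happened above them, kernel_a can be inlined into kernel_b with no
// per-call feature check.
#[inline]
fn kernel_a<S: Simd>(simd: S, n: usize) -> usize {
    n / simd.lanes() // e.g. number of register-sized chunks
}

#[inline]
fn kernel_b<S: Simd>(simd: S, n: usize) -> usize {
    kernel_a(simd, n) + 1
}

// Single dispatch point, hoisted above every layer and every loop.
fn entry(wide_available: bool, n: usize) -> usize {
    if wide_available {
        kernel_b(Wide4, n)
    } else {
        kernel_b(Scalar, n)
    }
}
```

In real code the `wide_available` flag would come from a one-time runtime detection such as Avx::try_new() in the earlier example.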
FWIW I really like this proposal, and I think it has a lot of potential for safe low-level SIMD.
I am taking a stab at implementing it to see what the resulting code would look like, as well as possible issues (for example: how does this interact with the ABI of the function? What about function pointers?).
I will update when I have something closer to being mergeable :-)
This RFC is exactly what I want, and I've even built much of the same structure by hand to get something similar in today's Rust. I have types that correspond to SIMD implementations of my algorithm's primitives, which are unsafe to construct, and I parameterize my algorithm with those types. As pseudocode, it looks like:
trait Primitives {
    fn operation_1(&self);
    fn operation_2(&self);
}

mod scalar {
    pub struct Scalar;

    impl super::Primitives for Scalar {
        fn operation_1(&self) {}
        fn operation_2(&self) {}
    }
}

mod neon {
    pub struct Neon(());

    impl Neon {
        pub unsafe fn new_unchecked() -> Self { Self(()) }
    }

    impl super::Primitives for Neon {
        fn operation_1(&self) { unsafe { operation_1_neon() } }
        fn operation_2(&self) { unsafe { operation_2_neon() } }
    }

    // Annoying
    #[target_feature(enable = "neon")]
    unsafe fn operation_1_neon() {}

    // Annoying
    #[target_feature(enable = "neon")]
    unsafe fn operation_2_neon() {}
}

struct MyAwesomeHasher;

impl std::hash::Hasher for MyAwesomeHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Annoying
        #[target_feature(enable = "neon")]
        unsafe fn do_neon(primitives: impl Primitives, this: &mut MyAwesomeHasher, bytes: &[u8]) {
            write_common(primitives, this, bytes)
        }

        // Annoying
        fn do_scalar(primitives: impl Primitives, this: &mut MyAwesomeHasher, bytes: &[u8]) {
            write_common(primitives, this, bytes)
        }

        if is_aarch64_feature_detected!("neon") {
            unsafe { do_neon(neon::Neon::new_unchecked(), self, bytes) }
        } else {
            do_scalar(scalar::Scalar, self, bytes)
        }
    }

    // Ditto all that for `Hasher::finish`
}

fn write_common(primitives: impl Primitives, this: &mut MyAwesomeHasher, bytes: &[u8]) {}

// Assume everything has an inline on it, some are `inline(always)`.
I've annotated a few spots with "Annoying" where I have to step out of my normal Rust flow and do something janky just to use target_feature, since it is limited to functions. Being able to attach it to a type, which then beneficially infects the places the type is used, will be so much nicer. If I understand the proposal correctly, it would shorten my code dramatically while making it look more like idiomatic Rust:
trait Primitives {
    fn operation_1(&self);
    fn operation_2(&self);
}

mod scalar {
    pub struct Scalar;

    impl super::Primitives for Scalar {
        fn operation_1(&self) {}
        fn operation_2(&self) {}
    }
}

mod neon {
    #[target_feature(enable = "neon")]
    pub struct Neon(());

    // Yay! Get the unsafe constructor for free
    // Yay! No longer have to pull bodies out to new functions
    impl super::Primitives for Neon {
        fn operation_1(&self) {}
        fn operation_2(&self) {}
    }
}

// Yay! Now I know that my SIMD usage in my code will always be NEON or scalar or ...
enum MyAwesomeHasher {
    Neon(MyAwesomeHasherRaw<neon::Neon>),
    Scalar(MyAwesomeHasherRaw<scalar::Scalar>),
}

impl MyAwesomeHasher {
    fn new() -> Self {
        if is_aarch64_feature_detected!("neon") {
            unsafe { Self::Neon(MyAwesomeHasherRaw(neon::Neon::new_unchecked())) }
        } else {
            Self::Scalar(MyAwesomeHasherRaw(scalar::Scalar))
        }
    }
}

impl std::hash::Hasher for MyAwesomeHasher {
    // Assume a delegating call in here, enum-dispatch style
}

// Yay! I can now give callers an easy way to force a specific SIMD
// implementation.
struct MyAwesomeHasherRaw<P>(P);

impl<P: Primitives> std::hash::Hasher for MyAwesomeHasherRaw<P> {
    fn write(&mut self, bytes: &[u8]) {
        // Yay! No longer have to have the little shim functions
        // Yay! No longer have to pull bodies out to new functions
    }

    // Ditto all that for `Hasher::finish`
}
I'm very excited to see how this progresses! Thank you for the awesome RFC @sarah-ek !