Struct target features RFC
As an aside, clang and gcc support multi-versioning for functions: https://maskray.me/blog/2023-02-05-function-multi-versioning
Compilers largely don't know the semantics of ifunc and are very conservative. Ifunc defeats most interprocedural optimizations. We can see that the target_clones function foo is not inlined into foo_plus_1. Fortunately, functions called by a target_clones function are still inlinable.
An advantage of the current proposal is that the indirection/specialization can be hoisted higher without an exclusive entry point, with inlining freely supported beneath.
I love the concept of integrating target features with the type system. It's a creative way to solve the problem of statically guaranteeing you have already detected a given feature.
How do you anticipate this scaling for code that has a half-dozen or more different optimized versions, for a few different targets, and wants to handle both compile-time and runtime detection? Could you give a sketch of how you think such code could look?
Using a generic for the feature type avoids duplication at the call site and the called function. What I'm wondering about is how the detection scales.
Would it potentially make sense for the standard library to have a single magic fn detect_features -> impl Simd that does all the detection and returns a static but opaque type for the program to pass around?
That wouldn't preclude us from also having individual types for code that wants to statically guarantee a particular feature.
I don't know if fn detect_features() -> impl Simd would be that great an option for the runtime-detection use case. It would have to return a type known at compile time, so it could only enable the features that were already enabled at compile time, not those detected at runtime.
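A small illustration of that limitation (the names Simd and detect_features are hypothetical here, not an existing std API): the concrete type behind impl Trait is fixed at compile time, so the function cannot return a different token depending on what runtime detection finds.

```rust
// Hypothetical sketch: `impl Trait` is one concrete type chosen at compile
// time, so a single detect_features() cannot vary its return type at runtime.
trait Simd {
    fn width(&self) -> usize;
}

struct Scalar;
impl Simd for Scalar {
    fn width(&self) -> usize { 1 }
}

#[allow(dead_code)]
struct Avx;
impl Simd for Avx {
    fn width(&self) -> usize { 4 }
}

fn detect_features(avx_available: bool) -> impl Simd {
    // The following would be a compile error: both arms must have the
    // same concrete type.
    // if avx_available { Avx } else { Scalar }
    let _ = avx_available;
    Scalar // only one concrete type can ever be returned
}
```

So whatever the runtime detection says, callers always get the same compile-time type back, which is exactly the objection above.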
The way I envision it scaling up in user code is something like this (simplified), based on the design I chose for my SIMD project: https://docs.rs/pulp/0.18.8/pulp/trait.Simd.html
With a bit more work, the code can be made generic over the data type as well. This scales up quite well and forms the basis of faer, a high-performance linear algebra library: https://docs.rs/faer-core/0.17.1/faer_core/group_helpers/struct.SimdFor.html
This is only an example, and alternative designs are possible; I'm just sharing what already works for me.
use core::arch::{x86_64, Scalar};
use core::mem::transmute;
use bytemuck::Pod;

trait F64Simd {
    type f64s: Copy + Pod; // plus any other traits a user might want
    fn add_f64s(self, a: Self::f64s, b: Self::f64s) -> Self::f64s;
}

impl F64Simd for Scalar {
    type f64s = f64;

    #[inline]
    fn add_f64s(self, a: Self::f64s, b: Self::f64s) -> Self::f64s {
        a + b
    }
}

impl F64Simd for x86_64::Avx {
    type f64s = [f64; 4];

    #[inline]
    fn add_f64s(self, a: Self::f64s, b: Self::f64s) -> Self::f64s {
        unsafe { transmute(x86_64::_mm256_add_pd(transmute(a), transmute(b))) }
    }
}

impl F64Simd for x86_64::Avx512f {
    type f64s = [f64; 8];

    #[inline]
    fn add_f64s(self, a: Self::f64s, b: Self::f64s) -> Self::f64s {
        unsafe { transmute(x86_64::_mm512_add_pd(transmute(a), transmute(b))) }
    }
}

pub fn add_comptime<S: F64Simd>(simd: S, dst: &mut [f64], a: &[f64], b: &[f64]) {
    // assume the slice length is a multiple of the register size for simplicity
    let dst = bytemuck::cast_slice_mut::<f64, S::f64s>(dst);
    let a = bytemuck::cast_slice::<f64, S::f64s>(a);
    let b = bytemuck::cast_slice::<f64, S::f64s>(b);
    for ((dst, &a), &b) in dst.iter_mut().zip(a).zip(b) {
        *dst = simd.add_f64s(a, b);
    }
}

pub fn add_runtime(dst: &mut [f64], a: &[f64], b: &[f64]) {
    if let Some(simd) = x86_64::Avx512f::try_new() {
        return add_comptime(simd, dst, a, b);
    }
    if let Some(simd) = x86_64::Avx::try_new() {
        return add_comptime(simd, dst, a, b);
    }
    add_comptime(Scalar, dst, a, b)
}
This RFC is based on the assumption that try_new_avx512f() would be simpler and faster than is_x86_feature_detected!("avx512f"), but would it be?
I am not making that assumption. In fact, Avx512f::try_new could just be implemented as if is_x86_feature_detected!("avx512f") { Some(Self) } else { None }
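To make that concrete, here is a hedged sketch of how such a try_new could wrap today's detection macro. The Avx512f type below is a local stand-in for illustration, not the RFC's actual core::arch type.

```rust
// Stand-in capability token; under the RFC this would be a core::arch type
// carrying the target-feature guarantee in the type system.
#[derive(Copy, Clone, Debug)]
pub struct Avx512f {
    _private: (),
}

impl Avx512f {
    /// Returns a token only when AVX-512F is detected at runtime.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    pub fn try_new() -> Option<Self> {
        if std::arch::is_x86_feature_detected!("avx512f") {
            Some(Self { _private: () })
        } else {
            None
        }
    }

    /// On non-x86 targets the feature can never be present.
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    pub fn try_new() -> Option<Self> {
        None
    }
}
```

The macro already caches its result in std, so wrapping it this way adds essentially nothing on top of the existing detection cost.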
Note that the dispatch happens outside the loop, which means we only check feature availability once before using a vectorized implementation on the whole slice.
If needed, the dispatch can be moved even further out from the inner loop when there are multiple layers and you want everything to be inlined.
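The multiple-layers point can be sketched with plain marker types (all names here are hypothetical and no real intrinsics are used): the branch runs once in a top-level entry point, and every layer below it is generic over the token, so the compiler can inline across layers freely.

```rust
// Capability tokens as zero-sized marker types (hypothetical names).
trait Simd: Copy {
    fn lanes(self) -> usize;
}

#[derive(Copy, Clone)]
struct Scalar;
#[derive(Copy, Clone)]
struct Wide4;

impl Simd for Scalar {
    fn lanes(self) -> usize { 1 }
}
impl Simd for Wide4 {
    fn lanes(self) -> usize { 4 }
}

// Two layers of kernels, both generic over the token. Because the dispatch
// happened above them, kernel_a can be inlined into kernel_b with no
// per-call feature check.
#[inline]
fn kernel_a<S: Simd>(simd: S, n: usize) -> usize {
    n / simd.lanes() // e.g. number of register-sized chunks
}

#[inline]
fn kernel_b<S: Simd>(simd: S, n: usize) -> usize {
    kernel_a(simd, n) + 1
}

// Single dispatch point, hoisted above every layer and every loop.
fn entry(wide_available: bool, n: usize) -> usize {
    if wide_available {
        kernel_b(Wide4, n)
    } else {
        kernel_b(Scalar, n)
    }
}
```

In real code the `wide_available` flag would come from a one-time runtime detection such as Avx::try_new() in the earlier example.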
FWIW I really like this proposal, and I think it has a lot of potential for safe low-level SIMD.
I am taking a stab at implementing it to see what the resulting code would look like, as well as possible issues (for example: how does this interact with the ABI of the function? What about function pointers?).
I will update when I have something closer to being mergeable :-)
This RFC is exactly what I want, and I've even built much of the same structure by hand to get something similar in today's Rust. I have types that correspond to SIMD implementations of my algorithm's primitives, which are unsafe to construct, and I parameterize my algorithm with those types. As pseudocode, it looks like:
trait Primitives {
    fn operation_1(&self);
    fn operation_2(&self);
}

mod scalar {
    pub struct Scalar;

    impl super::Primitives for Scalar {
        fn operation_1(&self) {}
        fn operation_2(&self) {}
    }
}

mod neon {
    pub struct Neon(());

    impl Neon {
        pub unsafe fn new_unchecked() -> Self { Self(()) }
    }

    impl super::Primitives for Neon {
        fn operation_1(&self) { unsafe { operation_1_neon() } }
        fn operation_2(&self) { unsafe { operation_2_neon() } }
    }

    // Annoying
    #[target_feature(enable = "neon")]
    unsafe fn operation_1_neon() {}

    // Annoying
    #[target_feature(enable = "neon")]
    unsafe fn operation_2_neon() {}
}

struct MyAwesomeHasher;

impl std::hash::Hasher for MyAwesomeHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Annoying
        #[target_feature(enable = "neon")]
        unsafe fn do_neon(primitives: impl Primitives, this: &mut MyAwesomeHasher, bytes: &[u8]) {
            write_common(primitives, this, bytes)
        }

        // Annoying
        fn do_scalar(primitives: impl Primitives, this: &mut MyAwesomeHasher, bytes: &[u8]) {
            write_common(primitives, this, bytes)
        }

        if is_aarch64_feature_detected!("neon") {
            unsafe { do_neon(neon::Neon::new_unchecked(), self, bytes) }
        } else {
            do_scalar(scalar::Scalar, self, bytes)
        }
    }

    // Ditto all that for `Hasher::finish`
}

fn write_common(primitives: impl Primitives, this: &mut MyAwesomeHasher, bytes: &[u8]) {}

// Assume everything has an inline on it, some are `inline(always)`.
I've annotated a few spots with "Annoying" where I have to step out of my normal Rust flow and do something janky just to use target_feature, since it is limited to functions. Being able to attach it to a type, which then beneficially infects the places the type is used, will be so much nicer. If I understand the proposal correctly, it would shorten my code dramatically while making it look more like idiomatic Rust:
trait Primitives {
    fn operation_1(&self);
    fn operation_2(&self);
}

mod scalar {
    pub struct Scalar;

    impl super::Primitives for Scalar {
        fn operation_1(&self) {}
        fn operation_2(&self) {}
    }
}

mod neon {
    #[target_feature(enable = "neon")]
    pub struct Neon(());

    // Yay! Get the unsafe constructor for free
    // Yay! No longer have to pull bodies out to new functions
    impl super::Primitives for Neon {
        fn operation_1(&self) {}
        fn operation_2(&self) {}
    }
}

// Yay! Now I know that my SIMD usage in my code will always be NEON or scalar or ...
enum MyAwesomeHasher {
    Neon(MyAwesomeHasherRaw<neon::Neon>),
    Scalar(MyAwesomeHasherRaw<scalar::Scalar>),
}

impl MyAwesomeHasher {
    fn new() -> Self {
        if is_aarch64_feature_detected!("neon") {
            unsafe { Self::Neon(MyAwesomeHasherRaw(neon::Neon::new_unchecked())) }
        } else {
            Self::Scalar(MyAwesomeHasherRaw(scalar::Scalar))
        }
    }
}

impl std::hash::Hasher for MyAwesomeHasher {
    // Assume a delegating call in here, enum-dispatch style
}

// Yay! I can now give callers an easy way to force a specific SIMD
// implementation.
struct MyAwesomeHasherRaw<P>(P);

impl<P: Primitives> std::hash::Hasher for MyAwesomeHasherRaw<P> {
    fn write(&mut self, bytes: &[u8]) {
        // Yay! No longer have to have the little shim functions
        // Yay! No longer have to pull bodies out to new functions
    }

    // Ditto all that for `Hasher::finish`
}
I'm very excited to see how this progresses! Thank you for the awesome RFC @sarah-ek !