
Use std::arch for SIMD and target_feature

Open bluss opened this issue 9 years ago • 9 comments

See rust-lang/rust/issues/29717

Use it to select implementations for the unrolled dot product and scalar sum.

bluss avatar Jan 09 '16 11:01 bluss

Preferred approach would be to move the heavy lifting and inner loops (dot product etc) to a separate crate in the style of https://github.com/bluss/numeric-loops or another existing already simdified crate.
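For reference, the style of inner loop used in numeric-loops is an unrolled reduction with multiple independent accumulators, which gives LLVM the freedom to autovectorize. A simplified sketch (not the crate's actual code):

```rust
/// Eightfold-unrolled dot product in the style of numeric-loops
/// (a simplified sketch, not that crate's actual code). The eight
/// independent accumulators let LLVM reassociate and vectorize the
/// reduction, which a single serial accumulator would prevent.
pub fn dot_unrolled(xs: &[f64], ys: &[f64]) -> f64 {
    debug_assert_eq!(xs.len(), ys.len());
    let mut acc = [0.0f64; 8];
    // Split off the part that divides evenly into chunks of 8.
    let (head_x, tail_x) = xs.split_at(xs.len() / 8 * 8);
    let (head_y, tail_y) = ys.split_at(head_x.len());
    for (cx, cy) in head_x.chunks_exact(8).zip(head_y.chunks_exact(8)) {
        for k in 0..8 {
            acc[k] += cx[k] * cy[k];
        }
    }
    // Combine the accumulators, then handle the remainder serially.
    let mut sum: f64 = acc.iter().sum();
    for (&x, &y) in tail_x.iter().zip(tail_y) {
        sum += x * y;
    }
    sum
}
```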

bluss avatar Nov 13 '18 22:11 bluss

@bluss I am contributing to std::arch to help make it a stable feature as soon as possible, and I would like to undertake the SIMD work on ndarray. I think we can create a new branch from master for implementation and discussion.

The following is a very simple example:

#![feature(stdsimd)]
#![feature(stdsimd_internal)]
use ndarray::*;
use core_arch::simd::*;
use core_arch::simd_llvm::*;
use std::intrinsics::transmute;
use core_arch::arch::x86_64::{__m128bh, m128bhExt};

// Just for demonstration, much faster way is supposed to be used.
pub fn simd_arr1(xs: &[i32]) -> Array1<i32x4> {
    let len = xs.len();
    assert!(len % 4 == 0);
    let mut i = 0;
    let mut v: Vec<i32x4> = Vec::new();
    while i + 4 <= len {
        v.push(i32x4::new(xs[i], xs[i+1], xs[i+2], xs[i+3]));
        i += 4;
    }
    ArrayBase::from(v)
}

fn main() {
    let a = arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    let b = arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    let c = Zip::from(&a).and(&b).map_collect(|x, y| x * y);
    println!("{}", c);

    let a_simd = simd_arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    let b_simd = simd_arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    unsafe {
        let c_simd = Zip::from(&a_simd)
            .and(&b_simd)
            .map_collect(|x, y| {
                simd_mul(
                    transmute::<_, __m128bh>(x.clone()),
                    transmute::<_, __m128bh>(y.clone()),
                )
                .as_i32x4()
            });
        println!("{:?}", c_simd);
    }
}

Output:

[1, 4, 9, 16, 25, 36, 49, 64]
[i32x4(1, 4, 9, 16), i32x4(25, 36, 49, 64)], shape=[2], strides=[1], layout=CFcf (0xf), const ndim=1

SparrowLii avatar Mar 09 '21 03:03 SparrowLii

Hey, it's good if we talk about this before you get started. Notice that this issue is not intended to be about arrays using those explicit SIMD element types at all; that would be a different design. Accelerating operations on Array<f64, _> would be a lot more interesting.

IMO simd that we are most interested in, for x86 at least, is already stable.

Notice also in this issue that I have suggested that any simd code like that happens in a new crate that we depend on. That means, it is not part of the ndarray crate.
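Concretely, the stable std::arch path looks like this: runtime feature detection plus a `#[target_feature]` kernel, with a scalar fallback. A minimal sketch, assuming x86_64 (`mul_slices` and `mul_avx` are illustrative names, not ndarray API):

```rust
/// Element-wise multiply with runtime AVX dispatch (illustrative sketch).
pub fn mul_slices(a: &[f64], b: &[f64], out: &mut [f64]) {
    assert!(a.len() == b.len() && a.len() == out.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Safe: we just verified AVX is available at runtime.
            unsafe { mul_avx(a, b, out) };
            return;
        }
    }
    // Portable scalar fallback.
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x * y;
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn mul_avx(a: &[f64], b: &[f64], out: &mut [f64]) {
    use std::arch::x86_64::*;
    let n = a.len() / 4 * 4; // four f64 lanes per 256-bit register
    let mut i = 0;
    while i < n {
        unsafe {
            let va = _mm256_loadu_pd(a.as_ptr().add(i));
            let vb = _mm256_loadu_pd(b.as_ptr().add(i));
            _mm256_storeu_pd(out.as_mut_ptr().add(i), _mm256_mul_pd(va, vb));
        }
        i += 4;
    }
    // Remainder handled with safe scalar indexing.
    while i < a.len() {
        out[i] = a[i] * b[i];
        i += 1;
    }
}
```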

bluss avatar Mar 09 '21 19:03 bluss

@bluss Then I hope we can create such a crate under rust-ndarray (instead of as a personal crate). Do we need a crate similar to universal intrinsics? Or we could also refer to usimd in NumPy. Yes, std::arch for x86 and x86_64 is already stable, so I can start from there right away.

SparrowLii avatar Mar 09 '21 20:03 SparrowLii

I tried using SIMD in the operator overloading for multiplication, here, and put the avx512f intrinsics in another crate. Then I ran a SIMD benchmark on 500x500 arrays: main.rs:

use ndarray::Array;
use std::time;
use ndarray_rand::RandomExt;
use ndarray_rand::rand::distributions::Uniform;

fn main() {
    // f64
    let a = Array::random((500, 500), Uniform::new(0., 2.));
    let b = Array::random((500, 500), Uniform::new(0., 2.));
    let start = time::Instant::now();
    let c_simd = &a * &b;
    println!("simd f64 {:?}", start.elapsed());

    let start = time::Instant::now();
    let c = a * b;
    println!("normal f64 {:?}", start.elapsed());
    assert_eq!(c_simd, c);

    // i32
    let a = Array::random((500, 500), Uniform::new(0, 255));
    let b = Array::random((500, 500), Uniform::new(0, 255));
    let start = time::Instant::now();
    let c_simd = &a * &b;
    println!("simd i32 {:?}", start.elapsed());

    let start = time::Instant::now();
    let c = a * b;
    println!("normal i32 {:?}", start.elapsed());
    assert_eq!(c_simd, c);
}

The result is as follows:

simd f64 6.6887ms
normal f64 14.7793ms
simd i32 3.4118ms
normal i32 13.6641ms

The f64 operation is accelerated by more than 2x, and the i32 operation by more than 4x.

I'm wondering if I am working in the right direction.

SparrowLii avatar Mar 13 '21 10:03 SparrowLii

@bluss Could you help pointing out which methods in ndarray should use simd in the first place?

SparrowLii avatar Mar 14 '21 15:03 SparrowLii

Here is my plan:

  1. Build an easier-to-use SIMD crate on top of stdarch and stdsimd that implements automatic detection of hardware features and abstracts over vector lengths.
  2. Help the compiler team complete specialization. That way, SIMD acceleration could be achieved with few changes to ndarray, and it would also solve the broadcasting issue. This looks crazy, but I will try my best.
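Until specialization lands, a rough stable-Rust approximation of point 2 is a helper trait with a default scalar method that individual types can override with a tuned kernel. All names below are hypothetical, not ndarray API:

```rust
use std::ops::Mul;

/// Hypothetical per-type dispatch on stable Rust, approximating what
/// specialization would allow. Types without a hand-tuned kernel fall
/// back to the default scalar loop.
trait ElemwiseMul: Copy + Mul<Output = Self> {
    // Default path: plain scalar loop (the optimizer may autovectorize it).
    fn mul_slices(a: &[Self], b: &[Self], out: &mut [Self]) {
        for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
            *o = x * y;
        }
    }
}

// i32 keeps the default scalar implementation.
impl ElemwiseMul for i32 {}

// f64 overrides the method; the body here is a scalar stand-in, but a
// real implementation would call into a std::arch SIMD kernel.
impl ElemwiseMul for f64 {
    fn mul_slices(a: &[f64], b: &[f64], out: &mut [f64]) {
        for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
            *o = x * y;
        }
    }
}
```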

SparrowLii avatar Apr 29 '21 09:04 SparrowLii

I think you may be interested in this project; when SIMD lands in std, ndarray could possibly support it to further improve its performance.

dafmdev avatar Jul 26 '23 13:07 dafmdev

> Preferred approach would be to move the heavy lifting and inner loops (dot product etc) to a separate crate in the style of https://github.com/bluss/numeric-loops or another existing already simdified crate.

Is anybody working on this, or any reason I shouldn't attempt it?

Just to clarify, I'm assuming this means extracting the internal contents (like loops and basic operations) of the existing ndarray functions into a separate crate ndarray-core, which could then be feature-flagged or swapped with another?

skewballfox avatar Jan 04 '24 01:01 skewballfox