portable_simd bitmask + select generates poor code on ARM64
Consider the following piece of code:
#![feature(portable_simd)]
use core::simd::*;
type T = u32;
fn if_then_else64(mask: u64, if_true: &[T; 64], if_false: &[T; 64]) -> [T; 64] {
    let tv = Simd::<T, 64>::from_slice(if_true);
    let fv = Simd::<T, 64>::from_slice(if_false);
    let mv = Mask::<<T as SimdElement>::Mask, 64>::from_bitmask(mask);
    mv.select(tv, fv).to_array()
}
On Intel this generates decent code. With AVX-512 it uses masked moves; with only AVX2 it spreads the mask across a vector register and uses vpand and vpcmpeqd against constant vectors like [1, 2, 4, 8, 16, 32, 64, 128] to turn the bitmask into per-lane masks, which are then used to blend if_true and if_false with vblendvps.
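Spelled out in portable_simd terms, that AVX2 strategy looks roughly like the sketch below for a single 8-lane register (blend8 is a hypothetical helper for illustration, reusing the imports above; it is not the compiler's actual output):

fn blend8(mask: u8, t: Simd<u32, 8>, f: Simd<u32, 8>) -> Simd<u32, 8> {
    // One mask bit per lane: [1, 2, 4, 8, 16, 32, 64, 128].
    let bits = Simd::from_array([1u32, 2, 4, 8, 16, 32, 64, 128]);
    // Spread the mask across the register and test one bit per lane (vpand + vpcmpeqd).
    let hits = (Simd::splat(mask as u32) & bits).simd_eq(bits);
    // Blend the two inputs lane by lane (vblendvps).
    hits.select(t, f)
}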
On ARM it's a different story. It moves the mask into registers one bit at a time before finishing off with a bunch of non-trivial shuffle/comparison instructions and bsl.16b to finally do the blends. It is possible to apply essentially the same strategy as the code generated on Intel, here written out manually for u32:
fn if_then_else64_manual_u32(mut mask: u64, if_true: &[u32; 64], if_false: &[u32; 64]) -> [u32; 64] {
    let mut out = [0; 64];
    let mut offset = 0;
    // Handle the low 32 bits of the mask, then the high 32 bits.
    for _ in 0..2 {
        let mut bit = Simd::<u32, 4>::from_array([1, 2, 4, 8]);
        let mv = Simd::<u32, 4>::splat(mask as u32);
        for _ in 0..8 {
            let tv = Simd::<u32, 4>::from_slice(&if_true[offset..offset + 4]);
            let fv = Simd::<u32, 4>::from_slice(&if_false[offset..offset + 4]);
            // Test one mask bit per lane, then blend.
            let mv_full = (mv & bit).simd_eq(bit);
            let ret = mv_full.select(tv, fv);
            out[offset..offset + 4].copy_from_slice(&ret[..]);
            bit = bit << 4;
            offset += 4;
        }
        mask >>= 32;
    }
    out
}
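As a quick sanity check that the manual version computes the same thing as the portable one (the test values below are arbitrary and only for illustration):

#[test]
fn manual_matches_portable() {
    // Distinct values per lane so a wrongly selected lane is caught.
    let if_true: [u32; 64] = core::array::from_fn(|i| i as u32);
    let if_false: [u32; 64] = core::array::from_fn(|i| 1000 + i as u32);
    let mask = 0xDEAD_BEEF_F00D_CAFE_u64;
    assert_eq!(
        if_then_else64(mask, &if_true, &if_false),
        if_then_else64_manual_u32(mask, &if_true, &if_false)
    );
}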
Note that the above isn't some novel trick; it's almost entirely a 1:1 translation of what the compiler generates on AVX2, just using 4-wide instead of 8-wide registers. You can see for yourself how similar the Intel assembly for if_then_else64 and the ARM assembly for if_then_else64_manual_u32 are.
The above is ~3.2x faster than if_then_else64 on my Apple M1 machine, assuming all data is in cache. I would really like the compiler to generate this code automatically, just like it does on Intel.
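For reference, a minimal sketch of how such a comparison could be timed (this harness is hypothetical, not the exact setup behind the ~3.2x number; a real measurement would be better done with something like criterion):

use std::time::Instant;

// Hypothetical harness, only to illustrate the comparison; the inputs are small
// enough that everything stays in cache, matching the caveat above.
fn bench(name: &str, f: impl Fn(u64, &[u32; 64], &[u32; 64]) -> [u32; 64]) {
    let if_true: [u32; 64] = core::array::from_fn(|i| i as u32);
    let if_false = [0u32; 64];
    let mut checksum = 0u32;
    let start = Instant::now();
    for i in 0..10_000_000u64 {
        // Vary the mask each iteration so the selects cannot be hoisted out of the loop.
        let out = f(i.wrapping_mul(0x9E37_79B9_7F4A_7C15), &if_true, &if_false);
        checksum = checksum.wrapping_add(out[0]);
    }
    println!("{name}: {:?} (checksum {checksum})", start.elapsed());
}

fn main() {
    bench("bitmask + select", if_then_else64);
    bench("manual u32", if_then_else64_manual_u32);
}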
I think this is an LLVM issue, as it can be reproduced directly in LLVM IR, and also with fewer lanes so that the vector matches the native vector size: https://llvm.godbolt.org/z/7YeGs6Ezs
My understanding is that this mask handling for AVX happens relatively late in the backend and is also not consistent: for example, it works with the platform-specific llvm.x86.avx2.gather intrinsics, but not with the portable llvm.masked.gather.
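For reference, the portable path I mean can be written with portable_simd as in the sketch below (masked_gather8 is a hypothetical helper reusing the imports from above; that it ends up as llvm.masked.gather and therefore hits the inconsistent mask handling is my assumption):

fn masked_gather8(mask: u8, data: &[u32], idxs: Simd<usize, 8>) -> Simd<u32, 8> {
    let enable = Mask::<isize, 8>::from_bitmask(mask as u64);
    // Disabled or out-of-bounds lanes fall back to the `or` vector (here all zeros).
    Simd::<u32, 8>::gather_select(data, enable, idxs, Simd::splat(0))
}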