Expose mask to integer conversions in rten-simd
I wanted to port some code from the wide crate.
I work with bitmasks quite a bit for highly parallel search algorithms, and looked for a way to get a bitmask from the rten-simd masks, like I can via wide's to_bitmask methods, but couldn't find one.
It's possible I missed it, but searching the code for e.g. _mm256_movemask_epi8 also suggests it's only used as an implementation detail of other operations and is never exposed directly.
I work with bitmasks quite a bit for highly parallel search algorithms, and looked for a way to get a bitmask from the rten-simd masks, like I can via wide's to_bitmask methods, but couldn't find one.
You didn't miss anything, there isn't an API for this yet. I think the implementation would involve adding a new associated type to the Mask trait, to represent the bitmask itself, and then adding a to_bitmask method on the MaskOps trait or a sub-trait. This is similar to how numeric methods are exposed on the base NumOps trait or a sub-trait.
The bitmask type would presumably vary between ISAs depending on the lane count, and I'm not sure what RVV / SVE do (these ISAs are not supported yet due to Rust limitations, but the API is designed to be ready for them in future). Operations on the bitmask would probably be handled by specifying bounds on the bitmask type, similar to Simd::Array.
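As a very rough sketch of the shape I have in mind (trait and method names here are placeholders rather than the actual rten-simd traits):

```rust
// Placeholder sketch only, not the real traits from this crate.
trait Mask {
    /// Integer type holding one bit per lane. The concrete type would vary
    /// between ISAs depending on the lane count; operations on it would be
    /// provided via bounds here, similar to Simd::Array.
    type Bitmask: Copy;
}

// Could live on MaskOps itself or on a sub-trait like this.
trait MaskBitmaskOps: Mask {
    /// Pack the mask into an integer with one bit per lane.
    fn to_bitmask(self) -> Self::Bitmask;
}
```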
The bitmask type would presumably vary between ISAs depending on the lane count
One possible solution could be to instead leverage the bitvec crate which can already abstract over different storage types and lengths with a consistent bitmask API.
One possible solution could be to instead leverage the bitvec crate which can already abstract over different storage types and lengths with a consistent bitmask API.
Thanks for the suggestion, I'll take a look. A caveat is that this project has a fairly conservative approach to adding dependencies.
I looked into what Google's Highway library does for bitmasks.
- There is a BitsFromMask method that returns a u64. This is available for ISAs where the SIMD width is <= 512 bits. SVE and RVV can be wider, but I believe almost all current hardware is <= 512 bits.
- There is a StoreMaskBits method that writes to a user-provided u8 buffer. This is less efficient, depending on how smart the compiler is, but supports wider vectors.
- There are higher-level methods for testing masks. From what I can gather this seems to be the preferred approach in terms of performance portability.
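Loosely translated into Rust terms, the first two would look something like this (signatures only, purely illustrative):

```rust
// Loose Rust rendering of Highway's two lower-level mask-export methods.
// Names and shapes are illustrative, not a concrete proposal.
trait MaskExport {
    const LANES: usize;

    /// Like BitsFromMask: one bit per lane packed into a u64.
    /// Only usable while LANES <= 64, i.e. SIMD width <= 512 bits for i8 lanes.
    fn bits_from_mask(self) -> u64;

    /// Like StoreMaskBits: write ceil(LANES / 8) bytes into `out` and return
    /// the number of bytes written. Works for arbitrarily wide vectors.
    fn store_mask_bits(self, out: &mut [u8]) -> usize;
}
```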
How does the code you are porting use bitmasks?
How does the code you are porting use bitmasks?
It's somewhat complicated, but the simplified version is that it's essentially a search algorithm that needs to find the first position satisfying f(x) > threshold, with x being i32.
So it uses SIMD to do the batch calculations and comparisons on however many i32s fit into a vector on the current hardware, then narrows the results via i16s down to i8s, and finally converts to the bitmask, which even for the 64 i8s you can fit on AVX-512 still fits nicely into a single u64.
At this point it's just a matter of doing a single-instruction mask.trailing_zeros() to get the position within the chunk. In some other places I also need mask.count_ones() to know the number of matches.
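As a simplified, single-vector illustration of that pattern on AVX2 (the real code processes several vectors and narrows them before a single movemask; the function here is made up for illustration):

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,avx2")]
unsafe fn first_match_avx2(chunk: &[i32; 8], threshold: i32) -> Option<usize> {
    use std::arch::x86_64::*;
    let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
    let t = _mm256_set1_epi32(threshold);
    // Per-lane x > threshold; matching lanes become all-ones.
    let cmp = _mm256_cmpgt_epi32(v, t);
    // One bit per 32-bit lane, packed into the low 8 bits.
    let mask = _mm256_movemask_ps(_mm256_castsi256_ps(cmp)) as u32;
    if mask == 0 {
        None
    } else {
        // trailing_zeros() gives the first matching lane;
        // mask.count_ones() would give the number of matches.
        Some(mask.trailing_zeros() as usize)
    }
}
```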
- There is a BitsFromMask method that returns a u64. This is available for ISAs where the SIMD width is <= 512 bits. SVE and RVV can be wider, but I believe almost all current hardware is <= 512 bits.
Yeah I think u64 is a perfectly reasonable limit, although I don't see why we can't vary it by mask type via an associated type like you said.
Plus, that allows us to extend it to u128 in case some hardware requires it in the future, without penalising all the other users.
- There are higher-level methods for testing masks. From what I can gather this seems to be the preferred approach in terms of performance portability.
These higher-level methods look pretty good too, and would indeed satisfy the use case at hand. I don't know what use cases others might have.
Come to think of it, another alternative could be to depend on num_traits instead, and just expose fn to_bitmask() -> impl PrimInt.
PrimInt has all the required methods - https://docs.rs/num-traits/latest/num_traits/int/trait.PrimInt.html - and having an opaque type behind a generic trait would make it easy to extend to custom types for SVE in the future with full backward compatibility.
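For example, a sketch of how calling code could stay generic over whatever concrete bitmask type an ISA returns:

```rust
use num_traits::PrimInt;

/// Position of the first set bit, generic over any primitive integer,
/// so it keeps working whether an ISA returns u32, u64, u128, etc.
fn first_set_bit<B: PrimInt>(mask: B) -> Option<u32> {
    if mask.count_ones() == 0 {
        None
    } else {
        Some(mask.trailing_zeros())
    }
}

// first_set_bit(0b0010_0000u64) == Some(5); first_set_bit(0u32) == None.
```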
I think that could work. num_traits is already used indirectly by the main crate in this project.
The next question is how the complexity/efficiency of a to_bitmask implementation for each platform compares to the higher-level methods that could be used instead. On x86 getting a bitmask is trivial. From a quick look around the Highway codebase it looks a bit more involved for WASM and Arm Neon.
I think for Wasm it's fairly straightforward too - you just use i8x16.bitmask on each 128-bit vector.
For Neon... yeah, from a quick search it does seem a bit more involved.
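For reference, the usual Neon workaround I've seen is the shrn-by-4 "nibble mask" trick, roughly like this (sketch only; each matching byte lane becomes four set bits, so positions and counts need a divide by 4):

```rust
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn neon_nibble_mask(mask: std::arch::aarch64::uint8x16_t) -> u64 {
    use std::arch::aarch64::*;
    // Shift each 16-bit pair right by 4 and narrow to 8 bits, so every
    // 0x00/0xFF input byte contributes one nibble to a 64-bit value.
    let narrowed = vshrn_n_u16::<4>(vreinterpretq_u16_u8(mask));
    vget_lane_u64::<0>(vreinterpret_u64_u8(narrowed))
    // first match: result.trailing_zeros() / 4; match count: result.count_ones() / 4
}
```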
and having an opaque type behind a generic trait
Actually, I just realised there is also a place where I use pdep on the result, and pdep doesn't have a num-traits / high-level Rust API, so an opaque type wouldn't be enough.
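That is, the call needs a concrete integer type, roughly (illustrative only, requires BMI2):

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn scatter_matches(bitmask: u64, deposit_mask: u64) -> u64 {
    use std::arch::x86_64::_pdep_u64;
    // Deposit the low bits of `bitmask` into the positions where
    // `deposit_mask` has set bits; there is no trait-level equivalent,
    // so an opaque `impl PrimInt` return type wouldn't let me do this.
    _pdep_u64(bitmask, deposit_mask)
}
```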
Hmm... that's a bit more exotic. Something I didn't mention earlier is a general workaround for missing features in the portable APIs, which is to wrap and extend the base ISAs. I have a recent example of doing this for i8 x i8 -> i32 dot product instructions in https://github.com/robertknight/rten/blob/9d264af6e1c63cfe866320645817d44fc0f69c58/rten-gemm/src/i8dot.rs#L12. This does require duplicating the SimdOp::dispatch logic for the ISAs you want to support, but it has the upside that you can use whatever architecture-specific instructions you like.
Hm, yeah, but at that point I might as well use the std::arch APIs directly; it's the abstractions that are tempting.
I guess for now I can keep using wide, which already has said to_bitmask bindings for a variety of ISAs, but I like where the higher-level APIs of this project (and pulp, which I also looked at) are going and will keep an eye on it.