Provide a SIMD implementation of swisstable_group_query suitable for ARM
Briefly mentioned in #16, but as ARM devices become more popular, it would be great to have an accelerated implementation for them as well.
According to this comment in hashbrown it might not be worth the trouble:
// Use the SSE2 implementation if possible: it allows us to scan 16 buckets
// at once instead of 8. We don't bother with AVX since it would require
// runtime dispatch and wouldn't gain us much anyways: the probability of
// finding a match drops off drastically after the first few buckets.
//
// I attempted an implementation on ARM using NEON instructions, but it
// turns out that most NEON instructions have multi-cycle latency, which in
// the end outweighs any gains over the generic implementation.
Also, according to local benchmarks someone ran for me on an M1 Mac mini, the non-SIMD version there still easily outperformed the SIMD version on an AMD Ryzen 9 5900X 😃
I just found this PR/discussion in the hashbrown repo: https://github.com/rust-lang/hashbrown/pull/269 Very interesting!
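For context, here is a minimal sketch of what a non-SIMD ("generic") group query over 8 control bytes can look like, which is the baseline any NEON version would have to beat. It is written in Rust only because the quoted comment comes from hashbrown. The names `match_byte_generic` and `matching_lanes` are hypothetical and are not taken from this repo or from hashbrown's API. It assumes a group is 8 control bytes packed little-endian into a `u64`.

```rust
/// Hypothetical helper (not this repo's API): returns a mask with the high
/// bit set in every byte lane of `group` whose control byte equals `tag`.
///
/// This is the classic SWAR "find a zero byte" trick. It can report a false
/// positive in a lane directly above a true match (borrow propagation), so a
/// caller would still need to compare the full key for each candidate.
fn match_byte_generic(group: [u8; 8], tag: u8) -> u64 {
    const LSB: u64 = 0x0101_0101_0101_0101; // 0x01 in every lane
    const MSB: u64 = 0x8080_8080_8080_8080; // 0x80 in every lane

    let word = u64::from_le_bytes(group);
    // Lanes equal to `tag` become 0x00 after the XOR.
    let cmp = word ^ (LSB * u64::from(tag));
    // A lane's high bit survives only if the lane was 0x00 (or a borrow hit it).
    cmp.wrapping_sub(LSB) & !cmp & MSB
}

/// Hypothetical helper: turns the mask into lane indices, lowest lane first.
fn matching_lanes(mask: u64) -> Vec<usize> {
    let mut remaining = mask;
    let mut lanes = Vec::new();
    while remaining != 0 {
        lanes.push((remaining.trailing_zeros() / 8) as usize);
        remaining &= remaining - 1; // clear the lowest set bit
    }
    lanes
}
```

For example, `matching_lanes(match_byte_generic(group, tag))` gives the candidate slots to probe within one group. The whole query is a handful of single-cycle integer operations, which may be part of why a NEON version struggles to win on some cores.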