SIMD optimization of iteration
It would be interesting to look at SIMD for optimizing BitIter. I'm unsure whether it would be worth it with extensions like AVX-512, since those can lower the CPU clock frequency, but SSE2 is probably worth exploring and benchmarking.
Some resources for that: https://doc.rust-lang.org/stable/std/macro.is_x86_feature_detected.html and https://github.com/AdamNiederer/faster
Do you mean that we would provide a simd_iter version of BitIter, which would construct vectors of the indices that are set, so that SIMD operations can be done on them?
Yes, or even modifying the BitIter implementation (falling back to the current if no SSE2 exists on the system).
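To make the fallback idea concrete, here is a minimal sketch of runtime dispatch using `is_x86_feature_detected!`. The function names (`first_nonzero`, `first_nonzero_sse2`, `first_nonzero_scalar`) and the choice of `u64` words are assumptions for illustration only, not the crate's actual internals. It shows just one possible use of SSE2, skipping all-zero storage two words at a time, and would need benchmarking before it could be called a win.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Portable fallback: index of the first non-zero word, if any.
fn first_nonzero_scalar(words: &[u64]) -> Option<usize> {
    words.iter().position(|&w| w != 0)
}

/// SSE2 variant (hypothetical): tests two 64-bit words per 128-bit load.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn first_nonzero_sse2(words: &[u64]) -> Option<usize> {
    let mut i = 0;
    while i + 2 <= words.len() {
        // SAFETY: i + 2 <= words.len(), so the unaligned 16-byte load is in bounds.
        let mask = unsafe {
            let v = _mm_loadu_si128(words.as_ptr().add(i) as *const __m128i);
            // One mask bit per byte; all 16 bits are set only if every byte is zero.
            _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()))
        };
        if mask != 0xFFFF {
            return if words[i] != 0 { Some(i) } else { Some(i + 1) };
        }
        i += 2;
    }
    // At most one trailing word is left; check it with scalar code.
    words[i..].iter().position(|&w| w != 0).map(|p| p + i)
}

/// Picks the SSE2 path when the CPU reports support, else the scalar path.
pub fn first_nonzero(words: &[u64]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: SSE2 support was just confirmed at runtime.
            return unsafe { first_nonzero_sse2(words) };
        }
    }
    first_nonzero_scalar(words)
}
```

Callers only ever see `first_nonzero`, so the public API stays the same whether or not the SIMD path is taken. On non-x86_64 targets the SSE2 code is compiled out and the scalar path is always used.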
I think modifying the current implementation would be better. Of course, we should only merge this if it actually gives us performance gains, especially on larger examples, since some SIMD instruction sets (AVX-512 in particular) tend to lower the CPU clock frequency, which hurts performance.
But iterator and simd_iterator have completely different APIs, so I don't see how modifying BitIter would be an option.
Oh, I didn't mean implementing simd_iterator. I just meant keeping some more state in BitIter so that one iteration can process multiple indices at once, which we then hand out on the following iterations, or something similar.
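A rough sketch of that "more state" idea, keeping the ordinary `Iterator` API: the iterator decodes every set bit of one storage word into a small buffer and then drains the buffer across subsequent `next()` calls. `BatchedBitIter` and its fields are hypothetical stand-ins, since the real BitIter layout is not shown in this thread, and the decode loop here is plain scalar code. The `refill` step is where a SIMD decode could be swapped in and benchmarked.

```rust
/// Hypothetical stand-in for BitIter, iterating the set bits of `u64` words.
pub struct BatchedBitIter<'a> {
    words: &'a [u64],
    next_word: usize,  // index of the next word to decode
    buf: [usize; 64],  // indices decoded from the most recent non-zero word
    len: usize,        // number of valid entries in `buf`
    pos: usize,        // next entry of `buf` to hand out
}

impl<'a> BatchedBitIter<'a> {
    pub fn new(words: &'a [u64]) -> Self {
        Self { words, next_word: 0, buf: [0; 64], len: 0, pos: 0 }
    }

    /// Decodes the next non-zero word into `buf`; returns false when none is left.
    fn refill(&mut self) -> bool {
        while self.next_word < self.words.len() {
            let base = self.next_word * 64;
            let mut w = self.words[self.next_word];
            self.next_word += 1;
            self.len = 0;
            self.pos = 0;
            while w != 0 {
                self.buf[self.len] = base + w.trailing_zeros() as usize;
                self.len += 1;
                w &= w - 1; // clear the lowest set bit
            }
            if self.len > 0 {
                return true;
            }
        }
        false
    }
}

impl<'a> Iterator for BatchedBitIter<'a> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        // Refill only when the buffer has been fully drained.
        if self.pos == self.len && !self.refill() {
            return None;
        }
        let idx = self.buf[self.pos];
        self.pos += 1;
        Some(idx)
    }
}
```

Because the batching is hidden behind `next()`, existing callers would not need to change. Whether the extra buffer pays for itself is exactly the kind of question the benchmarks should answer.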