ispc
Possible improvement for packed_load_active/packed_load_active2 on AVX2
This StackOverflow article describes an algorithm that could be used for packed_load_active/packed_load_active2 on AVX2. The algorithm produces its result in a vector register, so the result still has to be written out with a masked store, and the number of stored elements has to be counted separately. Even with that extra work, it may still be a better algorithm than what we have now.
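To make the discussion concrete, here is a minimal sketch of one well-known AVX2 left-packing technique (BMI2 `pdep`/`pext` to build a permutation index, then `vpermd`), followed by the masked store and separate element count mentioned above. This is an assumption about what the linked algorithm looks like, not a copy of it and not ispc's implementation. The names `left_pack_avx2`, `packed_store_avx2`, `packed_store_scalar` and `packed_store` are hypothetical, the mask is assumed to be an 8-bit lane mask (bit i = lane i), and the code assumes GCC or Clang on x86-64 with a scalar fallback elsewhere.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_PATH 1

/* Hypothetical helper: move the lanes of src whose mask bit is set
 * to the low lanes of the result, preserving their order. */
__attribute__((target("avx2,bmi2")))
static __m256i left_pack_avx2(__m256i src, unsigned mask)
{
    /* Spread each mask bit into its own byte, then widen 0x01 to 0xFF. */
    uint64_t byte_mask = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFF;
    /* Keep only the indices of active lanes, packed into the low bytes. */
    uint64_t packed_idx = _pext_u64(0x0706050403020100ULL, byte_mask);
    /* Widen the byte indices to 32-bit lanes for vpermd. */
    __m256i idx = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128((long long)packed_idx));
    return _mm256_permutevar8x32_epi32(src, idx);
}

/* Hypothetical helper: pack, store only the packed lanes, return the count. */
__attribute__((target("avx2,bmi2")))
static int packed_store_avx2(int32_t *dst, const int32_t *src, unsigned mask)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m256i packed = left_pack_avx2(v, mask);
    /* The element count is computed separately from the packing. */
    int count = __builtin_popcount(mask & 0xFFu);
    /* Store mask: lanes 0..count-1 are written, the rest are left alone. */
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i store_mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(count), lane);
    _mm256_maskstore_epi32((int *)dst, store_mask, packed);
    return count;
}
#endif

/* Scalar reference with the same semantics, used as a fallback. */
static int packed_store_scalar(int32_t *dst, const int32_t *src, unsigned mask)
{
    int count = 0;
    for (int i = 0; i < 8; ++i)
        if (mask & (1u << i))
            dst[count++] = src[i];
    return count;
}

/* Dispatch at run time so the sketch also runs on CPUs without AVX2/BMI2. */
static int packed_store(int32_t *dst, const int32_t *src, unsigned mask)
{
#ifdef HAVE_X86_PATH
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("bmi2"))
        return packed_store_avx2(dst, src, mask);
#endif
    return packed_store_scalar(dst, src, mask);
}
```

Calling `packed_store(dst, src, 0xA5)` writes lanes 0, 2, 5 and 7 of `src` to `dst[0..3]` and returns 4. One thing to keep in mind when benchmarking: `pdep`/`pext` are microcoded and slow on AMD Zen 1 and Zen 2, so the result may differ a lot between CPUs.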
This needs experiments to understand the actual performance impact.
For AVX512 we already use the vcompressd instruction, so no improvement is needed there.