SIMD performance comparison
Hi, it is really useful to see how SIMD works for 4-bit PQ in simulate_kernels_PQ4, but I'm wondering why the first attempt is worse than the scann implementation, since the first attempt seems to have fewer loops. Also, is there a real performance comparison rather than this simulation?
I did the comparison (in C, of course) and the first attempt was much slower than the code layout used in scann.
Another question: why do we have such a code layout (0, 8, 1, 9, ...)? Does the order affect the efficiency?
Why don't we put 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 in the first 4 bits of each byte in one 128-bit register?
@mdouze Do you know the reason?
I also got the same result: it is slower with SIMD.
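On the layout question above, here is a minimal pure-Python sketch of the 16-entry table lookup that a 128-bit byte shuffle (pshufb) performs, assuming two 4-bit codes are packed per byte. It is not the faiss or scann kernel; all names (`shuffle_epi8`, `lookup_packed`, `distances`) and the toy codes and tables are made up for illustration. It only shows that the lookup itself is independent of the order in which vectors are assigned to byte lanes, so any speed difference between 0, 1, 2, ... and 0, 8, 1, 9, ... would have to come from the steps around the lookup (loading, widening, accumulating), which this sketch does not model.

```python
def shuffle_epi8(table, idx):
    """Scalar emulation of a 128-bit byte shuffle: out[i] = table[idx[i] & 15],
    or 0 when the high bit of idx[i] is set."""
    assert len(table) == 16 and len(idx) == 16
    return [0 if (i & 0x80) else table[i & 0x0F] for i in idx]


def pack_nibbles(lo_codes, hi_codes):
    """Pack two 4-bit codes into each byte: low nibble and high nibble."""
    return [(h << 4) | l for l, h in zip(lo_codes, hi_codes)]


def lookup_packed(packed, lut_lo, lut_hi):
    """Split each byte into its two nibbles, look each one up in its own
    16-entry table, and add the two results per lane.
    (Python ints do not overflow; a real kernel must widen before adding.)"""
    lo = [b & 0x0F for b in packed]
    hi = [b >> 4 for b in packed]
    d_lo = shuffle_epi8(lut_lo, lo)
    d_hi = shuffle_epi8(lut_hi, hi)
    return [a + b for a, b in zip(d_lo, d_hi)]


# Toy data (hypothetical): 16 database vectors, 2 sub-quantizers of 4 bits each.
codes0 = [(3 * v + 1) % 16 for v in range(16)]
codes1 = [(5 * v + 2) % 16 for v in range(16)]
lut0 = [i for i in range(16)]       # distance table for sub-quantizer 0
lut1 = [2 * i for i in range(16)]   # distance table for sub-quantizer 1

SEQUENTIAL = list(range(16))
INTERLEAVED = [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]


def distances(order):
    """Place vector order[k] in byte lane k, run the lookup, then scatter
    the per-lane results back to vector order."""
    packed = pack_nibbles([codes0[v] for v in order],
                          [codes1[v] for v in order])
    per_lane = lookup_packed(packed, lut0, lut1)
    out = [0] * 16
    for lane, v in enumerate(order):
        out[v] = per_lane[lane]
    return out


# Both layouts give the same distance for every vector.
assert distances(SEQUENTIAL) == distances(INTERLEAVED)
```

Since both layouts produce identical per-vector results here, a real timing comparison would still need the actual C kernels, as requested above.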