lance
implement sse based argmin
Description WIP
We got a 5% speedup on sift1m from this single callsite.
TODO: convert each argmin callsite to SSE and check the benchmark.
AVX is too wide and isn't faster, at least for this callsite.
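For readers following along, here is a minimal sketch of what a 4-lane SSE argmin over `f32` can look like. It is not the code in this PR: the names `argmin_scalar` and `argmin_sse` are hypothetical, NaN inputs are not handled, and indices are assumed to fit in `i32`. On non-x86_64 targets the sketch simply falls back to the scalar loop.

```rust
/// Scalar reference: index of the first smallest element (NaN not handled).
fn argmin_scalar(values: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in values.iter().enumerate() {
        match best {
            Some((_, b)) if !(v < b) => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}

/// Hypothetical SSE argmin: keeps a running minimum and its index per lane,
/// then reduces the 4 lanes and the scalar tail.
#[cfg(target_arch = "x86_64")]
fn argmin_sse(values: &[f32]) -> Option<usize> {
    use std::arch::x86_64::*;
    let n = values.len();
    if n < 4 {
        return argmin_scalar(values);
    }
    let mut lane_val = [0.0f32; 4];
    let mut lane_idx = [0i32; 4];
    // SAFETY: SSE2 is part of the x86_64 baseline, and every load reads
    // 4 floats starting at an offset <= n - 4.
    unsafe {
        let mut min_v = _mm_loadu_ps(values.as_ptr());
        let mut min_i = _mm_setr_epi32(0, 1, 2, 3);
        let mut cur_i = min_i;
        let step = _mm_set1_epi32(4);
        let mut off = 4;
        while off + 4 <= n {
            let v = _mm_loadu_ps(values.as_ptr().add(off));
            cur_i = _mm_add_epi32(cur_i, step);
            // All-ones in lanes where the new value is strictly smaller.
            let lt = _mm_castps_si128(_mm_cmplt_ps(v, min_v));
            // Select the new index where lt is set, else keep the old one.
            min_i = _mm_or_si128(_mm_and_si128(lt, cur_i), _mm_andnot_si128(lt, min_i));
            min_v = _mm_min_ps(v, min_v);
            off += 4;
        }
        _mm_storeu_ps(lane_val.as_mut_ptr(), min_v);
        _mm_storeu_si128(lane_idx.as_mut_ptr() as *mut __m128i, min_i);
    }
    // Horizontal reduction; ties go to the smaller index.
    let mut best = 0usize;
    for l in 1..4 {
        if lane_val[l] < lane_val[best]
            || (lane_val[l] == lane_val[best] && lane_idx[l] < lane_idx[best])
        {
            best = l;
        }
    }
    let mut best_i = lane_idx[best] as usize;
    let mut best_v = lane_val[best];
    // Scalar tail for the last n % 4 elements.
    for i in (n / 4 * 4)..n {
        if values[i] < best_v {
            best_v = values[i];
            best_i = i;
        }
    }
    Some(best_i)
}

/// Fallback so the sketch still compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn argmin_sse(values: &[f32]) -> Option<usize> {
    argmin_scalar(values)
}
```

With only a handful of elements per call, most of the time goes to the final lane reduction and the tail, which may be part of why a wider register does not pay off here.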
Is it because this is just a straight scan over memory?
How applicable is https://en.algorithmica.org/hpc/algorithms/argmin/ ?
I haven't profiled it, but I think it's mostly because our data size is small when calling argmin.
The last algo is interesting and probably worth trying.
I don't think unrolling would help here, because we don't actually have that many elements when we call argmin.
Another idea I think is worth exploring is fusing the distance function and argmin into one routine.
What kind of improvement do you expect from it? It mainly saves memory stores/loads, iiuc?
It is just that this 5% seems a bit small to justify SIMD :(
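To make the fusion idea concrete, here is a scalar sketch, assuming squared L2 distance and centroids stored as one flat row-major slice. The name `nearest_centroid` and its signature are hypothetical, not an existing lance API. The running minimum is tracked while each distance is computed, so the intermediate distance array is never written or read back.

```rust
/// Hypothetical fused routine: returns (index, squared L2 distance) of the
/// centroid closest to `query`. `centroids` holds rows of length `dim`.
fn nearest_centroid(query: &[f32], centroids: &[f32], dim: usize) -> Option<(usize, f32)> {
    if dim == 0 || query.len() != dim {
        return None;
    }
    let mut best: Option<(usize, f32)> = None;
    for (i, c) in centroids.chunks_exact(dim).enumerate() {
        // Distance for this centroid, computed in registers.
        let d: f32 = query.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum();
        // Argmin update happens immediately; nothing is stored per centroid.
        if best.map_or(true, |(_, bd)| d < bd) {
            best = Some((i, d));
        }
    }
    best
}
```

Whether this beats a separate distance pass followed by argmin would have to be measured; the sketch only shows where the store and reload disappear.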
I was expecting something like 15%, since argmin is 20+% of total index build time. Let me migrate more callsites and see how it goes.
To get more perf, I think we might have to find a way to reduce how much we need to compute in kmeans.