implement sse based argmin

Open chebbyChefNEQ opened this issue 1 year ago • 4 comments

Description WIP

We got a 5% speedup on sift1m from this single call site.

TODO: convert each argmin call site to SSE and check the benchmark.

AVX is too wide and isn't faster, at least for this call site.
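For readers unfamiliar with the technique, here is a rough sketch of what a 4-lane SSE argmin over `f32` values could look like in Rust. It is not the code from this change: `argmin_sse` and `argmin_scalar` are hypothetical names, and the sketch assumes an x86_64 target (where SSE2 is part of the baseline), inputs without NaN, and slices short enough that indices fit in an `i32`.

```rust
/// Scalar reference: index of the first minimum, or None for an empty slice.
fn argmin_scalar(values: &[f32]) -> Option<usize> {
    if values.is_empty() {
        return None;
    }
    let mut best = 0;
    for i in 1..values.len() {
        if values[i] < values[best] {
            best = i;
        }
    }
    Some(best)
}

/// Hypothetical SSE argmin: keeps a running minimum and its index per lane,
/// then reduces the 4 lanes and handles the leftover tail with scalar code.
#[cfg(target_arch = "x86_64")]
fn argmin_sse(values: &[f32]) -> Option<usize> {
    use std::arch::x86_64::*;

    if values.is_empty() {
        return None;
    }
    let chunks = values.len() / 4;
    let mut best = 0usize;
    let mut best_val = values[0];

    if chunks > 0 {
        let mut vals = [0f32; 4];
        let mut idxs = [0i32; 4];
        // SAFETY: SSE2 is always available on x86_64, and every load reads
        // 4 floats starting at c * 4 with c < chunks, so it stays in bounds.
        unsafe {
            let mut min_vals = _mm_loadu_ps(values.as_ptr());
            let mut min_idx = _mm_setr_epi32(0, 1, 2, 3);
            let mut cur_idx = min_idx;
            let step = _mm_set1_epi32(4);
            for c in 1..chunks {
                let v = _mm_loadu_ps(values.as_ptr().add(c * 4));
                cur_idx = _mm_add_epi32(cur_idx, step);
                // Lanes where the new value is strictly smaller.
                let lt = _mm_castps_si128(_mm_cmplt_ps(v, min_vals));
                // Select cur_idx where lt is set, otherwise keep min_idx.
                min_idx = _mm_or_si128(
                    _mm_and_si128(lt, cur_idx),
                    _mm_andnot_si128(lt, min_idx),
                );
                min_vals = _mm_min_ps(v, min_vals);
            }
            _mm_storeu_ps(vals.as_mut_ptr(), min_vals);
            _mm_storeu_si128(idxs.as_mut_ptr() as *mut __m128i, min_idx);
        }
        // Horizontal reduction; ties go to the smaller index.
        best = idxs[0] as usize;
        best_val = vals[0];
        for lane in 1..4 {
            let (v, i) = (vals[lane], idxs[lane] as usize);
            if v < best_val || (v == best_val && i < best) {
                best = i;
                best_val = v;
            }
        }
    }

    // Scalar tail for the last len % 4 elements.
    for i in chunks * 4..values.len() {
        if values[i] < best_val {
            best = i;
            best_val = values[i];
        }
    }
    Some(best)
}

/// Fallback so the sketch still compiles on other targets.
#[cfg(not(target_arch = "x86_64"))]
fn argmin_sse(values: &[f32]) -> Option<usize> {
    argmin_scalar(values)
}
```

Both functions return the index of the first minimum, so `argmin_scalar` can serve as a correctness check for the SIMD path in a benchmark or test.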

chebbyChefNEQ • Jun 12 '23 04:06

Is it because this is just a straight scan over memory?

How applicable is https://en.algorithmica.org/hpc/algorithms/argmin/ ?

eddyxu • Jun 12 '23 04:06

> Is it because this is just a straight scan over memory?
>
> How applicable is https://en.algorithmica.org/hpc/algorithms/argmin/ ?

I haven't profiled it, but I think it's mostly because our data size is small when calling argmin.

The last algo is interesting and probably worth trying.

I don't think unrolling would help here, because we don't actually have that many elements when we call argmin.

Another idea I think is worth exploring is fusing the distance function and argmin into one routine.
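To make the fusion idea concrete, here is a minimal scalar sketch, assuming squared L2 distance and centroids stored as a flat row-major `f32` slice with `dim > 0`. The names `l2_sq` and `nearest_fused` are hypothetical and not lance's API. The point is only that the running minimum is updated as each distance is produced, so no intermediate distance buffer is written and read back.

```rust
/// Squared L2 distance between two equal-length vectors.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical fused routine: computes each distance and folds it into the
/// running argmin immediately, instead of filling a distance array first.
/// Returns the index of the nearest centroid and its squared distance.
/// Assumes dim > 0 (chunks_exact panics on a chunk size of 0).
fn nearest_fused(query: &[f32], centroids: &[f32], dim: usize) -> Option<(usize, f32)> {
    let mut best: Option<(usize, f32)> = None;
    for (i, centroid) in centroids.chunks_exact(dim).enumerate() {
        let d = l2_sq(query, centroid);
        // Strict comparison keeps the first centroid on ties.
        if best.map_or(true, |(_, b)| d < b) {
            best = Some((i, d));
        }
    }
    best
}
```

The unfused alternative would collect all distances into a `Vec<f32>` and then call argmin on it; whether removing that store and reload pays off here would need to be measured.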

chebbyChefNEQ • Jun 12 '23 05:06

> Another idea I think is worth exploring is fusing the distance function and argmin into one routine.

What kind of improvement do you expect from it? So it mainly saves the memory store/load, IIUC?

It is just that this 5% seems a bit small to justify SIMD :(

eddyxu • Jun 12 '23 05:06

> What kind of improvement do you expect from it? So it mainly saves the memory store/load, IIUC?
>
> It is just that this 5% seems a bit small to justify SIMD :(

I was expecting something like 15%, since argmin is 20+% of total index build time. Let me migrate more call sites and see how it goes.
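One way to read the 15% figure is as an Amdahl's-law estimate. The symbols below are introduced only for illustration: $f$ is the fraction of build time spent in argmin and $s$ is the speedup of argmin itself. Taking $f = 0.2$ from the comment above and assuming an ideal $s = 4$ (one SSE register holds four `f32` lanes; this is an assumption, not a measured number):

$$
T_{\text{new}} = (1 - f) + \frac{f}{s} = 0.8 + \frac{0.2}{4} = 0.85
$$

That is a 15% reduction in total build time under these assumptions, which would be an upper bound for converting every argmin call site; the 5% observed so far comes from a single call site.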

To get more perf, I think we might have to find a way to reduce how much we need to compute in kmeans.

chebbyChefNEQ • Jun 12 '23 06:06