Upgrade IVFFlatScanner and exhaustive_L2sqr_seq
Speeds up IVFFlatScanner by helping the branch predictor and computing distances for 4 elements per loop.
Same upgrade for exhaustive_inner_product_seq() and exhaustive_L2sqr_seq() calls.
@mdouze Matthijs, could you please check whether this test failure is critical. It seems that it is about epsilons on certain architectures. On the other hand, it is seems that it is the code for dedup-ing.
@mdouze , I've upgraded the code, please take a look, whether it is simple enough or whether it needs to simplified further
OK the current version is complex but less repetitive. I'll import to fbcode.
Weird that the errors still appear. I thought I relaxed these tests. Did you rebase on latest Faiss?
@mdouze has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Do you have numbers (even very rough) that show where this modification speeds up search?
@mdouze , my mistake, let me rebase and run the test
I've used benchs/bench_ivf_selector.cpp, fixed its errors (missing #include <faiss/impl/FaissAssert.h>) and made sure that cmake version compiles the code for AVX2, and used IVF1024,Flat index (instead of a default IVF1024,SQ8).
Two cases: d=64 and d=768. I'm not sure whether the results are super reliable for d=64 (I'm running one on a VM), but it seems that it became a bit slower, while for d=768 the speedup is visible and reproducible.
I think that I either need to add some threshold for d (say, starting from d=512, otherwise the new version is plain slower), or make it optional, or check the code further. So, please don't land this diff yet.
d=64, Baseline:
index_key=IVF1024,Flat
Read /tmp/bench_ivf_selector_IVF1024,Flat.faissindex
parallel_mode=0
search time, no selector: 89.960 ms
search time with nullptr selector: 247.900 ms
search time with selector: 244.995 ms
search time with null selector + manual parallel: 246.943 ms
parallel_mode=3
search time, no selector: 85.567 ms
search time with nullptr selector: 85.795 ms
search time with selector: 85.635 ms
search time with null selector + manual parallel: 87.590 ms
set single thread
parallel_mode=0
search time, no selector: 259.159 ms
search time with nullptr selector: 226.071 ms
search time with selector: 221.616 ms
search time with null selector + manual parallel: 221.348 ms
d=64, Candidate:
index_key=IVF1024,Flat
Read /tmp/bench_ivf_selector_IVF1024,Flat.faissindex
parallel_mode=0
search time, no selector: 93.027 ms
search time with nullptr selector: 252.398 ms
search time with selector: 247.649 ms
search time with null selector + manual parallel: 248.685 ms
parallel_mode=3
search time, no selector: 88.478 ms
search time with nullptr selector: 87.638 ms
search time with selector: 89.603 ms
search time with null selector + manual parallel: 91.405 ms
set single thread
parallel_mode=0
search time, no selector: 254.079 ms
search time with nullptr selector: 218.183 ms
search time with selector: 229.076 ms
search time with null selector + manual parallel: 217.755 ms
d=768, Baseline:
index_key=IVF1024,Flat
Read /tmp/0bench_ivf_selector_IVF1024,Flat.faissindex
parallel_mode=0
search time, no selector: 938.073 ms
search time with nullptr selector: 2196.695 ms
search time with selector: 2214.705 ms
search time with null selector + manual parallel: 2249.250 ms
parallel_mode=3
search time, no selector: 928.345 ms
search time with nullptr selector: 930.376 ms
search time with selector: 937.659 ms
search time with null selector + manual parallel: 934.082 ms
set single thread
parallel_mode=0
search time, no selector: 2281.896 ms
search time with nullptr selector: 2211.583 ms
search time with selector: 2216.971 ms
search time with null selector + manual parallel: 2221.527 ms
d=768, Candidate:
index_key=IVF1024,Flat
Read /tmp/0bench_ivf_selector_IVF1024,Flat.faissindex
parallel_mode=0
search time, no selector: 893.283 ms
search time with nullptr selector: 1847.295 ms
search time with selector: 1873.944 ms
search time with null selector + manual parallel: 1932.494 ms
parallel_mode=3
search time, no selector: 902.230 ms
search time with nullptr selector: 889.027 ms
search time with selector: 891.625 ms
search time with null selector + manual parallel: 901.839 ms
set single thread
parallel_mode=0
search time, no selector: 1885.094 ms
search time with nullptr selector: 1861.240 ms
search time with selector: 1846.442 ms
search time with null selector + manual parallel: 1849.544 ms
@mdouze , additionally, I will modify fvec_L2sqr_by_idx() and fvec_inner_products_by_idx() in the same way, because it helps IndexRefineFlat
I'm still playing with it
What is the current status of this PR?
@mdouze Technically, it works and I can rebase it on the master branch. But this diff improves the performance for large dim datasets, but may degrade the performance for smaller dim datasets. I've enabled it in Zilliz version of faiss, because it mostly operates with dim like 512+. But I'm not sure for the general purpose case.
Is the benchmarking set that Gergely works on ready to be tried?
hi @alexanderguzhva , there are some conflicts which would you like to resolve them first?
@junjieqi I'll need to reiterate this diff, because it improves the performance in certain cases and hurts in others.
Let's consider this diff is a draft for now.