1-recall constant for independent indexes.

Open OmniscienceAcademy opened this issue 2 years ago • 2 comments

Hi, thank you so much for this library. I am currently building an academic search engine, and faiss is one of its core technologies. You can see the project here if you are curious to know how your library is used (https://omniscience.academy/). Currently, the index accuracy is not satisfactory, so I'm trying to improve this metric a bit.

We are somewhat budget-constrained, which is why I am using only memory-mappable indices. I've tried several index types, including the three below. What I do not understand is why the 1-recall is constant across them.

OPQ256_1024,IVF65536_HNSW32,PQ256x8 (30 GB) { "size in bytes": 36352330248, "avg_search_speed_ms": 209.1428671798591, "99p_search_speed_ms": 309.3958507361822, "reconstruction error %": 6.688470393419266, "nb vectors": 136595995, "vectors dimension": 768, "compression ratio": 11.543218654135286, "1-recall@20": 0.802, "1-recall@40": 0.824, "20-recall@20": 0.716499999999999, "40-recall@40": 0.7253000000000002 }

OPQ768_768,IVF262144_HNSW32,PQ768x4fsr (50 GB) { "size in bytes": 54426777864, "avg_search_speed_ms": 189.02690272848562, "99p_search_speed_ms": 282.19754666089995, "reconstruction error %": 4.380467906594276, "nb vectors": 136595995, "vectors dimension": 768, "compression ratio": 7.709861085080971, "1-recall@20": 0.802, "1-recall@40": 0.824, "20-recall@20": 0.7159999999999991, "40-recall@40": 0.7237000000000001 }

OPQ768_768,IVF262144_HNSW32,PQ768x8 (100 GB) { "size in bytes": 106880377224, "avg_search_speed_ms": 111.62540741627001, "99p_search_speed_ms": 284.8180585540831, "reconstruction error %": 0.6758840288966894, "nb vectors": 136595995, "vectors dimension": 768, "compression ratio": 3.9260985743019408, "1-recall@20": 0.802, "1-recall@40": 0.824, "20-recall@20": 0.7667999999999987, "40-recall@40": 0.77405 }

The 1-recall is suspiciously constant: 0.802 at 20 and 0.824 at 40 for all three indexes. It looks like a bug, but I use autofaiss to produce these metrics and the other numbers do vary, so I doubt the measurement itself is broken.

Furthermore, when I vary nprobe, the 20-recall approaches 0.82 asymptotically, but the 1-recall does not move at all.

Do you have an idea of what is happening?

OmniscienceAcademy, May 08 '22 13:05

A first observation: as nprobe increases, recall saturates at the accuracy that the PQ encoding itself allows, so the plateau you see when varying nprobe makes sense.
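The saturation effect can be reproduced on a toy problem. The following is a minimal NumPy-only sketch, not faiss code: the coarse centroids are random database points, and rounding to a 0.5 grid stands in for a lossy codec such as PQ. All names (`xb_lossy`, `recall_at_1`, `ceiling`) are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, nq, nlist = 8, 2000, 100, 16
xb = rng.normal(size=(n, d)).astype("float32")   # database vectors
xq = rng.normal(size=(nq, d)).astype("float32")  # query vectors

def sqdist(a, b):
    # Pairwise squared L2 distances, shape (len(a), len(b))
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)

# Toy inverted file: random database points act as coarse centroids
centroids = xb[rng.choice(n, nlist, replace=False)]
assign = sqdist(xb, centroids).argmin(1)

# Toy lossy codec (assumption: stands in for PQ, it is not PQ)
xb_lossy = np.round(xb * 2) / 2

gt = sqdist(xq, xb).argmin(1)        # exact nearest neighbor per query
d_lossy = sqdist(xq, xb_lossy)       # distances computed on the lossy codes
# Recall of an exhaustive search over the lossy codes: the level nprobe saturates at
ceiling = float(np.mean(d_lossy.argmin(1) == gt))

cell_order = sqdist(xq, centroids).argsort(1)  # cells sorted by distance to each query

def recall_at_1(nprobe):
    hits = 0
    for i in range(nq):
        # Only vectors in the nprobe closest cells are candidates
        probed = np.isin(assign, cell_order[i, :nprobe])
        dist = np.where(probed, d_lossy[i], np.inf)
        hits += int(dist.argmin() == gt[i])
    return hits / nq

recalls = {p: recall_at_1(p) for p in (1, 2, 4, 8, nlist)}
print(recalls, "ceiling:", ceiling)
```

Once every cell is probed, the search is exhaustive over the lossy codes, so `recalls[nlist]` equals `ceiling`. Raising nprobe further cannot help; only a more accurate codec moves that limit.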

Your PQ codes are very large, so if the input vectors are reasonably sized and the distribution is not too weird, they should yield perfect precision. Therefore, my guess is that there are duplicate vectors in the database: when the search returns a duplicate of the true nearest neighbor under a different id, it is counted as a negative.

See also https://github.com/facebookresearch/faiss/wiki/FAQ#analyzing-accuracy-issues-with-indexivfpq
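To illustrate the duplicate hypothesis, here is a small NumPy sketch under invented assumptions: every vector is stored twice, the ground truth keeps the lowest id among ties, and a hypothetical index returns the highest one. It compares an id-based 1-recall with a distance-based one and counts duplicates with `np.unique`. None of this is autofaiss or faiss code.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 8)).astype("float32")
db = np.concatenate([base, base])   # ids i and i + 200 hold the same vector
queries = base[:50] + 0.01 * rng.normal(size=(50, 8)).astype("float32")

def nearest(db, q):
    """Exact 1-NN by brute force; ties go to the lowest id."""
    dist = ((q[:, None, :] - db[None, :, :]) ** 2).sum(-1)
    ids = dist.argmin(axis=1)
    return dist[np.arange(len(q)), ids], ids

# Ground truth keeps the lowest id among equidistant vectors
gt_dist, gt_ids = nearest(db, queries)

# Hypothetical index output: the same search, but ties go to the highest id
rev_dist, rev_ids = nearest(db[::-1], queries)
res_dist, res_ids = rev_dist, len(db) - 1 - rev_ids

# Id-based 1-recall: a returned duplicate of the true neighbor counts as a miss
id_recall = float(np.mean(res_ids == gt_ids))
# Distance-based 1-recall: a result as close as the true neighbor counts as a hit
dist_recall = float(np.mean(res_dist <= gt_dist + 1e-6))
# Number of rows that repeat an earlier row exactly
n_dup = len(db) - len(np.unique(db, axis=0))

print(id_recall, dist_recall, n_dup)
```

In this extreme case the id-based recall drops to zero while the distance-based recall stays at one, even though the search itself is exact. With only a fraction of the database duplicated, the same effect would appear as a fixed ceiling on 1-recall that is independent of the index type.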

mdouze, May 09 '22 07:05

Thank you. This is certainly the reason.

OmniscienceAcademy, May 10 '22 21:05