hnswlib
Best parameters for 40 million embeddings
Hi Team @piem @fabiencastan @groodt @2ooom @vinnitu @yurymalkov ,
We need to find the best match in a gallery of about 40 million embeddings (embedding size 128) with the best possible performance and accuracy. Can you please suggest a suitable distance type and suitable ef and M parameters? We are having a hard time figuring these out. We hope your expertise in dealing with huge data can help us refine the parameters and arrive at optimal results. Thanks in advance.
Hi @sujigrena,
The optimal parameters depend on the intrinsic dimensionality of the data, so it is hard to give exact values (unless you have an estimate, e.g. the clustering factor of the k-NN graph).
The distance type depends on the origin of the vectors. If they are the output of a neural network, I would recommend training the network directly on an objective for the chosen distance (by default a neural classifier is trained for inner product; this can be changed to L2 or cosine).
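To make the choice of distance a bit more concrete, here is a small sketch of how the three options relate. It only uses the standard library; the helper names (`normalize`, `inner`, `sq_l2`, `cosine_distance`) are made up for illustration and are not part of hnswlib. For unit-length vectors, squared L2 equals 2 - 2 * (inner product), so L2, inner product, and cosine all rank neighbours the same way. For unnormalized vectors they can disagree, which is why the distance should match what the embeddings were trained for.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def inner(a, b):
    """Inner (dot) product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def sq_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity."""
    return 1.0 - inner(a, b) / math.sqrt(inner(a, a) * inner(b, b))


# Two toy vectors (illustrative values only).
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])

# On unit vectors: ||a - b||^2 = 2 - 2 * <a, b> = 2 * cosine_distance(a, b)
print(sq_l2(a, b), 2 - 2 * inner(a, b), 2 * cosine_distance(a, b))
```

In hnswlib the corresponding `space` values are `'l2'`, `'ip'`, and `'cosine'`.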
I would start with M=16 and set up a benchmark to check the accuracy on a query set. Build an index, find the ef that gives high recall (e.g. 0.95), and set ef_construction to that value. As a rule of thumb, if ef_construction ends up above a thousand, increase M and repeat. Also, please look at https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
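The procedure above could be sketched roughly as follows. This is a minimal sketch, not a tuned benchmark: the random data, the sample sizes, the list of ef candidates, the starting `ef_construction=200`, and the helper names (`recall_at_k`, `smallest_ef_for_recall`, `run_sweep`) are all assumptions for illustration. In practice you would run it on a sample of your real embeddings and real queries, since random Gaussian vectors behave very differently from learned ones. Ground truth here comes from a brute-force NumPy search, which is only feasible on a subsample.

```python
def recall_at_k(found, truth):
    """Fraction of true neighbours recovered, pooled over all queries."""
    hits = sum(len(set(f) & set(t)) for f, t in zip(found, truth))
    total = sum(len(t) for t in truth)
    return hits / total


def smallest_ef_for_recall(search, ef_candidates, truth, target=0.95):
    """Return (ef, recall) for the smallest ef that reaches the target.

    `search(ef)` must return one list of neighbour ids per query.
    If no candidate reaches the target, returns (None, best_recall_seen).
    """
    best = 0.0
    for ef in sorted(ef_candidates):
        recall = recall_at_k(search(ef), truth)
        best = max(best, recall)
        if recall >= target:
            return ef, recall
    return None, best


def run_sweep(n=20000, n_queries=200, dim=128, k=10, M=16, ef_construction=200):
    """Build one index on stand-in data and sweep ef (all sizes are assumptions)."""
    import numpy as np
    import hnswlib

    rng = np.random.default_rng(0)
    data = rng.standard_normal((n, dim)).astype(np.float32)
    queries = rng.standard_normal((n_queries, dim)).astype(np.float32)

    # Exact neighbours by brute force (squared L2), used as ground truth.
    d2 = (queries ** 2).sum(1)[:, None] - 2 * queries @ data.T + (data ** 2).sum(1)[None, :]
    truth = np.argsort(d2, axis=1)[:, :k].tolist()

    index = hnswlib.Index(space="l2", dim=dim)
    index.init_index(max_elements=n, ef_construction=ef_construction, M=M)
    index.add_items(data, np.arange(n))

    def search(ef):
        index.set_ef(ef)  # query-time ef; must be at least k
        labels, _ = index.knn_query(queries, k=k)
        return labels.tolist()

    ef, recall = smallest_ef_for_recall(search, [10, 20, 50, 100, 200, 500, 1000, 2000], truth)
    print(f"M={M}: ef={ef}, recall={recall:.3f}")
    return ef, recall
```

Calling `run_sweep()` builds one index and reports the smallest ef that reaches the target recall. Following the rule of thumb, you would rebuild with `ef_construction` set to that ef, and if it comes out above about a thousand (or no candidate reaches the target), try again with a larger `M`.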
Thank you for the input, @yurymalkov. One question: is there a way, or any shortcut, to arrive at that estimate?