
Question on choosing M and dataset type

Open jianshu93 opened this issue 2 years ago • 4 comments

Hello Team,

I am wondering how to choose M when starting out, or with a new dataset that has never been studied. Should one just pick a value between 16 and 64 and gradually increase it while watching recall and accuracy (normally starting small)? What if the dataset is several million points or larger and the points are highly clustered: what is a reasonable maximum for M, and will a large value, say M=256 or above for 10 million points (assuming we have enough memory), greatly impair search and build performance? Would a pre-clustering step or dimension reduction help for highly clustered or very high-dimensional data? And how can one determine whether a new dataset of a given size and dimension will or will not efficiently benefit from building an HNSW index? I read a very interesting paper on this (https://arxiv.org/abs/1904.02077), but I would like to know your thoughts.

Thanks,

Jianshu

jianshu93 avatar Mar 08 '22 15:03 jianshu93
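
(Editor's note: a minimal sketch of the kind of M sweep the question describes, using hnswlib's Python bindings. The random data, query set, and the specific M/ef values are placeholders, not recommendations.)

```python
# Sketch: sweep M and measure recall@k against brute-force ground truth.
# Dataset and queries are random placeholders; substitute your own vectors.
import numpy as np
import hnswlib

dim, num_elements, num_queries, k = 64, 20_000, 1_000, 10
data = np.random.random((num_elements, dim)).astype(np.float32)
queries = np.random.random((num_queries, dim)).astype(np.float32)

# Brute-force L2 ground truth for the query set.
d2 = (queries ** 2).sum(1)[:, None] - 2 * queries @ data.T + (data ** 2).sum(1)[None, :]
truth = np.argsort(d2, axis=1)[:, :k]

for M in (16, 32, 64):
    index = hnswlib.Index(space='l2', dim=dim)
    index.init_index(max_elements=num_elements, ef_construction=200, M=M)
    index.add_items(data)
    index.set_ef(200)  # search-time ef; must be >= k
    labels, _ = index.knn_query(queries, k=k)
    recall = np.mean([len(set(l) & set(t)) / k for l, t in zip(labels, truth)])
    print(f"M={M}: recall@{k} = {recall:.3f}")
```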

If you want to automate selection of M, I'd build a graph and look at the clustering coefficient, or calculate LID (local intrinsic dimensionality) to infer the parameters, as is done in some papers. In practical scenarios with high-intrinsic-dimensionality data I think M is mostly bounded by memory consumption, so you would also want to consider that. As for whether to use the hierarchical levels or not - my answer is yes, it never hurts and sometimes helps; those points are discussed in the original (TPAMI) paper on HNSW.

yurymalkov avatar Mar 10 '22 06:03 yurymalkov
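
(Editor's note: one common way to estimate LID is the maximum-likelihood estimator from k-nearest-neighbor distances; the sketch below is a rough illustration under that assumption, not code from hnswlib, and the sample size and k are arbitrary.)

```python
# Sketch: MLE estimate of local intrinsic dimensionality (LID) from k-NN distances,
# computed on a random sample of the dataset. Higher mean LID generally suggests a
# harder dataset; no threshold here is prescriptive.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lid_mle(data, k=20, sample_size=2000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=min(sample_size, len(data)), replace=False)
    sample = data[idx]
    # k+1 neighbors because each sampled point's nearest neighbor in `data` is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(data)
    dist, _ = nn.kneighbors(sample)
    dist = dist[:, 1:]                 # drop the zero self-distance
    r_k = dist[:, -1:]                 # distance to the k-th neighbor
    lids = -1.0 / np.mean(np.log(dist / r_k + 1e-12), axis=1)
    return float(np.mean(lids))

# Usage, with `data` as an (n, dim) float array:
# print("mean LID ≈", lid_mle(data))
```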

Hello

Thanks for the response. I am dealing with a biological dataset of 10 million bacteriophage sequences, and the points are highly clustered. Dimension is not a problem: we use MinHash (based on k-mers) to approximate the distance between sequences, which is effectively a dimension-reduction scheme that projects to a small dimension. However, with M=128 and ef=1600 the accuracy is still very low, only about 50%: for half of the queries the best neighbor is not found at all compared to brute-force MinHash, and for the other half the best is found but the second or third best are missed. I tried both pruned and extended candidates, still with no clear improvement. It seems to me that the search gets trapped in a local minimum. I am wondering whether you have seen a similar dataset, and whether M=512 would sound unacceptable (as I said, the data are highly clustered), assuming we have enough memory?

Thanks,

Jianshu

jianshu93 avatar Mar 11 '22 04:03 jianshu93
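
(Editor's note: a small sketch of how the two failure modes described above could be measured separately, i.e. the true best neighbor missed entirely vs. the best found but runners-up missed. It assumes `truth` holds brute-force MinHash neighbor ids and `labels` holds the ids returned by hnswlib's knn_query; both names are placeholders.)

```python
# Sketch: separate "best neighbor missed" from "best found but runners-up missed".
# `truth` and `labels` are (num_queries, k) integer arrays of neighbor ids,
# from brute-force search and from the HNSW index respectively (assumed available).
import numpy as np

def failure_breakdown(truth, labels):
    k = truth.shape[1]
    best_found = np.array([t[0] in set(l) for t, l in zip(truth, labels)])
    recall_at_k = np.array([len(set(t) & set(l)) / k for t, l in zip(truth, labels)])
    print(f"queries where the true best neighbor is missed: {np.mean(~best_found):.1%}")
    print(f"mean recall@{k} on queries where the best was found: "
          f"{recall_at_k[best_found].mean():.3f}")
```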

Hey @jianshu93, you've tried extended candidates (it is only available in the nmslib implementation) and the performance was still low? That should not be the case - extending the candidate neighborhood should handle the clustered case. I wonder if you have many (near-)exact duplicates in the dataset?

yurymalkov avatar Mar 13 '22 07:03 yurymalkov
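
(Editor's note: one rough way to check the (near-)exact duplicate hypothesis is to estimate, on a random sample, how many points have another point at essentially zero distance. A sketch under that assumption, with `data` as an (n, dim) array and `eps` as a problem-specific threshold; not part of hnswlib.)

```python
# Sketch: estimate the fraction of points that have a (near-)exact duplicate,
# using a random sample to keep the check cheap on large datasets.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def duplicate_fraction(data, eps=1e-6, sample_size=5000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=min(sample_size, len(data)), replace=False)
    sample = data[idx]
    nn = NearestNeighbors(n_neighbors=2).fit(data)
    dist, _ = nn.kneighbors(sample)
    # One of the two returned distances is the zero self-distance; the other is the
    # nearest *other* point, so dist[:, 1] <= eps flags a (near-)duplicate.
    return float(np.mean(dist[:, 1] <= eps))

# print("fraction of sampled points with a (near-)duplicate:", duplicate_fraction(data))
```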

@jianshu93 hi, have you solved the problem?

ykang-cool avatar Jul 26 '22 10:07 ykang-cool