kNN search returns fewer than K results
When I experiment with the HNSW indexing method, kNN search returns fewer results than requested. Is this behavior normal? If so, how does the algorithm decide how many neighbors to return? Thanks!
Here is a snippet of the relevant code:
unsigned K = 100; // 100-NN query
KNNQuery<float> knnQ(customSpace, queryObj, K);
index->Search(&knnQ);
Well, currently the method doesn't guarantee returning at least K results. However, for a large enough data set it should return at least K. How many data points do you have?
My dataset has about 150000 points, and I am requesting 100 neighbors. I only see cases where the number of returned results is less than K. Are you saying that it could be greater than K in some cases?
@shubaoliu Hi! What is the value of "ef" parameter?
@shubaoliu I am saying that there's no guarantee yet. However, if getting < K NNs happens on a large data set, we need to investigate.
@yurymalkov @searchivarius Thank you for the reply! Here is the parameter I use:
AnyParams indexParams = AnyParams({"M=100",
"efConstruction=50",
"post=2",
"searchMethod=4"});
index->CreateIndex(indexParams);
AnyParams queryTimeParams = AnyParams({"efSearch=50"});
index->SetQueryTimeParams(queryTimeParams);
Your reply motivated me to change both 'ef' parameters to 'efConstruction=100' and 'efSearch=100', where 100 is the number of neighbors we want to retrieve. Now I get exactly 100 results back.
So what's your suggestion on setting these 'ef' values? And the M value?
@shubaoliu thank you for the details. efConstruction shouldn't really affect this. An increase in efSearch can help, but I don't understand why there's any difference on a relatively large data set.
@shubaoliu If you set the ef values (both efSearch and efConstruction) lower than K (or M in the case of construction), it is quite possible that the algorithm will return fewer than K approximate neighbors. So to get high accuracy it is good practice to set efConstruction>M and efSearch>K, or even efConstruction>>M and efSearch>>K in the case of high-dimensional data.
@searchivarius I suppose that if efConstruction is set too small, index construction can become unstable, leading to weird results.
Thank you for the insights on setting these parameters!
@yurymalkov yes, it's still not very likely for large data sets. Anyways, we should, perhaps, forbid setting efSearch < K. In fact, kgraph does so.
@searchivarius Maybe we should do it at least for efConstruction. For efSearch the problem is perhaps somewhat less acute, since: 1) there is a second queue (and only one queue for efConstruction); 2) it is a run-time parameter.
@yurymalkov can't do it for efConstruction, because K isn't known in advance. Perhaps the easiest solution is to:
1) Fill the result queue with the first K data points.
2) Use max(efSearch, K) instead of efSearch.
Perhaps 1) can be implemented at the framework level (somehow).
Any update on this issue? I have around 2 million entries indexed with the default parameters, and with K=100 I am getting fewer than 100 similar items. Also, I am not sure about the meanings of the parameters that have to be passed in the dict, or how they affect the outcome. Regardless, if we request k similar items, we should get exactly k items. Let us know.
Hi @saj1919, I haven't looked into this yet; perhaps it's time to do so. Can you tell me a bit more about your data and indexing process:
- data dimensionality?
- do you know what recall you get?
- efConstruction ?
- M?
- Do you use post = some number?
I am curious about some more insight on setting these parameters. I frequently query indexes for thousands of results (k ~ 1k-10k), so my instinct is to set efSearch to a high value. Will this fail if I do not also set a very high efConstruction? That might make index construction prohibitively expensive.
Any insight on how to tackle this problem? Is HNSW even the right algo to use if I am more interested in high recall at high k values, but precision is less important?
@phdowling there's no guarantee. But setting efSearch to a higher value usually helps. Increasing M and efConstruction also helps. Setting post to 1 or 2 also helps. However, all these measures also increase indexing time, sometimes quite dramatically.
Regarding your recall question: I understand you to mean the extrinsic recall (the k-NN recall and precision are the same thing if you consider intrinsic recall/precision).
Given my assumption, you may want to retrieve a lot of candidates using some cheap but low-extrinsic-precision distance, right? For high-recall scenarios, HNSW might not be the best tool.
What is good is certainly a difficult research question, but off the top of my head:
- One can sparsify the vectors and use classic inverted files.
- One can use a clustering algorithm (e.g., NMSLIB implements the list_clusters flat-clustering algorithm):
release/experiment -s l2 -i ~/TextCollect/VectorSpaces/sift_texmex_base100krand.txt -m list_clusters -c bucketSize=1000 -t maxLeavesToVisit=1 -t maxLeavesToVisit=10 -t maxLeavesToVisit=50 -k 5000 -b 1 -Q 500 --threadTestQty 4 -D 100000
Also, what sort of data do you use?
I'm still having the same issue, even when using high values of efConstruction, M, efSearch, and post=2.
My data is approximately 250000 x 2600 in size, and for each query point I'm looking to retrieve k = 1000 nearest neighbors. For some of the points, the number of neighbors retrieved is around 80, instead of the 1000 needed.
@alberto-oliveira 1) could you share the data? 2) what's your target retrieval speed, i.e., how fast should it be?