nmslib icon indicating copy to clipboard operation
nmslib copied to clipboard

kNN search returns less than K results

Open shubaoliu opened this issue 9 years ago • 17 comments

When I experiment with the HNSW indexing method, kNN search returns less number of results than requested. Is this behavior normal? If normal, how does the algorithm decide how many neighbors to return? Thanks!

Here is a snippet of the relevant code unsigned K = 100; // 100-NN query KNNQuery<float> knnQ(customSpace, queryObj, K); index->Search(&knnQ);

shubaoliu avatar Feb 20 '17 17:02 shubaoliu

Well, currently the method doesn't guarantee returning at least K results. However, for large enough data set it should be at least K. How many data points do you have?

searchivarius avatar Feb 20 '17 17:02 searchivarius

My dataset has about 150000 points, and I am requesting 100 neighbors. I only see cases that the returned number of values is less than K. You are saying that it could be greater than K in some cases?

shubaoliu avatar Feb 20 '17 17:02 shubaoliu

@shubaoliu Hi! What is the value of "ef" parameter?

yurymalkov avatar Feb 20 '17 17:02 yurymalkov

@shubaoliu I am saying that there's no guarantee yet. However, if getting < K NNs happens on a large data set, we need to investigate.

searchivarius avatar Feb 20 '17 17:02 searchivarius

@yurymalkov @searchivarius Thank you for the reply! Here is the parameter I use:

AnyParams indexParams = AnyParams({"M=100”,
                        "efConstruction=50”,
                        "post=2", 
                        "searchMethod=4"});
index->CreateIndex(indexParams);

AnyParams queryTimeParams = AnyParams({"efSearch =50”});
index->SetQueryTimeParams(queryTimeParams);

Your reply motivates me to change both 'ef' to 'efConstruction=100', 'efSearch=100', where 100 is the number of neighbors we want to retrieve. Now, I get exactly 100 results returned.

So what's your suggestion on setting these 'ef' values? And the M value?

shubaoliu avatar Feb 20 '17 18:02 shubaoliu

@shubaoliu thank you for details. efConstruction shouldn't really affect this. Increase in efSearch can help, but I don't understand why there's any diff on a relatively large data set.

searchivarius avatar Feb 20 '17 18:02 searchivarius

@shubaoliu If you set efs (both Search and Construction) lower than K (or M in case of construction), it is quite possible that the algorithm will return less than K approximate neighbors. So to get high accuracy it is a good practice to set efConstruction>M, efSearch>K. Or even efConstruction>>M, efSearch>>K in case of high dimensionality data.

@searchivarius I suppose, in case efConstruction is set too small, index construction can become unstable, leading to weird results.

yurymalkov avatar Feb 20 '17 18:02 yurymalkov

Thank you for the insights on setting these parameter!

shubaoliu avatar Feb 20 '17 18:02 shubaoliu

@yurymalkov yes, it's still not very likely for large data sets. Anyways, we should, perhaps, forbid setting efSearch < K. In fact, kgraph does so.

searchivarius avatar Feb 20 '17 18:02 searchivarius

@searchivarius Maybe we should do it at least for efConstruction. For efSearch the problem is maybe somewhat less acute, since: 1) there is a second queue(and only one queue for efConstruction); 2) it is a run time parameter.

yurymalkov avatar Feb 21 '17 08:02 yurymalkov

@yurymalkov can't do it for efConstruction, because K isn't known in advance. Perhaps, the easiest solution is to

  1. Fill the result queue with K first data points
  2. Use max(efSearch,K) instead of efSearch. Perhaps, 1) can be implemented at the framework level (somehow).

searchivarius avatar Feb 21 '17 15:02 searchivarius

Any update on this issue ... I have around 2 million entries and trained with default parameters, with K=100, I am getting less than 100 similar items. Also, not sure about meanings of parameters which have to pass in the dict. How they affect the outcome. Regardless, if we need k similar items then we should be getting k items only. Let us know.

saj1919 avatar Feb 22 '18 12:02 saj1919

Hi @saj1919 I haven't looked into this yet, perhaps, it's time to do. Can you tell me a bit more about your data and indexing process:

  1. data dimensionality?
  2. do you know what recall do you get?
  3. efConstruction ?
  4. M?
  5. Do you use post = some number?

searchivarius avatar Feb 22 '18 17:02 searchivarius

I am curious for some more insight on setting these parameters. I frequently query indexes for thousands of results (k ~ 1k-10k), so my instinct is to set efSearch to a high value. Will this fail if I do not also set a very high efConstruction? This might make index construction prohibitively expensive.

Any insight on how to tackle this problem? Is HNSW even the right algo to use if I am more interested in high recall at high k values, but precision is less important?

phdowling avatar Oct 19 '18 14:10 phdowling

@phdowling there's no guarantee. But setting efSearch to a higher value usually helps. Increasing M and efConstruction also helps. Setting post to 1 or 2 also helps. However, all these measures also increase indexing time, sometimes quite dramatically.

Regarding your recall question: I understand it that you mean the extrinsic recall (the k-NN recall and precision is the same thing, if you consider intrinsic recall/precision).

Given my assumption, you may want to retrieve a lot of questions using some cheap but low-extrinsic-precision distance, right? For high recall scenarios, HNSW might be not the best tool.

What is good is certainly a difficult research question, but off the top of my head:

  1. One can sparsify the vectors and use classic inverted files.
  2. One can use a clustering algorithm (e.g., NMSLIB limplements list_clusters flat-clustering algorithm):
release/experiment -s l2 -i ~/TextCollect/VectorSpaces/sift_texmex_base100krand.txt -m list_clusters -c bucketSize=1000 -t maxLeavesToVisit=1 -t maxLeavesToVisit=10 -t maxLeavesToVisit=50 -k 5000  -b 1 -Q 500 --threadTestQty 4  -D 100000

Also, what sort of data do you use?

searchivarius avatar Oct 19 '18 15:10 searchivarius

I'm still having the same issue, even when using high values of efConstruction, M, efSearch, and post=2.

My data is approximately 250000 x 2600 in size, and for each query point I'm looking to retrieve k = 1000 nearest neighbors. For some of the points, the number of neighbors retrieved is around 80, instead of the 1000 needed.

alberto-oliveira avatar Jun 10 '19 21:06 alberto-oliveira

@alberto-oliveira 1) could you share the data 2) what's your target retrieval speed? that's how fast should it be?

searchivarius avatar Jun 10 '19 22:06 searchivarius