bad recall performance

Open tangchen2 opened this issue 2 years ago • 9 comments

I built an HNSW index (cosine distance) on 10 million vectors, but the recall is poor: some true neighbors (very small cosine distance to the query) are not recalled, while some unrelated items are returned instead. Would num_threads, M, cef (efConstruction), or sef (ef) affect the HNSW recall? (BTW, in my experiment M=72, efConstruction=200, ef=2000.)
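For reference, a minimal sketch of how these knobs map onto hnswlib's Python bindings (the dimension, dataset size, and random data below are placeholders, not values from this issue; the real index holds ~10M vectors):

```python
import numpy as np
import hnswlib

dim, num_elements = 128, 100_000                 # placeholders for illustration
data = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space='cosine', dim=dim)
# M and ef_construction ("cef") are fixed at build time and shape the graph.
index.init_index(max_elements=num_elements, M=72, ef_construction=200)
index.add_items(data, num_threads=-1)            # num_threads only parallelizes insertion

index.set_ef(2000)                               # "sef": query-time beam width, must be >= k
labels, distances = index.knn_query(data[:10], k=10)
```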

tangchen2 avatar Sep 15 '21 03:09 tangchen2

@tangchen2 Typically efConstruction should be greater than or equal to ef, so that might be one of the issues (you have efConstruction=200 but ef=2000). I wonder, does your data have duplicates (elements with almost exactly the same values)? If so, deduplicating should bring a huge performance/accuracy gain.
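As an illustration of the deduplication idea (not a procedure prescribed by hnswlib; the rounding tolerance here is an arbitrary assumption), near-exact duplicates can be dropped before indexing like this:

```python
import numpy as np

def dedupe(vectors, decimals=5):
    """Drop near-exact duplicates by comparing rounded, normalized vectors."""
    # Normalize so that cosine-identical vectors compare equal.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    normalized = vectors / np.clip(norms, 1e-12, None)
    # Round to a fixed tolerance and keep the first occurrence of each row.
    keys = np.round(normalized, decimals=decimals)
    _, keep = np.unique(keys, axis=0, return_index=True)
    return vectors[np.sort(keep)]
```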

yurymalkov avatar Sep 16 '21 03:09 yurymalkov

I also encountered this problem; it seems that HNSW's recall drops significantly on large datasets.

I have 5000 pairs of matching face features. I mix one image from each pair (5000 in total) into a gallery of completely unrelated face features and use the other 5000 as queries to retrieve them.

M = 48, efConstruction = 200, feature dim = 384.

Here are the HNSW results:

| gallery size | top 1 | top 5 | top 10 | top 20 | top 50 | top 100 | top 256 | top 512 | top 1024 |
|---|---|---|---|---|---|---|---|---|---|
| 1 million | 0.772950 | 0.818441 | 0.836393 | 0.850265 | 0.866585 | 0.880661 | 0.894941 | 0.906365 | 0.915749 |
| 5 million | 0.672787 | 0.714810 | 0.728070 | 0.741942 | 0.758262 | 0.767238 | 0.777642 | 0.785190 | 0.790698 |

And here is the GPU flat (brute-force) search result for comparison:

| gallery size | top 1 | top 5 | top 10 | top 20 | top 50 | top 100 | top 256 | top 512 | top 1024 |
|---|---|---|---|---|---|---|---|---|---|
| 5 million | 0.725622 | 0.780090 | 0.801102 | 0.821297 | 0.847001 | 0.861485 | 0.879845 | 0.893921 | 0.907181 |

Is there something wrong with my setup? If necessary, I can share my datasets.
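A flat baseline like the one above can be reproduced with an exhaustive inner-product search; a minimal sketch using faiss (assuming L2-normalized features so that inner product equals cosine similarity; the original results presumably used a GPU variant, e.g. via faiss.index_cpu_to_all_gpus):

```python
import faiss

def flat_search(gallery, queries, k):
    """Exact cosine search via inner product on L2-normalized vectors."""
    dim = gallery.shape[1]
    faiss.normalize_L2(gallery)    # in-place L2 normalization
    faiss.normalize_L2(queries)
    index = faiss.IndexFlatIP(dim)
    index.add(gallery)
    scores, ids = index.search(queries, k)
    return scores, ids
```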

Jowekk avatar Sep 18 '21 02:09 Jowekk

I checked this discussion: https://github.com/nmslib/hnswlib/issues/164. To check whether there is duplicate data, I used the 5 million gallery vectors to retrieve themselves:

| top k | hits | recall |
|---|---|---|
| 1 | 4,861,229 / 5,000,000 | 0.972246 |
| 5 | 4,999,996 / 5,000,000 | 0.999999 |
| 10 | 5,000,000 / 5,000,000 | 1 |
| 20 | 5,000,000 / 5,000,000 | 1 |
| 50 | 5,000,000 / 5,000,000 | 1 |

So there is a little duplicate data in my gallery, but most of it does not appear to be repeated. If other experimental details need to be supplemented, I can provide the corresponding experiments.
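A self-retrieval check like this can be scripted directly against the index (a minimal sketch; `index`, `gallery`, and `ids` are assumed to be the already-built hnswlib index, its vectors, and their labels):

```python
import numpy as np

def self_recall(index, gallery, ids, k=1):
    """Query the index with its own items; each item should find itself within rank k."""
    labels, _ = index.knn_query(gallery, k=k)
    hits = np.any(labels == np.asarray(ids)[:, None], axis=1).sum()
    return hits / len(gallery)
```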

Jowekk avatar Sep 18 '21 03:09 Jowekk

@Jowekk I am not sure I understand. Is the recall you measure the recall of the full ML pipeline for the same person?

yurymalkov avatar Sep 21 '21 05:09 yurymalkov

5000 pairs: pairs of images taken of 5000 different people. Gallery data: completely unrelated to those 5000 people.

I used one image from each pair (5000 in total) as batch queries to retrieve the other 5000 paired images. If the correct mate is the top-1 result for 1000 of the queries, then top[1] = 1000 / 5000 = 0.2. Did I describe it clearly?
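In code, this metric can be computed from the query results (a minimal sketch; `mate_ids[i]` is assumed to be the gallery label of query i's paired image):

```python
import numpy as np

def top_k_recall(index, queries, mate_ids, k):
    """Fraction of queries whose paired image appears among the top-k results."""
    labels, _ = index.knn_query(queries, k=k)
    hits = np.any(labels == np.asarray(mate_ids)[:, None], axis=1).sum()
    return hits / len(queries)

# e.g. the top[1], top[5], ... values reported above:
# for k in (1, 5, 10, 20, 50, 100, 256, 512, 1024):
#     print(k, top_k_recall(index, queries, mate_ids, k))
```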

Jowekk avatar Sep 22 '21 02:09 Jowekk

@Jowekk I see, I didn't notice you had brute-force results. I wonder, did you try altering the ef parameter (and what happens if you set it to, say, 10000)?

yurymalkov avatar Sep 22 '21 05:09 yurymalkov

ef = 10000 is too big to meet our latency requirements. When ef and the gallery size increase, a large number of random memory accesses occur, because neighbors are stored at random locations in memory; this random-access bottleneck makes QPS fall quickly. Do you think the number of neighbors scanned is too small? Here is a small experiment counting the neighbors scanned per query at different gallery sizes:

| gallery size | 10,000 | 100,000 | 1,000,000 |
|---|---|---|---|
| neighbors scanned | 9049 | 29881 | 49351 |

Jowekk avatar Sep 22 '21 06:09 Jowekk

Looking at how recall changes with ef, while keeping other parameters fixed, provides diagnostic information. E.g., if ef is extremely large and recall is still low, that hints that the graph has disconnected components, etc.
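Such a sweep might look like this (a minimal sketch, reusing the hypothetical `top_k_recall` helper from the earlier comment):

```python
# Sweep ef with everything else fixed. Recall that plateaus well below 1.0
# even at very large ef suggests graph connectivity problems.
for ef in (200, 500, 1000, 2000, 5000, 10000):
    index.set_ef(ef)
    print(ef, top_k_recall(index, queries, mate_ids, k=1))
```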

yurymalkov avatar Sep 22 '21 06:09 yurymalkov

OK, this is a good idea worth trying. Thank you!

Jowekk avatar Sep 22 '21 07:09 Jowekk