hnswlib
bad recall performance
I built an HNSW index (cosine distance) on 10 million vectors; however, recall is poor: some near neighbors (very small cosine distance to the query) are not recalled, while some unrelated vectors are returned instead. Would num_threads, M, cef, or sef affect HNSW recall? (BTW, in my experiment: M=72, cef=200, sef=2000.)
@tangchen2 Typically efConstruction should be greater than or equal to ef, so that might be one of the issues. I wonder, does your data have duplicates (elements with almost identical values)? If so, deduping should bring a huge performance/accuracy gain.
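For checking the duplicate question above, here is a minimal pure-Python sketch (the function name `find_duplicate_groups` and the rounding-based approach are my own illustration, not part of hnswlib): it groups elements with almost identical values by hashing their rounded components.

```python
from collections import defaultdict

def find_duplicate_groups(vectors, decimals=6):
    """Group vectors that are (near-)exact duplicates by hashing
    their rounded components. Only catches elements with almost
    identical values, which is the case discussed here."""
    groups = defaultdict(list)
    for idx, vec in enumerate(vectors):
        key = tuple(round(x, decimals) for x in vec)
        groups[key].append(idx)
    return [ids for ids in groups.values() if len(ids) > 1]

data = [
    [0.1, 0.2, 0.3],
    [0.1, 0.2, 0.3],        # exact duplicate of vector 0
    [0.1000001, 0.2, 0.3],  # near-duplicate; collapses at 6 decimals
    [0.9, 0.8, 0.7],
]
print(find_duplicate_groups(data))  # [[0, 1, 2]]
```

Any group larger than one is a candidate for dedup before building the index.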
I also encountered this problem; it seems that HNSW's recall degrades seriously on large datasets.
I have 5000 pairs of matching face features. I mix one set of 5000 into a completely unrelated face gallery and use the other 5000 to retrieve them.
M = 48, efConstruction = 200, feature dim = 384.
Here are the results of HNSW, with a GPU flat (brute-force) search on the 5-million gallery for comparison:

| method | gallery size | top[1] | top[5] | top[10] | top[20] | top[50] | top[100] | top[256] | top[512] | top[1024] |
|---|---|---|---|---|---|---|---|---|---|---|
| HNSW | 1 million | 0.772950 | 0.818441 | 0.836393 | 0.850265 | 0.866585 | 0.880661 | 0.894941 | 0.906365 | 0.915749 |
| HNSW | 5 million | 0.672787 | 0.714810 | 0.728070 | 0.741942 | 0.758262 | 0.767238 | 0.777642 | 0.785190 | 0.790698 |
| GPU flat | 5 million | 0.725622 | 0.780090 | 0.801102 | 0.821297 | 0.847001 | 0.861485 | 0.879845 | 0.893921 | 0.907181 |
Am I doing something wrong? If necessary, I can open up my datasets.
I checked this discussion: https://github.com/nmslib/hnswlib/issues/164. As for whether there is duplicate data, I used the 5-million gallery to retrieve itself:
top[1]: 4861229/5000000 = 0.972246, top[5]: 4999996/5000000 = 0.999999, top[10]: 5000000/5000000 = 1, top[20]: 5000000/5000000 = 1, top[50]: 5000000/5000000 = 1
So there is a little duplicate data in my gallery, but most of it does not appear to be duplicated. If other experimental details need to be supplemented, I can run the corresponding experiments.
@Jowekk I am not sure I understand. The recall you measure is the recall of the full ML pipeline for the same person?
5000 pairs: pairs of images of 5000 different people. Gallery data: completely unrelated to those 5000 people.
I used one set of 5000 images as batch queries to recall the other 5000 paired images. If the top-1 result hits for 1000 queries, then top[1] = 1000 / 5000 = 0.2. Did I describe it clearly?
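The metric described above can be sketched in a few lines of plain Python (the helper name `recall_at_k` and the toy data are my own illustration): each query has one ground-truth paired id, and top[k] is the fraction of queries whose pair appears among the first k retrieved ids.

```python
def recall_at_k(results, ground_truth, k):
    """Fraction of queries whose ground-truth id appears in the
    top-k retrieved ids. `results[i]` is the ranked id list for
    query i; `ground_truth[i]` is the id of its paired image."""
    hits = sum(1 for res, gt in zip(results, ground_truth) if gt in res[:k])
    return hits / len(ground_truth)

# Toy example: 5 queries, paired ids 0..4.
results = [
    [0, 9, 8],  # hit at rank 1
    [7, 1, 6],  # hit at rank 2
    [5, 4, 2],  # hit at rank 3
    [9, 8, 7],  # miss
    [4, 3, 2],  # hit at rank 1
]
ground_truth = [0, 1, 2, 3, 4]
print(recall_at_k(results, ground_truth, 1))  # 0.4
print(recall_at_k(results, ground_truth, 3))  # 0.8
```

With 1000 top-1 hits out of 5000 queries this gives exactly the 0.2 from the example above.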
@Jowekk I see, I didn't notice you had brute-force results.
I wonder, did you try altering the ef parameter (and what happens if you set it to, say, 10000)?
ef = 10000 is too big to meet our latency requirements. When ef and the gallery size increase, a large number of random memory accesses occur, because neighbors are stored at random locations in memory; this random-access bottleneck makes QPS drop quickly. Do you think the number of neighbors scanned is too small? Here is a small experiment showing the number of neighbors scanned at different gallery sizes:
| gallery size | 10,000 | 100,000 | 1,000,000 |
|---|---|---|---|
| neighbors scanned | 9049 | 29881 | 49351 |
Looking at how recall changes with ef, while keeping the other parameters fixed, provides diagnostic information. E.g., if ef is extremely large and recall is still low, that hints that the graph has disconnected components, etc.
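If you can dump the index's neighbor lists (hnswlib does not expose them through a stable public API, so assume an adjacency dict extracted by whatever means; the helper `reachable_from` is my own illustration), a plain BFS from the entry point shows directly whether some elements are unreachable, i.e. can never be recalled no matter how large ef is:

```python
from collections import deque

def reachable_from(adjacency, entry):
    """BFS over a neighbor-list graph; returns the set of nodes
    reachable from `entry`. If this set is smaller than the whole
    graph, the unreached elements can never be found during search."""
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for nb in adjacency[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen

# Toy graph: nodes 0-2 are connected, 3-4 form a separate island.
adjacency = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
reached = reachable_from(adjacency, entry=0)
print(sorted(reached))                # [0, 1, 2]
print(len(reached) < len(adjacency))  # True -> disconnected components
```

A recall ceiling below 1.0 even at very large ef is consistent with such an island existing in the base layer.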
OK, this is a good idea worth trying. Thank you!