pyglass
pyglass copied to clipboard
Fix the problem of low recall rate of glass(hnsw) in multi-core environment
This PR is to fix the issue mentioned in #10
In the Graph structure of glass (hnsw), the neighbor's internal ID is stored in the neighbor list, and the internal ID is used to find the corresponding neighbor list. I noticed that you can store the neighbor's external ID in the neighbor list and use the external ID to access the corresponding neighbor list. These two methods are equivalent, but the latter method can avoid the problem of low recall rate caused by inconsistency between internal ID and external ID.
There is an example in the following branch: https://github.com/cwj0bzxg/pyglass/blob/fix_bug_deep10M/main.cpp
Thanks for fixing this! Unfortunately, it looks like this version is non-deterministic due to some kind of a concurrency issue in Build(). The following example code is deterministic, but if you take out the with threadpoolctl line, it becomes non-deterministic:
import glassppy as glass
import numpy as np
import threadpoolctl
n, d = 10000, 128
np.random.seed(0)
X = np.random.randn(n, d)
Y = np.random.randn(d)
index = glass.Index(index_type="HNSW", dim=d, metric="L2", R=32, L=50)
with threadpoolctl.threadpool_limits(limits=1, user_api='openmp'):
graph = index.build(X)
searcher = glass.Searcher(graph=graph, data=X, metric="L2", level=1)
searcher.set_ef(32)
ret = searcher.search(query=Y, k=10)
print(ret)
This PR is to fix the issue mentioned in #10
In the Graph structure of glass (hnsw), the neighbor's internal ID is stored in the neighbor list, and the internal ID is used to find the corresponding neighbor list. I noticed that you can store the neighbor's external ID in the neighbor list and use the external ID to access the corresponding neighbor list. These two methods are equivalent, but the latter method can avoid the problem of low recall rate caused by inconsistency between internal ID and external ID.
There is an example in the following branch: https://github.com/cwj0bzxg/pyglass/blob/fix_bug_deep10M/main.cpp 请问一下,你这里面的召回率精度有提升吗?我用main.cpp的例子跑出来的recall=81%左右?
请问一下,你这里面的召回率精度有提升吗?我用main.cpp的例子跑出来的recall=81%左右?
The recall is improved compared to the original. I just tested it. For the deep10m dataset, when ef=500, recall=97.8973%, and when ef=1000, recall=99.2578%. The deep10m dataset was downloaded through create_dataset.py in https://github.com/harsha-simhadri/big-ann-benchmarks.
请问一下,你这里面的召回率精度有提升吗?我用main.cpp的例子跑出来的recall=81%左右?
The recall is improved compared to the original. I just tested it. For the deep10m dataset, when ef=500, recall=97.8973%, and when ef=1000, recall=99.2578%. The deep10m dataset was downloaded through create_dataset.py in https://github.com/harsha-simhadri/big-ann-benchmarks.
我这边测试数据集是SIFT1M 128的维度,ef=1000,精度约91%,麻烦可以发一下你那边build函数的其他参数吗?比如:index = std::make_uniqueglass::HNSW(dim, "L2", X,Y)中的X和Y,以及Optimize(Z)的参数Z吗?