pyglass icon indicating copy to clipboard operation
pyglass copied to clipboard

Fix the problem of low recall rate of glass(hnsw) in multi-core environment

Open cwj0bzxg opened this issue 1 year ago • 4 comments

This PR is to fix the issue mentioned in #10

In the Graph structure of glass (hnsw), the neighbor's internal ID is stored in the neighbor list, and the internal ID is used to find the corresponding neighbor list. I noticed that you can store the neighbor's external ID in the neighbor list and use the external ID to access the corresponding neighbor list. These two methods are equivalent, but the latter method can avoid the problem of low recall rate caused by inconsistency between internal ID and external ID.

There is an example in the following branch: https://github.com/cwj0bzxg/pyglass/blob/fix_bug_deep10M/main.cpp

cwj0bzxg avatar May 14 '24 02:05 cwj0bzxg

Thanks for fixing this! Unfortunately, it looks like this version is non-deterministic due to some kind of a concurrency issue in Build(). The following example code is deterministic, but if you take out the with threadpoolctl line, it becomes non-deterministic:

import glassppy as glass
import numpy as np
import threadpoolctl

n, d = 10000, 128
np.random.seed(0)
X = np.random.randn(n, d)
Y = np.random.randn(d)

index = glass.Index(index_type="HNSW", dim=d, metric="L2", R=32, L=50)
with threadpoolctl.threadpool_limits(limits=1, user_api='openmp'):
    graph = index.build(X)

searcher = glass.Searcher(graph=graph, data=X, metric="L2", level=1)
searcher.set_ef(32)
ret = searcher.search(query=Y, k=10)
print(ret)

Wainberg avatar Jul 02 '24 18:07 Wainberg

This PR is to fix the issue mentioned in #10

In the Graph structure of glass (hnsw), the neighbor's internal ID is stored in the neighbor list, and the internal ID is used to find the corresponding neighbor list. I noticed that you can store the neighbor's external ID in the neighbor list and use the external ID to access the corresponding neighbor list. These two methods are equivalent, but the latter method can avoid the problem of low recall rate caused by inconsistency between internal ID and external ID.

There is an example in the following branch: https://github.com/cwj0bzxg/pyglass/blob/fix_bug_deep10M/main.cpp 请问一下,你这里面的召回率精度有提升吗?我用main.cpp的例子跑出来的recall=81%左右?

handsomeZhuang avatar Sep 02 '24 03:09 handsomeZhuang

请问一下,你这里面的召回率精度有提升吗?我用main.cpp的例子跑出来的recall=81%左右?

The recall is improved compared to the original. I just tested it. For the deep10m dataset, when ef=500, recall=97.8973%, and when ef=1000, recall=99.2578%. The deep10m dataset was downloaded through create_dataset.py in https://github.com/harsha-simhadri/big-ann-benchmarks.

cwj0bzxg avatar Sep 04 '24 04:09 cwj0bzxg

请问一下,你这里面的召回率精度有提升吗?我用main.cpp的例子跑出来的recall=81%左右?

The recall is improved compared to the original. I just tested it. For the deep10m dataset, when ef=500, recall=97.8973%, and when ef=1000, recall=99.2578%. The deep10m dataset was downloaded through create_dataset.py in https://github.com/harsha-simhadri/big-ann-benchmarks.

我这边测试数据集是SIFT1M 128的维度,ef=1000,精度约91%,麻烦可以发一下你那边build函数的其他参数吗?比如:index = std::make_uniqueglass::HNSW(dim, "L2", X,Y)中的X和Y,以及Optimize(Z)的参数Z吗?

handsomeZhuang avatar Sep 04 '24 12:09 handsomeZhuang