hnswlib
hnswlib copied to clipboard
searchKnn return wrong neighbor on toy example
Hello,
I have created some simple toy example, and was surprised from the NN i got from hnswlib. The data i was using is : taking 5 random vectors with L1 norm of 1, duplicating each of them 100 times and adding into hnsw index.
In the next step i choose one of the 5 vectors as input to searchKnn, and i expect to get one of the duplicates version of this vector as first NN. However, it isn't the case : when i query with vector 3 the first NN is one of the duplicates of vector 0.
Notes :
- the behaviour on mac & linux is different (on mac this test succeed)
- if i add some small random noise to each vector before adding to the index the test succeed.
Can someone explain this phenomenon ?
TEST_F(TestSuite, KNN) {
int DIM = 256;
int N = 5;
int MULTIPLY = 100;
std::shared_ptr<hnswlib::SpaceInterface<float>> space;
space.reset(new hnswlib::InnerProductSpace(DIM));
std::shared_ptr<hnswlib::HierarchicalNSW<float>> index;
index.reset(new hnswlib::HierarchicalNSW<float>(space.get(), 100000, 16, 200));
// create random data
std::vector<std::vector<float>> dataset;
for (int i = 0; i < N; i++) {
// generate random vector
std::vector<float> p(DIM);
for (std::vector<float>::iterator it = p.begin(); it != p.end(); it++) {
*it = RandomFloat(0.1);
}
// normalize vector again
float norm = 0.0;
for (auto x : p) norm += x*x;
norm = sqrt(norm);
for (int i=0; i < p.size(); i++) {
p[i] /= norm;
}
dataset.push_back(p);
for (int j = 0; j < MULTIPLY; j++) {
index->addPoint(reinterpret_cast<const char*>(dataset[i].data()), i*MULTIPLY + j);
}
}
// query with the same data
for (int i = 0; i < N; i++) {
std::priority_queue<std::pair<float, hnswlib::labeltype>> knn_result = index->searchKnn(
reinterpret_cast<const char*>(dataset[i].data()), 1);
std::pair<int, float> neighbor(knn_result.top().second, knn_result.top().first);
std::cout << i << " --> " << neighbor.first << ", distance=" << neighbor.second << std::endl;
EXPECT_TRUE(neighbor.first >= i*MULTIPLY and neighbor.first < (i+1)*MULTIPLY);
}
}
output i got (failure on the last query):
0 --> 6, distance=3.57628e-07
1 --> 108, distance=2.38419e-07
2 --> 201, distance=1.19209e-07
3 --> 6, distance=1.01034
This is due to data duplication. The number of links per element is fixed, so all links end up connecting to the same type of elements effectively creating a highly disconnected graph. This is mitigated by multi-layer structure, but it is very far from being perfect. Probably it is time to fix this. One thing we can try is to limit the connections in the heuristic that look like duplicates (it can be a space parameter, I think something like this was discussed before).
This is due to data duplication. The number of links per element is fixed, so all links end up connecting to the same type of elements effectively creating a highly disconnected graph. This is mitigated by multi-layer structure, but it is very far from being perfect. Probably it is time to fix this. One thing we can try is to limit the connections in the heuristic that look like duplicates (it can be a space parameter, I think something like this was discussed before).
Thank you for the response. Do you have any idea why i get different results on linux & mac ? On mac the test succeed, while on linux it fails.
I think they have different number generators/concurrency. So, probably, when you change the random_seed
, the results will be different.
I think we will be able to fix it in near future.