
searchKnn returns wrong neighbor on toy example

Open · yossibiton opened this issue 4 years ago · 3 comments

Hello,

I created a simple toy example and was surprised by the nearest neighbor I got from hnswlib. The data I used: 5 random vectors normalized to unit L2 norm, each duplicated 100 times and added to the HNSW index.

I then pass one of the 5 vectors to searchKnn and expect one of its duplicates as the first NN. However, that isn't the case: when I query with vector 3, the first NN is one of the duplicates of vector 0.

Notes:

  1. The behaviour on Mac and Linux is different (on Mac this test succeeds).
  2. If I add a small amount of random noise to each vector before adding it to the index, the test succeeds (sketched below).

Can someone explain this phenomenon?
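
For concreteness, note 2 above means perturbing each copy slightly before inserting it. A rough sketch of the change to the insertion loop of the test below (the amplitude 1e-3 is an arbitrary guess; RandomFloat is the same helper used in the test):

// Sketch of note 2: add a tiny perturbation to every copy so the index never
// sees exact duplicates. The amplitude 1e-3 is arbitrary.
for (int j = 0; j < MULTIPLY; j++) {
    std::vector<float> noisy = dataset[i];
    for (auto& x : noisy) {
        x += RandomFloat(1e-3);
    }
    index->addPoint(reinterpret_cast<const char*>(noisy.data()), i*MULTIPLY + j);
}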

#include <cmath>
#include <iostream>
#include <memory>
#include <queue>
#include <utility>
#include <vector>

#include <gtest/gtest.h>

#include "hnswlib/hnswlib.h"

// TestSuite and RandomFloat(scale) come from the local test fixture (not shown here).
TEST_F(TestSuite, KNN) {
    const int DIM = 256;       // vector dimensionality
    const int N = 5;           // number of distinct random vectors
    const int MULTIPLY = 100;  // number of exact copies inserted per vector

    std::shared_ptr<hnswlib::SpaceInterface<float>> space;
    space.reset(new hnswlib::InnerProductSpace(DIM));
    std::shared_ptr<hnswlib::HierarchicalNSW<float>> index;
    index.reset(new hnswlib::HierarchicalNSW<float>(space.get(), 100000, 16, 200));

    // create random data
    std::vector<std::vector<float>> dataset;
    for (int i = 0; i < N; i++) {
        // generate a random vector
        std::vector<float> p(DIM);
        for (auto& x : p) {
            x = RandomFloat(0.1);
        }
        // normalize the vector to unit L2 norm
        float norm = 0.0f;
        for (auto x : p) norm += x * x;
        norm = std::sqrt(norm);
        for (auto& x : p) {
            x /= norm;
        }
        dataset.push_back(p);

        // insert MULTIPLY exact duplicates of vector i, labelled i*MULTIPLY .. i*MULTIPLY + MULTIPLY - 1
        for (int j = 0; j < MULTIPLY; j++) {
            index->addPoint(reinterpret_cast<const char*>(dataset[i].data()), i*MULTIPLY + j);
        }
    }
    // query with the same data: the top-1 neighbor should be one of vector i's own duplicates
    for (int i = 0; i < N; i++) {
        std::priority_queue<std::pair<float, hnswlib::labeltype>> knn_result = index->searchKnn(
                reinterpret_cast<const char*>(dataset[i].data()), 1);
        std::pair<int, float> neighbor(knn_result.top().second, knn_result.top().first);
        std::cout << i << " --> " << neighbor.first << ", distance=" << neighbor.second << std::endl;
        EXPECT_TRUE(neighbor.first >= i*MULTIPLY && neighbor.first < (i+1)*MULTIPLY);
    }
}

Output I got (the test fails on the last query, with vector 3):

0 --> 6, distance=3.57628e-07
1 --> 108, distance=2.38419e-07
2 --> 201, distance=1.19209e-07
3 --> 6, distance=1.01034
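
For context on the numbers above: hnswlib's InnerProductSpace reports 1 - <q, x> as the distance, and every vector here is L2-normalized, so a true duplicate of the query gives a distance of 1 - 1 = 0, which matches the ~1e-7 values of the first three queries. The failing query returns distance 1.01034, i.e. <q, x> ≈ -0.01, so the returned point is essentially uncorrelated with the query rather than one of its duplicates.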

yossibiton · Dec 06 '20 16:12

This is due to the data duplication. The number of links per element is fixed, so all of an element's links end up connecting to the same kind of elements (its duplicates), effectively creating a highly disconnected graph. This is mitigated by the multi-layer structure, but that is far from perfect. It is probably time to fix this. One thing we can try is to have the neighbor-selection heuristic limit connections to elements that look like duplicates (it could be a space parameter; I think something like this was discussed before).

yurymalkov · Dec 06 '20 19:12

Thank you for the response. Do you have any idea why I get different results on Linux and Mac? On Mac the test succeeds, while on Linux it fails.

yossibiton · Dec 07 '20 15:12

I think they have different random number generators/concurrency, so you would probably also see different results if you changed the random_seed. I think we will be able to fix it in the near future.

yurymalkov · Dec 07 '20 21:12
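
A note on the seed mentioned above, as a minimal sketch: in hnswlib the construction RNG seed is the last argument of the HierarchicalNSW constructor (it defaults to 100), so the test can be rerun with an explicit seed; 42 below is an arbitrary choice.

// Same index parameters as in the test, but with an explicit random_seed so the
// level assignment, and therefore the graph, changes between runs.
index.reset(new hnswlib::HierarchicalNSW<float>(space.get(), 100000, 16, 200, /*random_seed=*/42));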