[BUG] NN Descent has low recall for large datasets
Description
NN Descent shows low recall (compared to brute-force kNN) on large datasets. This makes it difficult to scale up and out, and to use NN Descent for building kNN graphs for other algorithms such as UMAP.
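For context, recall here means the fraction of the exact k nearest neighbors (from brute-force kNN) that NN Descent recovers. Below is a minimal, self-contained sketch of that metric; `knn_recall` is a hypothetical helper for illustration, not RAFT's actual test utility:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// For each row, count how many NN Descent neighbors appear among the k
// exact neighbors from brute-force kNN, then average over all entries.
double knn_recall(const int64_t* brute_force_idx,  // [n_rows * k] exact neighbors
                  const int64_t* nn_descent_idx,   // [n_rows * k] graph neighbors
                  std::size_t n_rows,
                  std::size_t k)
{
  std::size_t hits = 0;
  for (std::size_t row = 0; row < n_rows; ++row) {
    std::unordered_set<int64_t> truth(brute_force_idx + row * k,
                                      brute_force_idx + (row + 1) * k);
    for (std::size_t j = 0; j < k; ++j) {
      hits += truth.count(nn_descent_idx[row * k + j]);
    }
  }
  return static_cast<double>(hits) / static_cast<double>(n_rows * k);
}
```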
Result for dataset with 1000 features
The recall does not improve past a certain point, even with many more iterations.
Result for dataset with 100 features
Not as severe as the 1000-feature dataset above, but recall is still low even with a smaller number of features.
Reproducing the bug
The experiments above were run specifically on this commit on raft's branch-24.08.
- Change the test inputs in the `raft/cpp/test/neighbors/ann_nn_descent.cuh` file to test larger datasets, as in the sketch after this list. `index_params.max_iterations` is set to 100 in the same file and can be changed to test a larger number of iterations. The test file randomly generates data in its `setup()` function.
- Build the test.
- Run the test:
  ./cpp/build/gtests/NEIGHBORS_ANN_NN_DESCENT_TEST --gtest_filter=AnnNNDescentTest/AnnNNDescentTestF_U32*
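For example, the `inputs` vector can be enlarged roughly like this. This is a sketch only: the `AnnNNDescentInputs` field order (`n_rows`, `dim`, `graph_degree`, `metric`, `host_dataset`, `min_recall`) is assumed from the test file on branch-24.08, and the values are illustrative:

```cpp
// Sketch: enlarge the test inputs in cpp/test/neighbors/ann_nn_descent.cuh.
// Field order is assumed; adjust to the actual struct definition.
const std::vector<AnnNNDescentInputs> inputs =
  raft::util::itertools::product<AnnNNDescentInputs>(
    {1000000},                                   // n_rows: scale toward large datasets
    {100, 1000},                                 // dim: the feature counts tested above
    {32, 64},                                    // graph_degree (= k)
    {raft::distance::DistanceType::L2Expanded},  // metric
    {false},                                     // host_dataset
    {0.90});                                     // min_recall threshold
```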
Some updates on this issue:
NN Descent uses `graph_degree = k` to set the `node_degree`, and the user-supplied `intermediate_graph_degree` is used to set the `internal_node_degree`. This causes an issue because larger datasets need to be built with a larger `node_degree` and `internal_node_degree`.
Thus:
- Directly using the given `graph_degree = k` as the `node_degree` doesn't provide enough "degree" for a large dataset.
- However, the `node_degree` having to scale with the dataset size is also problematic in terms of memory usage.
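For reference, a minimal sketch of how these knobs are supplied when building an NN Descent index, assuming the public `index_params` of `raft::neighbors::experimental::nn_descent` (the `node_degree`/`internal_node_degree` names are internal, not part of the public API):

```cpp
#include <raft/neighbors/nn_descent.cuh>

// Sketch: the user-facing parameters that map onto the internal degrees.
raft::neighbors::experimental::nn_descent::index_params params;
params.graph_degree              = 64;   // k; used internally as node_degree
params.intermediate_graph_degree = 128;  // used internally as internal_node_degree
params.max_iterations            = 100;  // raised from the default for these experiments

// auto index = raft::neighbors::experimental::nn_descent::build(res, params, dataset);
```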
Below is the impact of changing the `node_degree` and `internal_node_degree` on the recall.