[BUG] NN Descent has low recall for large datasets
Description
NN Descent shows low recall (compared to brute-force kNN) on large datasets. This makes it difficult to scale up and out, and to use NN Descent for building kNN graphs for other algorithms such as UMAP.
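For context, recall here means the fraction of the exact k nearest neighbors (from brute-force kNN) that NN Descent recovers. Below is a minimal, self-contained sketch of that metric; `knn_recall` is a hypothetical helper for illustration, not RAFT's actual test utility:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// For each row, count how many NN Descent neighbors appear among the k
// exact neighbors from brute-force kNN, then average over all entries.
double knn_recall(const int64_t* brute_force_idx,  // [n_rows * k] exact neighbors
                  const int64_t* nn_descent_idx,   // [n_rows * k] graph neighbors
                  std::size_t n_rows,
                  std::size_t k)
{
  std::size_t hits = 0;
  for (std::size_t row = 0; row < n_rows; ++row) {
    std::unordered_set<int64_t> truth(brute_force_idx + row * k,
                                      brute_force_idx + (row + 1) * k);
    for (std::size_t j = 0; j < k; ++j) {
      hits += truth.count(nn_descent_idx[row * k + j]);
    }
  }
  return static_cast<double>(hits) / static_cast<double>(n_rows * k);
}
```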
Result for dataset with 1000 features
The recall does not improve past a certain point, even with many more iterations.
Result for dataset with 100 features
Not as severe as the 1000-feature dataset above, but recall is still low even with a smaller number of features.
Reproducing the bug
The experiments above were run specifically on this commit on raft's branch-24.08.
- Change the test inputs in the `raft/cpp/test/neighbors/ann_nn_descent.cuh` file to test larger datasets, as in the sketch after this list. `index_params.max_iterations` is set to 100 in the same file and can be changed to test a larger number of iterations. The test file randomly generates data in its `setup()` function.
- Build the test.
- Run the test:
  ./cpp/build/gtests/NEIGHBORS_ANN_NN_DESCENT_TEST --gtest_filter=AnnNNDescentTest/AnnNNDescentTestF_U32*
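For example, the `inputs` vector can be enlarged roughly like this. This is a sketch only: the `AnnNNDescentInputs` field order (`n_rows`, `dim`, `graph_degree`, `metric`, `host_dataset`, `min_recall`) is assumed from the test file on branch-24.08, and the values are illustrative:

```cpp
// Sketch: enlarge the test inputs in cpp/test/neighbors/ann_nn_descent.cuh.
// Field order is assumed; adjust to the actual struct definition.
const std::vector<AnnNNDescentInputs> inputs =
  raft::util::itertools::product<AnnNNDescentInputs>(
    {1000000},                                   // n_rows: scale toward large datasets
    {100, 1000},                                 // dim: the feature counts tested above
    {32, 64},                                    // graph_degree (= k)
    {raft::distance::DistanceType::L2Expanded},  // metric
    {false},                                     // host_dataset
    {0.90});                                     // min_recall threshold
```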
Some updates on this issue:
NN Descent uses `graph_degree = k` to set the `node_degree`, and the user-supplied `intermediate_graph_degree` is used to set the `internal_node_degree`. This causes an issue because larger datasets need to be built with a larger `node_degree` and `internal_node_degree`.
Thus:
- Directly using the given `graph_degree = k` as the `node_degree` doesn't provide enough "degree" for a large dataset.
- However, the `node_degree` having to scale with the dataset size is also problematic in terms of memory usage.
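For reference, a minimal sketch of how these knobs are supplied when building an NN Descent index, assuming the public `index_params` of `raft::neighbors::experimental::nn_descent` (the `node_degree`/`internal_node_degree` names are internal, not part of the public API):

```cpp
#include <raft/neighbors/nn_descent.cuh>

// Sketch: the user-facing parameters that map onto the internal degrees.
raft::neighbors::experimental::nn_descent::index_params params;
params.graph_degree              = 64;   // k; used internally as node_degree
params.intermediate_graph_degree = 128;  // used internally as internal_node_degree
params.max_iterations            = 100;  // raised from the default for these experiments

// auto index = raft::neighbors::experimental::nn_descent::build(res, params, dataset);
```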
Below is the impact of changing the `node_degree` and `internal_node_degree` on the recall.