Wrapper for all-neighbors knn graph building
Description
This PR introduces a wrapper for building all-neighbors knn graphs. It includes a multi-GPU, batched all-neighbors graph building algorithm, which enables all-neighbors graph building for datasets that do not fit in device memory.
Batching is enabled by setting `n_clusters > 1`. The multi-GPU feature is enabled simply by setting `CUDA_VISIBLE_DEVICES`. Further details below!
Related Issue
- https://github.com/rapidsai/cuvs/issues/727
[Supported ANN algorithms]
- IVFPQ: higher recall, takes longer to run
- NN Descent: lower recall in general (and unacceptably low recall for some data distributions), but faster to run. Also uses less memory (since it uses half precision)
- Plan to add brute force in the future as a separate PR (only usable when `n_clusters = 1`, because batching means we cannot guarantee 100% accuracy)
[Supported data types] Currently only fp32 data types are supported, but we plan to explore and add fp16 data types in the future.
[Number of nearest clusters and total number of clusters]
`batch_ann::index_params` has two important configurable parameters:
- `n_nearest_clusters`: the number of clusters each data point is assigned to
- `n_clusters`: the total number of clusters
Some important notes about these two parameters:
- The ratio `n_nearest_clusters / n_clusters` determines device memory usage (see the sketch after this list).
  - Approximately (`n_nearest_clusters` / `n_clusters`) * `num_rows_in_entire_data` rows will be put on device memory at once.
  - E.g. between (`n_nearest_clusters` / `n_clusters`) = 2/10 and 2/20, the latter will use less device memory.
- A larger `n_nearest_clusters` results in better accuracy of the final all-neighbors knn graph.
  - E.g. with similar device memory usage, (`n_nearest_clusters` / `n_clusters`) = 4/20 will have better accuracy than 2/10, at the cost of performance.
- When `n_clusters = 1`, batching is not performed.
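As a rough illustration of the memory note above, here is a standalone back-of-the-envelope sketch (this is not part of the cuVS API, and all numbers are made up):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical dataset and batching configuration, for illustration only.
  std::int64_t num_rows_in_entire_data = 100'000'000;
  int n_nearest_clusters = 2;
  int n_clusters         = 20;

  // Approximate number of rows resident on device at once:
  // (n_nearest_clusters / n_clusters) * num_rows_in_entire_data
  std::int64_t rows_on_device =
    num_rows_in_entire_data * n_nearest_clusters / n_clusters;

  std::printf("approx. rows on device at once: %lld\n",
              static_cast<long long>(rows_on_device));  // ~10M rows here
  return 0;
}
```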
How to Use
It can be used like the following:
```cpp
// in test.cpp file
batch_ann::index_params params;
params.n_clusters         = 10;  // total number of clusters
params.n_nearest_clusters = 2;   // number of clusters each data point is assigned to

// Option 1: for usage with IVFPQ
params.graph_build_params = batch_ann::graph_build_params::ivf_pq_params{};
// Option 2: for usage with NN Descent
params.graph_build_params = batch_ann::graph_build_params::nn_descent_params{};

auto index = batch_ann::build(handle_, host_data_view, k, params);
```
`batch_ann::build` internally runs on a single GPU or multiple GPUs depending on the visible CUDA devices:
```sh
CUDA_VISIBLE_DEVICES=0 ./test          # runs single-GPU batching
CUDA_VISIBLE_DEVICES=0,1,2,3 ./test    # uses 4 GPUs for batching
```
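For completeness, here is a fuller sketch of what a call site might look like. The surrounding setup (the raft handle and the host mdspan wrapping) is illustrative and assumes the data already lives in a host buffer; `k` and the parameter values are placeholders:

```cpp
#include <raft/core/host_mdspan.hpp>
#include <raft/core/resources.hpp>

#include <cstdint>

// Also include the batch_ann header introduced in this PR (path omitted here).

// Illustrative call site: wraps an existing host buffer and builds the
// all-neighbors knn graph with NN Descent as the inner ANN algorithm.
void build_all_neighbors_graph(const float* host_data,
                               std::int64_t n_rows,
                               std::int64_t dim)
{
  raft::resources handle_;

  // View over row-major host data; no copy is made here.
  auto host_data_view =
    raft::make_host_matrix_view<const float, std::int64_t>(host_data, n_rows, dim);

  batch_ann::index_params params;
  params.n_clusters         = 10;
  params.n_nearest_clusters = 2;
  params.graph_build_params = batch_ann::graph_build_params::nn_descent_params{};

  std::int64_t k = 32;  // number of neighbors per row (placeholder)
  auto index = batch_ann::build(handle_, host_data_view, k, params);
}
```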
Numbers
All experiments were run on L40 GPUs.
[Large dataset to check performance and memory usage]
Below is a result of running on a 269GB dataset (wiki-all).
- IVFPQ takes longer (because it goes through build, search, and refine steps) and uses more memory. The reason it uses more memory is that NN Descent puts data on device in half precision floating point, regardless of the given data type. We plan to test the impact of using fp16 in IVFPQ on recall in the future.
- Gains are better for IVFPQ because it does more work on the GPU compared to NN Descent.
- Even with the longer end-to-end runtime and higher memory usage (unless it uses fp16), IVFPQ is still worth using because of the recall issues with NN Descent.
[Smaller dataset to verify recall]
It is impossible to generate a ground-truth all-neighbors graph for a large dataset, so below is the result on a 3.8GB dataset (Deep images) to check how `n_nearest_clusters` and `n_clusters` affect recall.
- Recall is consistent regardless of how many GPUs are used.
- Note the recall difference between (`n_nearest_clusters` / `n_clusters`) = 2/20 and 4/40. These two configurations use a similar amount of device memory, but the latter has much better recall.
- Note that NN Descent recalls are generally lower than IVFPQ recalls (NN Descent results would have been much worse for other data distributions at this data size).
- At this data size, adding more GPUs doesn't always help
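For reference, recall here is the fraction of ground-truth neighbors recovered by the built graph. A minimal host-side sketch of how it can be computed, assuming both graphs are row-major `n_rows x k` index arrays (the function name and layout are illustrative, not from this PR):

```cpp
#include <cstdint>
#include <unordered_set>

// Illustrative host-side recall computation: the fraction of ground-truth
// neighbors that also appear in the built graph's neighbor lists.
double knn_graph_recall(const std::int64_t* graph,
                        const std::int64_t* ground_truth,
                        std::int64_t n_rows,
                        std::int64_t k)
{
  std::int64_t hits = 0;
  for (std::int64_t i = 0; i < n_rows; ++i) {
    // Ground-truth neighbor ids for row i.
    std::unordered_set<std::int64_t> truth(ground_truth + i * k,
                                           ground_truth + (i + 1) * k);
    for (std::int64_t j = 0; j < k; ++j) {
      if (truth.count(graph[i * k + j])) { ++hits; }
    }
  }
  return static_cast<double>(hits) / static_cast<double>(n_rows * k);
}
```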