cuvs [TASK] Reuse `all_neighbors` APIs in CAGRA ACE build

The CAGRA augmented core extraction (ACE) build method introduced in PR #1404 supports building CAGRA indices on very large datasets that exceed GPU memory capacity. To this end, it partitions the dataset, similar to the batched all_neighbors approach. This issue tracks the overlap between the two implementations and the potential reuse of all_neighbors routines in ACE to minimize code duplication.

ace_get_partition_labels: This is similar to running the all_neighbors get_centroids_on_data_subsample and assign_clusters routines (see https://github.com/rapidsai/cuvs/pull/1404#discussion_r2411808609).

get_centroids_on_data_subsample runs balanced k-means on a subsample of the dataset to get centroids. It uses balanced kmeans::fit, which only supports float and int8_t input data, and the centroids have to be of type float. The all_neighbors implementation uses a single generic template type for both the data and the centroids, which would force type float for both. We could make the centroid type explicit and convert the subsampled dataset to float if other types are provided. What do you think @jinsol? Another difference is the number of samples, which we might need to add as an additional parameter.
assign_clusters assigns each data point to its top overlap_factor (2 for ACE) clusters. It uses brute_force::search, which expects float or half. However, the main issue is that global_nearest_cluster (the partition labels in ACE notation) is expected as a matrix view of the index type. This would be int64_t to match the expected extent type, so we would end up with twice the memory requirements for the labels. We could use int64_t during the batched brute_force::search and then convert to a 32-bit index type. This would also require changes in the all_neighbors implementation.
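The two conversions proposed above can be sketched in plain C++. This is only an illustration under stated assumptions, not cuVS code: `subsample_to_float`, `batch_search_fn` and `assign_labels_32bit` are hypothetical names, the callback stands in for one batched `brute_force::search` call, and everything runs on host vectors instead of device matrix views.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not a cuVS API): widen a subsample of any arithmetic
// type to float so that a float-only k-means fit could consume it.
template <typename T>
std::vector<float> subsample_to_float(const std::vector<T>& subsample)
{
  std::vector<float> out(subsample.size());
  std::transform(subsample.begin(), subsample.end(), out.begin(),
                 [](T v) { return static_cast<float>(v); });
  return out;
}

// Stand-in for one batched search call (assumption): fills `labels64` with
// the cluster ids of the rows starting at `offset`, row-major, with
// overlap_factor labels per row.
using batch_search_fn =
  std::function<void(std::size_t offset, std::vector<int64_t>& labels64)>;

// Hypothetical helper: keep 64-bit labels alive for one batch only and store
// the full n_rows x overlap_factor label matrix as uint32_t.
inline std::vector<uint32_t> assign_labels_32bit(std::size_t n_rows,
                                                 std::size_t overlap_factor,
                                                 std::size_t batch_size,
                                                 const batch_search_fn& search_batch)
{
  assert(batch_size > 0 && overlap_factor > 0);
  std::vector<uint32_t> labels32(n_rows * overlap_factor);
  std::vector<int64_t> labels64;  // scratch buffer, reused for every batch
  for (std::size_t offset = 0; offset < n_rows; offset += batch_size) {
    const std::size_t rows = std::min(batch_size, n_rows - offset);
    labels64.assign(rows * overlap_factor, -1);
    search_batch(offset, labels64);
    for (std::size_t i = 0; i < labels64.size(); ++i) {
      const int64_t v = labels64[i];
      // Reject labels that cannot be represented in the 32-bit type.
      if (v < 0 || v > static_cast<int64_t>(std::numeric_limits<uint32_t>::max())) {
        throw std::out_of_range("cluster label does not fit in 32 bits");
      }
      labels32[offset * overlap_factor + i] = static_cast<uint32_t>(v);
    }
  }
  return labels32;
}
```

With this layout, the 64-bit buffer is only `batch_size * overlap_factor` entries large, while the persistent label matrix stays at 32 bits per entry. Whether the range check is worth its cost would depend on how large the cluster count can get in practice.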

ace_create_forward_and_backward_lists has some overlap with the all_neighbors get_inverted_indices (see https://github.com/rapidsai/cuvs/pull/1404#discussion_r2411915915). After analyzing the overlap, I believe the two routines differ significantly, since get_inverted_indices forms a single cluster from the overlap_factor = 2 clusters. ACE, however, needs the primary and augmented partitions to be kept independent. We also have to separate ACE's in-memory and disk paths, and ACE requires a forward mapping. I think unifying these is not desirable.
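To make the difference concrete, here is a minimal host-side sketch of keeping the primary and augmented partitions apart. It is one possible reading of the description above, not the ACE implementation: `partition_lists`, `build_partition_lists`, and the exact meaning given to the forward mapping (row id to position inside its primary partition) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (assumption): per-partition row lists kept separate
// for primary and augmented membership, plus a forward mapping per row.
struct partition_lists {
  std::vector<std::vector<uint32_t>> primary;    // partition -> rows whose closest cluster it is
  std::vector<std::vector<uint32_t>> augmented;  // partition -> rows whose second cluster it is
  std::vector<uint32_t> forward;                 // row -> position inside its primary partition
};

// `labels` is a row-major n_rows x 2 matrix (overlap_factor = 2):
// labels[2 * row] is the primary partition, labels[2 * row + 1] the augmented one.
inline partition_lists build_partition_lists(const std::vector<uint32_t>& labels,
                                             std::size_t n_partitions)
{
  const std::size_t n_rows = labels.size() / 2;
  partition_lists out;
  out.primary.resize(n_partitions);
  out.augmented.resize(n_partitions);
  out.forward.resize(n_rows);
  for (std::size_t row = 0; row < n_rows; ++row) {
    const uint32_t p = labels[2 * row];
    const uint32_t a = labels[2 * row + 1];
    // .at() throws std::out_of_range if a label exceeds n_partitions.
    out.forward[row] = static_cast<uint32_t>(out.primary.at(p).size());
    out.primary.at(p).push_back(static_cast<uint32_t>(row));
    out.augmented.at(a).push_back(static_cast<uint32_t>(row));
  }
  return out;
}
```

A merged inverted index would instead append each row to a single list per cluster for both of its labels, which loses the primary/augmented distinction that this sketch keeps.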

Oct 15 '25 10:10 julianmi

Thanks @julianmi for leaving this issue.

For 1, it seems like ACE also uses balanced k-means and currently converts the sampled data to float, so I think that would be fine!

Oct 22 '25 17:10 jinsolp

Marking another discussion: can we reuse all_neighbors::gpu_batched_build for the in-memory path? See https://github.com/rapidsai/cuvs/pull/1404/files#r2418194555

Nov 02 '25 23:11 tfeher