cuml
[FEA] Changing `COO` `Index_Type` in UMAP to prevent overflow when running with large datasets
Description
UMAP currently cannot run on large datasets because of an integer overflow.
raft::sparse::COO defaults to int for its Index_Type, which is too narrow to hold the element counts that large inputs produce.
Once raft supports this, we need to update UMAPAlgo::FuzzySimplSet::ML::run() to take a COO with an Index_Type other than int.
Details
Specifically, coo_symmetrize (a raft function called from UMAPAlgo::FuzzySimplSet::ML::run()) allocates space for nnz * 2 entries on the device. For a large dataset (e.g. 88M samples with a knn graph degree of 16), this value exceeds the maximum int: 88M * 16 * 2 = 2,816,000,000, while INT_MAX is 2,147,483,647 for a 32-bit int.
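The arithmetic above can be checked at compile time. The following is only a minimal sketch: `symmetrized_nnz` and `fits_in_index` are hypothetical helpers written for this illustration and are not part of raft or cuML, and the check assumes a platform where int is 32 bits wide.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not a raft/cuML function): the nnz * 2 entry count
// that symmetrizing a kNN graph would need, computed in 64-bit arithmetic
// so that the computation itself cannot overflow for realistic sizes.
constexpr std::int64_t symmetrized_nnz(std::int64_t n_samples, std::int64_t n_neighbors)
{
  return n_samples * n_neighbors * 2;
}

// Hypothetical helper: true if `count` is representable in Index_Type.
// Intended for signed index types no wider than 64 bits.
template <typename Index_Type>
constexpr bool fits_in_index(std::int64_t count)
{
  return count >= 0 &&
         count <= static_cast<std::int64_t>(std::numeric_limits<Index_Type>::max());
}

// The example from this issue: 88M samples, kNN graph degree 16.
static_assert(symmetrized_nnz(88000000, 16) == 2816000000LL,
              "88M * 16 * 2 entries");
static_assert(!fits_in_index<int>(symmetrized_nnz(88000000, 16)),
              "the count does not fit in a 32-bit int");
static_assert(fits_in_index<std::int64_t>(symmetrized_nnz(88000000, 16)),
              "a 64-bit index type can hold the count");
```

A guard of this kind could also run at the call site to fail early with a clear error until a wider Index_Type is supported, rather than overflowing silently.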