
[FEA] Changing `COO` `Index_Type` in UMAP to prevent overflow when running with large datasets

Open jinsolp opened this issue 1 year ago • 0 comments

Description

UMAP currently cannot run on large datasets because of an integer overflow. `raft::sparse::COO` defaults to `int` for its `Index_Type`, which is too narrow once the number of nonzeros gets large.

When this issue is solved, we need to update UMAPAlgo::FuzzySimplSet::ML::run() to take COO with an Index_Type other than int.

Details

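Specifically, `coo_symmetrize` (a raft function called from `UMAPAlgo::FuzzySimplSet::ML::run()`) allocates space for `nnz * 2` entries on the device. For a large dataset (e.g. 88M samples with a kNN graph degree of 16), this value exceeds `INT_MAX` for a 32-bit `int`: 88M * 16 * 2 = 2,816,000,000 > 2,147,483,647. Note that `nnz` itself (88M * 16 = 1,408,000,000) still fits in an `int`; only the doubled allocation size overflows.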
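The arithmetic above can be checked with a small standalone sketch. This is not raft or cuML code: `symmetrize_alloc_size`, `fits_in`, and `checked_narrow` are hypothetical helpers made up for illustration, and the only assumption taken from the issue is that the symmetrize step needs `nnz * 2` entries.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not a raft function): number of entries a
// symmetrize-style step would need, computed in 64-bit arithmetic.
constexpr std::int64_t symmetrize_alloc_size(std::int64_t n_samples, std::int64_t knn_degree)
{
  return n_samples * knn_degree * 2;  // nnz * 2, as described in the issue
}

// True if `value` is representable in the signed index type IndexT.
template <typename IndexT>
constexpr bool fits_in(std::int64_t value)
{
  return value >= 0 && value <= static_cast<std::int64_t>(std::numeric_limits<IndexT>::max());
}

// Narrow to IndexT, asserting (in debug builds) that no overflow occurs.
template <typename IndexT>
IndexT checked_narrow(std::int64_t value)
{
  assert(fits_in<IndexT>(value));
  return static_cast<IndexT>(value);
}

// Numbers from the example in the issue: 88M samples, kNN degree 16.
constexpr std::int64_t kSamples = 88'000'000;
constexpr std::int64_t kDegree  = 16;
constexpr std::int64_t kNnz     = kSamples * kDegree;
constexpr std::int64_t kAlloc   = symmetrize_alloc_size(kSamples, kDegree);

static_assert(fits_in<std::int32_t>(kNnz), "nnz alone still fits in a 32-bit int");
static_assert(!fits_in<std::int32_t>(kAlloc), "nnz * 2 overflows a 32-bit int");
static_assert(fits_in<std::int64_t>(kAlloc), "a 64-bit index type is wide enough");
```

The `static_assert`s are evaluated at compile time, so the snippet compiling (with C++14 or later) is itself the check. A guard like `checked_narrow` is one possible way to fail loudly instead of silently wrapping while the index type is still `int`.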

jinsolp, Aug 06 '24 00:08