Proposal: Accelerate Non-Negative Matrix Factorization (NMF) for Dimensionality Reduction
Is your feature request related to a problem? Please describe.
We wish to use cuML to accelerate our experiments on identifying gene expression programs of cell-type identity and cellular activity, specifically using the NMF algorithm to decompose matrices. Currently our experiments run on the CPU, with code based on scikit-learn.
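For context, a minimal sketch of the kind of scikit-learn call we currently run on the CPU (the parameter values are illustrative, not our exact settings):

```python
from sklearn.decomposition import NMF

# X: cells x genes expression matrix (non-negative, typically sparse CSR)
nmf = NMF(n_components=30, init="nndsvd", solver="cd",
          beta_loss="frobenius", max_iter=500, random_state=0)
W = nmf.fit_transform(X)   # per-cell usage of each gene expression program
H = nmf.components_        # programs x genes loading matrix
```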
Describe the solution you'd like
We would like to have an NMF algorithm in the cuml.decomposition module.
Describe alternatives you've considered
We've tried writing our own PyTorch-based NMF.
Additional context
We would like to contribute to the cuML package by implementing the C++ algorithm (if it is not already implemented) and the Python wrapper class.
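To make this concrete, the interface we would aim for is a hypothetical estimator mirroring sklearn.decomposition.NMF; none of the names below exist in cuML yet:

```python
# Target usage we have in mind -- cuml.decomposition.NMF does not exist yet, so the
# class name and parameters below are hypothetical and simply mirror
# sklearn.decomposition.NMF and existing cuML estimators such as PCA.
import cupy as cp
# from cuml.decomposition import NMF   # proposed import, not yet available

X = cp.random.rand(10_000, 2_000)      # non-negative cells x genes matrix on the GPU

nmf = NMF(n_components=30, init="nndsvd", solver="hals",
          beta_loss="frobenius", max_iter=500)
W = nmf.fit_transform(X)               # results stay on the device as CuPy arrays
H = nmf.components_
```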
@aaalce Thank you very much for your proposal! We would be happy to help you with contributing the implementation. Have you already assessed which algorithmic components would need to be implemented and which could be reused, e.g., from RAFT?
@aaalce,
This has come up several times in the past, and there have been a few different implementations (ALS, logistic factorization) that we've worked on. Are you familiar with the Implicit library? It achieves pretty great speedups on the GPU, and the author works for NVIDIA. The availability of the Implicit library is one of the reasons we've never hosted our own implementation. Is there an objective function or NMF variant above and beyond what's available in Implicit?
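For reference, GPU training with Implicit's ALS looks roughly like the following; this assumes a CUDA-enabled build of the library, and the expected orientation of the matrix passed to fit depends on the Implicit version, so check the docs:

```python
import implicit
from scipy.sparse import random as sparse_random

# Sparse non-negative interaction matrix in CSR format
X = sparse_random(10_000, 2_000, density=0.01, format="csr")

# Alternating Least Squares with GPU training enabled
model = implicit.als.AlternatingLeastSquares(factors=32, regularization=0.01,
                                             iterations=15, use_gpu=True)
model.fit(X)
```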
On a slightly related topic: are you interested in NMF for matrix completion or for dimensionality reduction? Is this for single-cell RNA tertiary analysis? Not sure if it's relevant here, but we have a repository of GPU-accelerated single-cell analysis notebooks that might be of interest.
@aaalce what do you think of the feedback from Corey? Would those suggestions solve your problem?
Hi @aaalce, just checking in to see if you're still interested in contributing this feature. We appreciate your proposal and the discussion so far. If you’ve had a chance to explore the Implicit library or the single-cell notebooks we mentioned, we’d love to hear your thoughts.
If we don’t hear back in the next couple of weeks, we might consider closing this issue for now, but please feel free to reopen or continue the discussion anytime.
Thanks again!
Hi all, thank you for the follow-up! After investigating both the Implicit library and the RAPIDS single-cell examples, we've determined that neither fully addresses our specific use case:
- Regarding the Implicit library: while it provides GPU-accelerated matrix factorization, it doesn't appear to directly implement Non-Negative Matrix Factorization (NMF) with the specific objective functions (Frobenius norm or Kullback-Leibler divergence; written out below) and solvers (e.g., Coordinate Descent, HALS) that are part of our workflow for gene expression program (GEP) extraction.
- Regarding the RAPIDS single-cell examples: we recognise these provide powerful acceleration for scRNA-seq preprocessing and the initial dimensionality reduction steps. However, our focus is on the downstream analysis step of extracting interpretable gene expression programs (GEPs), where NMF is more helpful. In our CPU-based experiments we've observed that NMF provides more interpretable representations of cellular functional states or identity features than methods like PCA for this type of data, largely due to its non-negativity constraints, which align well with the biological context of gene expression. Our matrices are often large, highly sparse, and non-negative.
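For clarity, the two objectives we are referring to are the standard NMF losses, with $X \approx WH$ and $W, H \ge 0$:

$$\min_{W \ge 0,\ H \ge 0} \ \tfrac{1}{2}\lVert X - WH \rVert_F^2 \quad \text{(Frobenius norm)}$$

$$\min_{W \ge 0,\ H \ge 0} \ \sum_{i,j}\Big(X_{ij}\log\tfrac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij}\Big) \quad \text{(generalized KL divergence)}$$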
Therefore, we still consider this a valuable addition for the bioinformatics community, and we remain very interested in contributing an NMF implementation to cuml.decomposition.
To address the questions about algorithmic components, we would like to implement:
a. Initialization strategies (random, NNDSVD variants)
b. Two primary solvers, HALS (Hierarchical Alternating Least Squares) and a Coordinate Descent solver. Each solver would contain these elements (a small NumPy sketch of the update step follows this list):
- GPU Kernels with NMF Update Rules:
  - For Coordinate Descent: kernels implementing the multiplicative update rules
  - For HALS: kernels implementing the alternating least squares updates
- Sparse Matrix Handling
- Regularization Options (L1/L2)
c. Convergence Monitoring
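To illustrate what the update-rule kernels would compute, here is a minimal NumPy sketch of the classic multiplicative updates for the Frobenius objective (illustrative only; the cuML implementation would use CUDA kernels via RAFT and handle sparse inputs, NNDSVD initialization, regularization, and convergence monitoring):

```python
import numpy as np

def nmf_multiplicative_update(X, rank, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates minimizing ||X - W @ H||_F^2 (Lee & Seung)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        # W <- W * (X H^T) / (W H H^T)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```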
We have not finished assessing which RAFT components we need; these are some we think will be helpful:
- raft::linalg::gemm and raft::sparse::linalg::spmm for matrix operations
- raft::linalg::svd for initialization
- raft::device_csr_matrix for sparse matrix handling
- raft::resources for resource management
Thank you @aaalce for the detailed description; we are excited that you are interested in contributing! Please let us know of any further questions you may have as you begin this, and we are happy to help.
@aaalce hi! Just checking in if you were able to get started on an implementation or if you need any help from our side?
For a GPU-enabled NMF, I was wondering if existing packages built on top of PyTorch or JAX would satisfy your use case?
There is a PyTorch implementation of NMF here: https://github.com/yoyolicoris/pytorch-NMF, which claims large speedups on the GPU versus the implementation in sklearn.decomposition.NMF.
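A rough usage sketch based on that repository's README; the torchnmf package name, NMF constructor, and fit signature are taken from the linked repo and may differ between versions, so treat them as assumptions:

```python
import torch
from torchnmf.nmf import NMF   # package/class names assumed from the linked repo

X = torch.rand(5000, 2000).cuda()     # non-negative expression-like matrix on the GPU
model = NMF(X.shape, rank=30).cuda()  # factorizes X into non-negative factors
model.fit(X)                          # multiplicative-update training on the GPU
```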
There is also an example of NMF in the jaxopt repository that might be worth a look: https://jaxopt.github.io/stable/auto_examples/constrained/nmf.html