
GPU additions

LouisFaure opened this pull request 3 years ago • 15 comments

This PR aims to add more GPU functionality and to better integrate an existing one:

  • tl.draw_graph and tl.leiden can now both be GPU-accelerated using the RAPIDS framework.
  • in pp.neighbors, the 'rapids' method now allows more metrics, and the calculated distances no longer need to be square-rooted (see https://github.com/rapidsai/cuml/issues/1078#issuecomment-551284134). I have also slightly rearranged the code to integrate 'rapids' more tightly into the general neighbors processing, to ensure that the distance and connectivity results from 'rapids' and 'umap' are the same (a rough sketch of the cuml side follows this list).
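
To illustrate the cuml side, a minimal sketch on toy data (metric support varies by cuml version, so treat the exact options as assumptions rather than this PR's code):

import cupy as cp
from cuml.neighbors import NearestNeighbors

# Toy data: 10,000 cells x 50 components
X = cp.random.rand(10_000, 50).astype(cp.float32)

# cuml mirrors the scikit-learn API; 'metric' accepts more options now
nn = NearestNeighbors(n_neighbors=15, metric="cosine")
nn.fit(X)

# Per the linked cuml issue, euclidean distances are no longer returned
# squared, so no extra square root is needed downstream
distances, indices = nn.kneighbors(X)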

LouisFaure avatar Dec 04 '20 14:12 LouisFaure

I am taking the liberty of bumping this PR, as I have added GPU support for sc.pp.pca in the latest commit! (I had to force push it to fix a big typo in the commit title.)

LouisFaure avatar Mar 05 '21 16:03 LouisFaure

Dear @LouisFaure,

thank you very much for the high quality PR. A couple of questions:

  1. Do you think we should check whether a GPU is available whenever one of the GPU-accelerated methods is chosen? This would allow us to exit more gracefully if GPU support is requested but no GPU is found.
  2. I think we should homogenize the parameter names for method selection. Sometimes they are called 'method', sometimes 'flavor', and here you're also using 'device'. I myself am a fan of 'device' to switch between CPU and GPU implementations. However, it would then be unclear which method to use when several GPU-accelerated algorithms for a task are implemented. Do you have better ideas?

Zethson avatar Apr 07 '21 21:04 Zethson

Hey, just wanted to comment here on why it's taken so long for a review. I'm personally not comfortable with having significant code in the package that we cannot test on CI. We're looking into this, but it's been slow going since it looks like we have to set this up and manage it on our own.

As far as I can tell this process is:

  • Put money into the azure account
  • Set up containers
  • Configure pipelines to use these containers (not sure if we can use the standard Tasks on "self-hosted" containers)

@Zethson, since you're actually at the institute with the money you may have better luck moving the first step forward than I've had. Do you think you'd be able to look into this?

ivirshup avatar Apr 08 '21 05:04 ivirshup

@ivirshup fully agree. The CI must cover as much as possible. We actually have the same issue over at https://github.com/mlf-core/mlf-core

I can certainly get us the resources, but might not be able to implement it soonish. However, I would be interested in taking up this task. I'll create an issue and assign myself. But again, don't expect it soon.

Zethson avatar Apr 08 '21 08:04 Zethson

Thanks for your comments, I understand the struggle of implementing CI for GPU code!

@Zethson here are my answers to your questions:

  1. Instead of checking whether a GPU is available, I would rather check whether the relevant library is installed (depending on the method, this could be cugraph, cupy, or cuml). Since each of these libraries requires a GPU for both installation and usage, I think this check would suffice (a minimal sketch follows this list).
  2. I agree with moving to 'device' as much as possible. It should be easy to rename "method"/"flavor" to "device" for tl.draw_graph, tl.leiden, and tl.louvain, and use only "cpu"/"gpu" as choices, since these parameters would only have two options anyway. In most cases this would remove the name of the Python backend used, but one could instead mention it in the API docs. pp.neighbors is a bit trickier to handle: running it in GPU mode leads to a combination of distance/neighbor calculation on the gpu/cuml backend and then connectivity calculation on the cpu/umap backend. This could be solved if the cuml maintainers decide to allow the latter to be computed with cuml: https://github.com/rapidsai/cuml/issues/3123.
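
To make point 1 concrete, here is a minimal sketch of an import-based check combined with a 'device' switch (helper name and signature are hypothetical, not code from this PR):

from importlib.util import find_spec

def _has_gpu_libs(*libs: str) -> bool:
    # cugraph/cupy/cuml only install and run on GPU machines, so their
    # presence is a reasonable proxy for GPU availability
    return all(find_spec(lib) is not None for lib in libs)

def leiden(adata, device: str = "cpu"):  # illustrative signature only
    if device == "gpu":
        if not _has_gpu_libs("cugraph", "cupy"):
            raise ImportError(
                "device='gpu' requires the RAPIDS libraries cugraph and cupy"
            )
        ...  # GPU implementation via cugraph
    else:
        ...  # existing CPU implementation via leidenalg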

Since it will take time before CI can be implemented, I can just add the small, easy changes proposed in 2. and leave the PR open so you can decide what to do later!

LouisFaure avatar Apr 09 '21 13:04 LouisFaure

@LouisFaure Great! While I agree with your comments and suggestions, I think you can save yourself the time of implementing them for now, since they are likely to run into further merge conflicts down the road. The GPU CI is certainly weeks if not months away. As soon as it's ready I'll ping you again and we can get this PR ready.

Does this sound fine to you? Thanks again! Looking forward to GPU accelerated Scanpy.

Zethson avatar Apr 09 '21 16:04 Zethson

Diffusion maps and t-SNE also work with cupy and RAPIDS. Diffusion maps don't see a massive speedup, maybe 2X on modern GPUs; t-SNE sees a massive one. For larger AnnData objects it's important to set up GPU memory management for both the cupy and RAPIDS packages. This becomes very important on hardware with low VRAM.

Intron7 avatar Jan 31 '22 12:01 Intron7

For those interested in using the GPU-accelerated functions leiden and draw_graph_fa, I have made them available in the following gist: https://gist.github.com/LouisFaure/9302aa140d7989a25ed2a44b1ce741e8

I have also included in that code load_mtx, which reads and converts mtx files into AnnData using cudf. Tested on a 654 MB mtx containing 56621 cells x 20222 genes, I obtain a 13X speedup (using an RTX8000)! The rough shape of the approach is sketched below.
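
The gist holds the actual implementation; purely as an illustration of the idea, a cudf-based loader could look roughly like this (function and column names are hypothetical, not the gist's code):

import anndata as ad
import cudf
from cupyx.scipy.sparse import coo_matrix

def load_mtx_gpu(path):  # hypothetical name, mirroring the gist's load_mtx
    # The MatrixMarket body is whitespace-separated triplets (row col value);
    # '%' lines are comments and the first data line holds the dimensions
    df = cudf.read_csv(path, sep=" ", comment="%", header=None,
                       names=["row", "col", "value"])
    n_rows, n_cols = int(df["row"].iloc[0]), int(df["col"].iloc[0])
    entries = df.iloc[1:]
    mat = coo_matrix(
        (entries["value"].values,
         (entries["row"].values - 1, entries["col"].values - 1)),  # 1- to 0-based
        shape=(n_rows, n_cols),
    ).tocsr()
    # Copy back to host memory for the AnnData object
    return ad.AnnData(X=mat.get())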

[benchmark screenshot]

I expect this to scale even better with a higher number of cells. I could also add this wrapper to scanpy once CI is ready.

LouisFaure avatar Apr 22 '22 11:04 LouisFaure

Great @LouisFaure ! We have not forgotten your PR :)

Zethson avatar Apr 22 '22 11:04 Zethson

To me, having the metadata (obs and var) in VRAM only makes sense for large GPUs like your RTX8000 or A100. I wrote a small AnnData-like object (https://github.com/Intron7/rapids_singlecell) for preprocessing, and the benefit of having everything in VRAM is rather small. It's better to just move the indexes around.

Intron7 avatar Apr 22 '22 12:04 Intron7

@Intron7 I think the aim here is indeed not to keep anything in VRAM anyway. In the code/functions I propose here, the data is only transiently stored in device memory for calculation, and the resulting output is always transferred back to host once finished.

Moreover, I also think that loading a huge mtx file with a 4 GB GPU is not impossible. From what I understand, RMM should allow oversubscription into host RAM using the following commands:

import rmm
import cupy as cp

rmm.reinitialize(managed_memory=True)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

I had a look at your code, and GPU-accelerated preprocessing functions would also be welcome in scanpy in my opinion! I feel that scale and regress_out, for example, could benefit from such a speedup.

LouisFaure avatar Apr 22 '22 13:04 LouisFaure

Using RMM works, but only to a certain extent. As far as I understand it, you can oversubscribe VRAM to a maximum of 2X. If you go above that you'll get a memory allocation error.
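
For anyone hitting those allocation errors, the pool settings are worth experimenting with. A hedged sketch (the pool size is illustrative; exact limits depend on hardware and RMM version):

import rmm
import cupy as cp

rmm.reinitialize(
    managed_memory=True,        # allow allocations to spill into host RAM
    pool_allocator=True,        # pooled allocations reduce fragmentation
    initial_pool_size=2 << 30,  # start with a 2 GiB pool (illustrative)
)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)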

Intron7 avatar Apr 23 '22 10:04 Intron7

Here are some updates:

  • _fuzzy_simplicial_set from umap has been freshly exposed in the nightly version of cuml 22.06 (stable should land in the coming weeks), so I did a quick implementation and now have a fully accelerated sc.pp.neighbors (a rough sketch follows this list)!
  • I also used this opportunity to introduce a read_mtx_gpu function, which includes a dask_cudf backend for reading mtx files that do not fit in VRAM.
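
As a rough sketch of the neighbors idea (import path and signature per the cuml 22.06 nightlies as I understand them; treat the details as assumptions):

import cupy as cp
from cuml.neighbors import NearestNeighbors
from cuml.manifold.simpl_set import fuzzy_simplicial_set

X = cp.random.rand(1_000, 50).astype(cp.float32)
n_neighbors = 15

# kNN search on the GPU
nn = NearestNeighbors(n_neighbors=n_neighbors)
nn.fit(X)
knn_dists, knn_indices = nn.kneighbors(X)

# Connectivities (fuzzy simplicial set), now also on the GPU,
# reusing the precomputed kNN graph
connectivities = fuzzy_simplicial_set(
    X,
    n_neighbors,
    random_state=0,
    metric="euclidean",
    knn_indices=knn_indices,
    knn_dists=knn_dists,
)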

I performed a speed comparison on a 100,000-cell dataset, running a full simple pipeline from loading the mtx to UMAP/leiden:

[benchmark screenshot]

The GPU-accelerated code shows a 13X speedup compared to the CPU-based functions (tested on a 12-core CPU system)!

LouisFaure avatar May 26 '22 14:05 LouisFaure

Codecov Report

Merging #1533 (4d73886) into master (bd06cc3) will decrease coverage by 0.53%. The diff coverage is 30.28%.

@@            Coverage Diff             @@
##           master    #1533      +/-   ##
==========================================
- Coverage   71.82%   71.28%   -0.54%     
==========================================
  Files          98       98              
  Lines       11539    11647     +108     
==========================================
+ Hits         8288     8303      +15     
- Misses       3251     3344      +93     
Impacted Files                       Coverage Δ
scanpy/tools/_top_genes.py           0.00% <0.00%> (ø)
scanpy/readwrite.py                  64.69% <5.00%> (-2.97% ↓)
scanpy/tools/_draw_graph.py          60.60% <7.14%> (-10.83% ↓)
scanpy/tools/_embedding_density.py   55.55% <17.64%> (-11.12% ↓)
scanpy/neighbors/__init__.py         73.66% <27.77%> (-1.79% ↓)
scanpy/tools/_leiden.py              62.31% <41.66%> (-23.10% ↓)
scanpy/preprocessing/_pca.py         88.88% <50.00%> (-6.32% ↓)
scanpy/tools/_rank_genes_groups.py   93.23% <71.42%> (-0.97% ↓)
scanpy/__init__.py                   100.00% <100.00%> (ø)
scanpy/tools/_umap.py                74.24% <100.00%> (ø)
... and 1 more

codecov[bot] avatar May 26 '22 14:05 codecov[bot]

I created a PR to this branch to add GPU support for:

  • tl.rank_genes_groups with method='logreg'
  • tl.embedding_density
  • correlation_matrix
  • diffmap

I added .layers support for pp.pca, which helps with the "Pearson residuals" workflow. The default PCA solver for device GPU is now "auto". I also fixed a bug in tl.rank_genes_groups with method='logreg' when selecting groups (e.g. groups=["2","1","5"]) that is currently still in scanpy.
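
Not the PR's exact code, but to illustrate the logreg path on toy data: cuml ships a scikit-learn-like LogisticRegression whose per-class coefficients can rank genes per group:

import cupy as cp
from cuml.linear_model import LogisticRegression

# Toy stand-ins for an expression matrix (cells x genes) and cluster labels
X = cp.random.rand(5_000, 500).astype(cp.float32)
y = cp.random.randint(0, 5, size=5_000).astype(cp.float32)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# One coefficient vector per class; the logreg method ranks genes
# within each group by these scores
scores = clf.coef_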

Intron7 avatar Jul 20 '22 08:07 Intron7

Hey @LouisFaure,

During the hackathon last week we talked again about this PR. For the time being we will keep GPU computing functionality out of scanpy and in rapids-singlecell. RSC is now tested with a CI solution. If you want to contribute to rapids-singlecell I would be very happy. Missing functions like UMAP and neighbors are currently being updated and ported to RSC.

Intron7 avatar Aug 09 '23 13:08 Intron7

Great, I am looking at it now; the RSC package looks very nice!

LouisFaure avatar Sep 06 '23 09:09 LouisFaure