GPU additions
This PR aims to add more GPU functionality and to better integrate an existing one:

- `tl.draw_graph` and `tl.leiden` can now both be GPU accelerated using the RAPIDS framework.
- On `pp.neighbors`, the 'rapids' method now allows more metrics. The calculated distances no longer need to be square-rooted (see https://github.com/rapidsai/cuml/issues/1078#issuecomment-551284134). I have also slightly rearranged the code to integrate 'rapids' more fully into the general neighbors processing, to ensure that the distance and connectivity results of 'rapids' and 'umap' are the same.
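As a simplified numpy illustration of the square-root point (toy data, not scanpy's actual code): older cuml versions returned squared Euclidean distances from the brute-force kNN search, so a square root had to be applied as a post-processing step; with the fix linked above, that step can be dropped.

```python
import numpy as np

# Toy data: 5 points in 3D (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Squared Euclidean distances, as older cuml versions returned them.
diff = X[:, None, :] - X[None, :, :]
sq_dists = (diff ** 2).sum(axis=-1)

# The post-processing step scanpy previously had to apply:
dists = np.sqrt(sq_dists)

# Sanity check: rooted squared distances match direct Euclidean distances.
direct = np.linalg.norm(diff, axis=-1)
assert np.allclose(dists, direct)
```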
I am taking the liberty of bumping this PR, as I have added GPU support for `sc.pp.pca` in the latest commit!
(I had to force-push it to fix a big typo in the commit title.)
Dear @LouisFaure,
thank you very much for the high quality PR. A couple of questions:
- Do you think that we should check whether GPUs are available if any of the GPU accelerated methods is chosen? This would allow us to exit more gracefully if GPU support was requested but no GPU was found.
- I think that we should homogenize the parameter names for the method selection. Sometimes they are called 'method', sometimes 'flavor' and then you're also using 'device'. I myself am a fan of 'device' to switch between CPU and GPU implementations. However, then it would be unclear which method to use when several GPU accelerated algorithms for a task are implemented. Do you have better ideas?
Hey, just wanted to comment here on why it's taken so long for a review. I'm personally not comfortable with having significant code in the package that we cannot test on CI. We're looking into this, but it's been slow going since it looks like we have to set this up and manage it on our own.
As far as I can tell this process is:
- Put money into the azure account
- Set up containers
- Configure pipelines to use these containers (not sure if we can use the standard Tasks on "self hosted" containers)
@Zethson, since you're actually at the institute with the money you may have better luck moving the first step forward than I've had. Do you think you'd be able to look into this?
@ivirshup fully agree. The CI must cover as much as possible. We actually have the same issue over at https://github.com/mlf-core/mlf-core
I can certainly get us the resources, but might not be able to implement it soonish. However, I would be interested in taking up this task. I'll create an issue and assign myself. But again, don't expect it soon.
Thanks for your comments, I understand the struggle of implementing CI for GPU code!
@Zethson here are my answers to your questions:
- Instead of checking whether a GPU is available, I would rather suggest checking whether the related library is installed (depending on the method: cugraph, cupy or cuml). Since each of these libraries requires a GPU at installation and usage, I think using them as the check would suffice.
- I agree with moving to 'device' as much as possible. It should be easy to rename "method"/"flavor" to "device" for `tl.draw_graph`, `tl.leiden` and `tl.louvain`, and to use only "cpu"/"gpu" as choices, as these parameters would have only two options anyway. In most cases this would indeed hide the name of the Python backend used, but one could instead mention it in the API docs. `pp.neighbors` is a bit trickier to handle: running it in GPU mode leads to a combination of distance/neighbor calculation with the gpu/cuml backend and then connectivity calculation with the cpu/umap backend. This could be solved if the maintainers of cuml decide to allow the latter to be computed with cuml as well: https://github.com/rapidsai/cuml/issues/3123.
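The availability check described in the first point could be sketched like this (a minimal illustration with hypothetical helper names, not code from the PR):

```python
from importlib.util import find_spec


def gpu_backend_available(*libraries: str) -> bool:
    """Return True if all required RAPIDS libraries are importable.

    cuml, cupy and cugraph each require a GPU at installation and
    usage, so their presence is a reasonable proxy for GPU support.
    """
    return all(find_spec(lib) is not None for lib in libraries)


def check_device(device: str, required=("cuml", "cupy")) -> None:
    """Fail early with a clear message instead of a deep ImportError."""
    if device == "gpu" and not gpu_backend_available(*required):
        raise ImportError(
            f"device='gpu' requires the RAPIDS libraries "
            f"{', '.join(required)} to be installed."
        )
```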
Since it will take time before CI can be implemented, I can just add the easy small changes proposed in point 2 and leave the PR open so you can decide what to do later!
@LouisFaure Great! While I agree with your comments and suggestions, I think that for now you can save yourself the time of implementing them, since they are likely to run into further merge conflicts down the road. The GPU CI is certainly weeks if not months away. As soon as it's ready I will ping you again and we can get this PR ready.
Does this sound fine to you? Thanks again! Looking forward to GPU accelerated Scanpy.
Diffusion maps and t-SNE also work with cupy and RAPIDS. Diffusion maps don't see a massive speedup, maybe 2X on modern GPUs; t-SNE sees a massive one. For larger AnnData objects it's important to set up GPU memory management for both the cupy and RAPIDS packages. This becomes very important on hardware with low VRAM.
For those interested in using the GPU accelerated functions `leiden` and `draw_graph_fa`, I have made them available in the following gist: https://gist.github.com/LouisFaure/9302aa140d7989a25ed2a44b1ce741e8
I have also included `load_mtx` in that code, which reads and converts mtx files into AnnData using cudf. I tested it on a 654 MB mtx file containing 56,621 cells × 20,222 genes and obtained a 13X speedup (using an RTX 8000)!
I expect this to scale even better with higher numbers of cells. I could also add this wrapper to scanpy once CI is ready.
Great @LouisFaure ! We have not forgotten your PR :)
To me, having the metadata (obs and var) in VRAM only makes sense for large GPUs like your RTX 8000 or an A100. I wrote a small AnnData-like object (https://github.com/Intron7/rapids_singlecell) for preprocessing, and the benefit of having everything in VRAM is rather small. It's better to just move the indexes around.
@Intron7 I think the aim here is indeed not to keep anything in VRAM. In the code/functions I propose here, the data is only transiently stored in device memory for calculation, and the resulting output is always transferred back to host once finished.
Moreover, I also think that loading a huge mtx file with a 4 GB GPU is not impossible. From what I understood, rmm should allow oversubscription onto host RAM using the following commands:

```python
rmm.reinitialize(managed_memory=True)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
```
I had a look at your code, and GPU accelerated preprocessing functions would also be welcome in scanpy in my opinion! I feel that `scale` and `regress_out` could benefit from such a speedup, for example.
Using RMM works, but only to a certain extent. As far as I understand it, you can oversubscribe VRAM to a maximum of 2X; if you go above that you'll get a memory allocation error.
Here are some updates:

- `_fuzzy_simplicial_set` from umap has been freshly exposed in the nightly version of cuml 22.06 (stable should be there in the coming weeks), so I did a quick implementation and now have a fully accelerated `sc.pp.neighbors`!
- I also used this opportunity to introduce a `read_mtx_gpu` function, which includes a dask_cudf backend for reading mtx files that don't fit in VRAM.
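For intuition, the fuzzy simplicial set step turns a directed kNN membership graph into a symmetric connectivity graph via a fuzzy set union. A simplified scipy illustration of just that symmetrization step (not cuml's or umap's actual implementation, which additionally computes per-point bandwidths from the kNN distances):

```python
import numpy as np
from scipy import sparse

# Toy directed kNN membership graph: entry (i, j) is the strength in
# [0, 1) with which j is a neighbor of i (illustrative values only).
A = sparse.random(6, 6, density=0.3, random_state=0, format="csr")

# Fuzzy union used by UMAP to symmetrize memberships:
# P(i~j or j~i) = A + A.T - A ∘ A.T   (∘ = elementwise product)
connectivities = A + A.T - A.multiply(A.T)

# The result is symmetric, so it can serve as an undirected graph.
assert (abs(connectivities - connectivities.T) > 1e-12).nnz == 0
```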
I performed a speed comparison on a 100,000-cell dataset, running a full simple pipeline from loading the mtx file up to UMAP/leiden:
The GPU accelerated code shows a 13X speedup compared to the CPU based functions (tested on a 12-core CPU system)!
Codecov Report

Merging #1533 (4d73886) into master (bd06cc3) will decrease coverage by 0.53%. The diff coverage is 30.28%.

```
@@            Coverage Diff             @@
##           master    #1533      +/-   ##
==========================================
- Coverage   71.82%   71.28%   -0.54%
==========================================
  Files          98       98
  Lines       11539    11647     +108
==========================================
+ Hits         8288     8303      +15
- Misses       3251     3344      +93
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| scanpy/tools/_top_genes.py | 0.00% <0.00%> (ø) | |
| scanpy/readwrite.py | 64.69% <5.00%> (-2.97%) | :arrow_down: |
| scanpy/tools/_draw_graph.py | 60.60% <7.14%> (-10.83%) | :arrow_down: |
| scanpy/tools/_embedding_density.py | 55.55% <17.64%> (-11.12%) | :arrow_down: |
| scanpy/neighbors/__init__.py | 73.66% <27.77%> (-1.79%) | :arrow_down: |
| scanpy/tools/_leiden.py | 62.31% <41.66%> (-23.10%) | :arrow_down: |
| scanpy/preprocessing/_pca.py | 88.88% <50.00%> (-6.32%) | :arrow_down: |
| scanpy/tools/_rank_genes_groups.py | 93.23% <71.42%> (-0.97%) | :arrow_down: |
| scanpy/__init__.py | 100.00% <100.00%> (ø) | |
| scanpy/tools/_umap.py | 74.24% <100.00%> (ø) | |
| ... and 1 more | | |
I created a PR to this branch to add GPU support for:

- `tl.rank_genes_groups` with method='logreg'
- `tl.embedding_density`
- `correlation_matrix`
- `diffmap`

I added `.layers` support for `pp.pca`. This helps with the "Pearson residuals" workflow. The default PCA solver for device GPU is now "auto".

I also fixed a bug in `tl.rank_genes_groups` with method='logreg' when selecting groups (e.g. groups=["2", "1", "5"]) that is currently still in scanpy.
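As a rough illustration of what running PCA on a layer instead of `.X` amounts to (plain numpy, not the PR's implementation; the `layer` parameter name and the stand-in data structure are assumptions for this sketch):

```python
import numpy as np

# Minimal stand-in for an AnnData object: X plus named layers.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
layers = {"pearson_residuals": rng.normal(size=(50, 10))}


def pca(X, layers, n_comps=2, layer=None):
    """PCA via SVD on either X or a named layer (illustrative sketch)."""
    data = layers[layer] if layer is not None else X
    centered = data - data.mean(axis=0)
    # Singular vectors give the principal axes; project onto them.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_comps].T  # (n_obs, n_comps) embedding


X_pca = pca(X, layers, n_comps=2, layer="pearson_residuals")
```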
Hey @LouisFaure,
During the hackathon last week we talked about this PR again. For the time being we will keep GPU computing functionality out of scanpy and in rapids-singlecell. RSC is now tested with a CI solution. If you want to contribute to rapids-singlecell I would be very happy. Missing functions like UMAP and neighbors are currently being updated and also ported to RSC.
Great, I am looking at it now; the RSC package looks very nice!