Add parallel support to AddModuleScore

Open samuel-marsh opened this issue 2 years ago • 2 comments

Hi Seurat Team,

This PR follows the discussion in the previous PR #6348 and adds parallel processing support to AddModuleScore. My solution uses the future and future.apply packages, so it adds no additional dependencies.

A quick single test (I could run a more realistic benchmark with the bench package, but I don't feel it's really necessary): adding 100 scores of 100 genes each to an object with ~47,000 nuclei and ~28,000 features was about 1.7 times faster in parallel with 4 cores than sequentially (429.2 s vs. 251.9 s).

library(tidyverse)
library(Seurat)
library(scCustomize)
library(qs)
library(tictoc)
library(future)
library(future.apply)

test <- qread("marsh.qs")

# Extract all gene names from the object
all_genes_marsh <- rownames(test@assays$RNA)

# Create 100 random gene lists of 100 genes each
random_gene_sets_micro <- replicate(100, sample(all_genes_marsh, 100), simplify = FALSE)

tic()
test <- AddModuleScore(object = test, features = random_gene_sets_micro)
toc()
# 429.236 sec elapsed

# restart R

library(tidyverse)
library(Seurat)
library(scCustomize)
library(qs)
library(tictoc)
library(future)
library(future.apply)

plan("multisession", workers = 4)
options(future.globals.maxSize = 3000 * 1024^2)

test <- qread("marsh.qs")

# Extract all gene names from the object
all_genes_marsh <- rownames(test@assays$RNA)

# Create 100 random gene lists of 100 genes each
random_gene_sets_micro <- replicate(100, sample(all_genes_marsh, 100), simplify = FALSE)

tic()
test <- AddModuleScore(object = test, features = random_gene_sets_micro)
toc()
# 251.93 sec elapsed

One thing I did debate, and it's up to you, is whether to add an additional function parameter specifying parallel processing and have the internal function check something like this:

 if (nbrOfWorkers() > 1 && isTRUE(parallel))

The reason is that the gains from parallel processing with future for this function are most useful with large numbers of gene lists. If you are adding just a single gene list or a couple of them, it's probably slightly faster to run normally. I left this out of the PR to keep everything the same, but if you think it would be helpful I can easily add it.
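
For illustration, here is a minimal sketch of how such a check might dispatch between sequential and parallel execution. It is not the code in this PR: `score_apply` and its `parallel` argument are hypothetical names, and the toy input stands in for real gene lists.

```r
library(future)
library(future.apply)

# Hypothetical helper (not part of Seurat or this PR): use future_lapply()
# only when more than one worker is available AND the caller opted in.
score_apply <- function(X, FUN, parallel = TRUE, ...) {
  if (nbrOfWorkers() > 1 && isTRUE(parallel)) {
    future_lapply(X = X, FUN = FUN, ...)
  } else {
    lapply(X = X, FUN = FUN, ...)
  }
}

# Toy stand-in for a list of gene sets
gene_sets <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# With parallel = FALSE this always takes the plain lapply() branch
means <- score_apply(gene_sets, mean, parallel = FALSE)
```

With this shape, a user who set a multisession plan at the top of a script could still opt out for a single call by passing `parallel = FALSE`.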

Thanks! Sam

p.s. tagging author of original PR here so he can follow this @scottgigante

samuel-marsh, Aug 31 '22 14:08

Thanks @samuel-marsh for the quick work! I don't think parallel=TRUE is necessary because a user could always use plan(sequential).
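
A minimal sketch of that alternative, assuming a plan was set earlier in the script; the worker count of 2 is arbitrary and only for illustration:

```r
library(future)

# Plan set once at the top of a script
plan("multisession", workers = 2)
workers_parallel <- nbrOfWorkers()

# Switch back to sequential processing before a call where the
# parallel overhead is not worth it (e.g. a single gene list)
plan(sequential)
workers_sequential <- nbrOfWorkers()
```

Any future-based code run after `plan(sequential)` executes in the current R session until the plan is changed again.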

scottgigante, Aug 31 '22 15:08

Agreed, though some people set the plan at the top of a script and forget about it. Overall I lean towards not adding an extra parameter too.

samuel-marsh, Aug 31 '22 16:08