clustermq jobs stall when using Seurat::ScaleData
It appears that various Seurat commands that parallelize via the future R package (e.g., ScaleData) cause clustermq jobs to "stall" and never complete.
A reprex:
# libraries
library(Seurat)
library(SeuratData)
library(clustermq)
options(clustermq.scheduler = "slurm")
# load an example dataset (seurat object)
InstallData("pbmc3k")
pbmc = LoadData("pbmc3k", type = "pbmc3k.final")
# subset
num_cells = 100
num_features = 5000
pbmc = subset(pbmc, cells = colnames(pbmc)[1:num_cells], features = rownames(pbmc)[1:num_features])
# clustermq test function
fx = function(seurat_obj) {
  library(Seurat)
  seurat_obj %>% ScaleData()
}
ret = Q(fx, seurat_obj=c(pbmc), pkgs=c("Seurat"), n_jobs=1, job_size=1, memory=12 * 1024)
The SLURM cluster job stalls, but quickly completes successfully if seurat_obj %>% ScaleData() is replaced with simply return(seurat_obj).
If pbmc %>% ScaleData() is used instead of Q(fx, ...) (no SLURM job), then the ScaleData process completes in just a couple of seconds.
I'm guessing this issue is due to how ScaleData is parallelized.
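For context, ScaleData's parallelism is controlled by the future plan set in the session (a sketch based on Seurat's future vignette; the worker count here is arbitrary):

library(future)
plan("multisession", workers = 4)  # parallel; plan("sequential") disables it
pbmc = ScaleData(pbmc)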
The following does work:
# pre-process the dataset
seurat_preprocess = function(seurat_obj, dims=1:30) {
  # load libraries
  library(dplyr)
  library(Seurat)
  # pre-process seurat object
  seurat_obj %>%
    RunPCA() %>%
    FindNeighbors(dims=dims) %>%
    FindClusters() %>%
    RunUMAP(dims=dims)
}
pbmc = Q(
  seurat_preprocess,
  seurat_obj=c(pbmc %>% NormalizeData() %>% ScaleData() %>% FindVariableFeatures()),
  const=list(dims=1:30),
  pkgs=c("dplyr", "Seurat"),
  n_jobs=1,
  job_size=1,
  template = list(
    memory = 12 * 1024,
    log_file = "clustermq.log"
  )
)[[1]]
pbmc
...so it's likely due to how ScaleData parallelizes data processing.
The same "stalling" occurs with SCTransform, which I believe still uses ScaleData under the hood.
Any ideas?
sessionInfo
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS/LAPACK: /home/nickyoungblut/miniforge3/envs/seurat-v5/lib/libopenblasp-r0.3.27.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] clustermq_0.9.4 ggplot2_3.5.1 pbmc3k.SeuratData_3.1.4
[4] ifnb.SeuratData_3.1.0 SeuratData_0.2.2.9001 Seurat_5.1.0
[7] SeuratObject_5.0.2 sp_2.1-4 dplyr_1.1.4
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 jsonlite_1.8.8
[3] magrittr_2.0.3 spatstat.utils_3.0-4
[5] zlibbioc_1.48.0 vctrs_0.6.5
[7] ROCR_1.0-11 DelayedMatrixStats_1.24.0
[9] spatstat.explore_3.2-6 RCurl_1.98-1.14
[11] base64enc_0.1-3 S4Arrays_1.2.0
[13] htmltools_0.5.8.1 SparseArray_1.2.2
[15] sctransform_0.4.1 parallelly_1.37.1
[17] KernSmooth_2.23-24 htmlwidgets_1.6.4
[19] ica_1.0-3 plyr_1.8.9
[21] plotly_4.10.4 zoo_1.8-12
[23] uuid_1.2-0 igraph_2.0.3
[25] mime_0.12 lifecycle_1.0.4
[27] pkgconfig_2.0.3 Matrix_1.6-5
[29] R6_2.5.1 fastmap_1.2.0
[31] GenomeInfoDbData_1.2.11 MatrixGenerics_1.14.0
[33] fitdistrplus_1.1-11 future_1.33.2
[35] shiny_1.8.1.1 digest_0.6.35
[37] colorspace_2.1-0 patchwork_1.2.0
[39] S4Vectors_0.40.2 tensor_1.5
[41] RSpectra_0.16-1 irlba_2.3.5.1
[43] GenomicRanges_1.54.1 progressr_0.14.0
[45] fansi_1.0.6 spatstat.sparse_3.0-3
[47] httr_1.4.7 polyclip_1.10-6
[49] abind_1.4-5 compiler_4.3.3
[51] withr_3.0.0 fastDummies_1.7.3
[53] MASS_7.3-60 DelayedArray_0.28.0
[55] rappdirs_0.3.3 tools_4.3.3
[57] lmtest_0.9-40 httpuv_1.6.15
[59] future.apply_1.11.2 goftest_1.2-3
[61] glmGamPoi_1.14.0 glue_1.7.0
[63] nlme_3.1-164 promises_1.3.0
[65] grid_4.3.3 pbdZMQ_0.3-11
[67] Rtsne_0.17 cluster_2.1.6
[69] reshape2_1.4.4 generics_0.1.3
[71] gtable_0.3.5 spatstat.data_3.0-4
[73] tidyr_1.3.1 data.table_1.15.2
[75] XVector_0.42.0 utf8_1.2.4
[77] BiocGenerics_0.48.1 spatstat.geom_3.2-9
[79] RcppAnnoy_0.0.22 ggrepel_0.9.5
[81] RANN_2.6.1 pillar_1.9.0
[83] stringr_1.5.1 spam_2.10-0
[85] IRdisplay_1.1 RcppHNSW_0.6.0
[87] later_1.3.2 splines_4.3.3
[89] lattice_0.22-6 survival_3.6-4
[91] deldir_2.0-4 tidyselect_1.2.1
[93] miniUI_0.1.1.1 pbapply_1.7-2
[95] gridExtra_2.3 IRanges_2.36.0
[97] SummarizedExperiment_1.32.0 scattermore_1.2
[99] stats4_4.3.3 Biobase_2.62.0
[101] matrixStats_1.3.0 stringi_1.8.4
[103] lazyeval_0.2.2 evaluate_0.23
[105] codetools_0.2-20 tibble_3.2.1
[107] cli_3.6.2 uwot_0.1.16
[109] IRkernel_1.3.2 xtable_1.8-4
[111] reticulate_1.37.0 repr_1.1.7
[113] munsell_0.5.1 GenomeInfoDb_1.38.1
[115] Rcpp_1.0.12 globals_0.16.3
[117] spatstat.random_3.2-3 png_0.1-8
[119] parallel_4.3.3 dotCall64_1.1-1
[121] sparseMatrixStats_1.14.0 bitops_1.0-7
[123] listenv_0.9.1 viridisLite_0.4.2
[125] scales_1.3.0 ggridges_0.5.6
[127] leiden_0.4.3.1 purrr_1.0.2
[129] crayon_1.5.2 rlang_1.1.3
[131] cowplot_1.1.3
I should note that including library(future); plan("sequential") in the function called by Q does not solve the stalling problem.
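For example, a variant of fx like the following (a sketch of what I tried, with the plan set inside the worker function) still stalls:

fx = function(seurat_obj) {
  library(Seurat)
  library(future)
  plan("sequential")  # force single-threaded execution on the worker
  ScaleData(seurat_obj)
}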
Your reprex works on my system. If I run Q with log_worker=T, I get the following error in the console (and no additional error in the worker log):
Error: 1/1 jobs failed (0 warnings). Stopping. (Error #1) could not find function "%>%"
If I then load the dplyr package in addition to Seurat, this runs without errors (with the multiprocess backend):
Q(fx, seurat_obj=c(pbmc), pkgs=c("Seurat", "dplyr"), n_jobs=1, job_size=1, memory=12 * 1024)
Thanks for the quick feedback, and for pointing out the issue with my reprex. While stripping down my code, I accidentally removed dplyr.
I've done some more testing, and it appears that the issue is with loading Seurat or SeuratData. If I simply run:
library(clustermq)
options(clustermq.scheduler = "slurm")
fx = function(x) {
  library(Seurat)
  return(x * 2)
}
Q(fx, x=1:3, n_jobs=3, pkgs=c("Seurat"), job_size=4, memory=8192)
The job takes 5 minutes. The same goes if I swap Seurat for SeuratData. However, the job only takes 15-20 seconds if I swap Seurat for dplyr.
Any ideas why the jobs take so long just to load R packages? Maybe this just reflects generally poor performance, with Seurat requiring more compute (and likely more I/O) to load than dplyr. I/O is often a bottleneck on my HPC, so I wonder if that is the cause.
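One way to isolate the package-loading time on a worker would be a sketch like the following (it times only library(), with no Seurat computation):

time_load = function(x) {
  system.time(library(Seurat))[["elapsed"]]  # seconds spent attaching Seurat
}
Q(time_load, x = 1, n_jobs = 1, memory = 8192)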
I'm also running clustermq from a Jupyter notebook running an R kernel. The conda env variables seem to be transferred to the clustermq jobs correctly, but maybe this is causing the issue.
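As a quick sanity check of the environment the workers actually see (the variable names here are just examples):

check_env = function(x) {
  Sys.getenv(c("PATH", "R_LIBS", "CONDA_PREFIX"))  # report selected env vars from the worker
}
Q(check_env, x = 1, n_jobs = 1, memory = 1024)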
The following does work, but it takes ~5.5 minutes:
seurat_preprocess = function(seurat_obj, dims=1:30) {
  # load libraries
  library(dplyr)
  library(Seurat)
  # pre-process seurat object
  seurat_obj %>%
    SCTransform() %>%
    RunPCA() %>%
    FindNeighbors(dims=dims) %>%
    FindClusters() %>%
    RunUMAP(dims=dims)
}
pbmc = Q(
  seurat_preprocess,
  seurat_obj=c(pbmc),
  const=list(dims=1:30),
  pkgs=c("dplyr", "Seurat"),
  n_jobs=1,
  job_size=1,
  log_worker=TRUE,
  template = list(
    memory = 8 * 1024,
    log_file = "clustermq.log"
  )
)[[1]]
pbmc
Based on the logs, it appears that almost all of that time was spent loading the Seurat package.
How long does loading the package in an interactive job take?
In any case, we're just executing the R function code, so I don't think clustermq can do anything to make the package loading faster.
"How long does loading the package in an interactive job take?"
Just a couple of seconds.
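For reference, that was measured in a fresh interactive R session on a compute node:

system.time(library(Seurat))  # elapsed: a couple of seconds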
There's clearly an issue with the clustermq jobs: loading Seurat takes only a couple of seconds in an interactive session on a cluster node, but many minutes when the same code is submitted via clustermq as an sbatch job.
It's likely something odd about our Slurm setup. I'll work with the cluster admin. I'd appreciate any thoughts on the matter, if you have them.
The cluster admin and I have done some more testing. Based on profiling the "hanging" job via gdb, it appears that it's in a sleep loop:
#0 0x00007f8562bd97f8 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7f85412ee0a0 <cli.tick_ts>, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
sc_cancel_oldtype = 1
sc_ret = <optimized out>
r = <optimized out>
#1 0x00007f8562bde677 in __GI___nanosleep (req=<optimized out>, rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
ret = <optimized out>
#2 0x00007f85412d09b3 in clic_thread_func () from target:/home/nickyoungblut/miniforge3/envs/seurat-v5/lib/R/library/cli/libs/cli.so
No symbol table info available.
#3 0x00007f8562b88ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
ret = <optimized out>
pd = <optimized out>
out = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140734697572832, -576173798191225971, 140210300778048, 25, 140210863638480, 140734697573184, 507647466203292557, 507712184217674637}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
#4 0x00007f8562c1a850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.
Thread 6 (Thread 0x7f8541f54640 (LWP 985544) "ZMQbg/IO/0"):
#0 0x00007f8562c19e2e in epoll_wait (epfd=10, events=0x7f8541f440e0, maxevents=256, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
sc_ret = -4
sc_cancel_oldtype = 0
sc_ret = <optimized out>
#1 0x00007f8542ef2709 in zmq::epoll_t::loop() () from target:/home/nickyoungblut/miniforge3/envs/seurat-v5/lib/R/library/clustermq/libs/../../../../libzmq.so.5
... but note that the process is using ~80% CPU:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
nickyou+ 985519 80.1 0.0 1721072 960524 ? Rl 17:53 12:13 /home/nickyoungblut/miniforge3/envs/seurat-v5/lib/R/bin/exec/R --no-save --no-restore -e clustermq:::worker("tcp://GPU36DC:9592")
I've investigated a bit and can confirm that calling SCTransform on the pbmc dataset hangs with clustermq, both with the slurm scheduler and with local execution. I have not seen any issues with package loading on my machine.
The minimal way to reproduce the hang (from the clustermq side) is:
library(Seurat)
library(SeuratData)
pbmc = LoadData("pbmc3k", type = "pbmc3k.final")
system.time(SCTransform(pbmc)) # bit under 3 minutes
# the below hangs:
withCallingHandlers(
  withRestarts(
    SCTransform(pbmc),
    returnError = function(e) structure(e, class="error")
  ),
  error = function(e) {
    invokeRestart("returnError", conditionMessage(e))
  }
)
The calling handlers are part of work_chunk. However, it seems to me that the above example should work with Seurat (calling any function other than SCTransform works), so at this point I think this may be a bug in their package.
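For comparison, the same wrapper around an ordinary function returns normally, e.g.:

withCallingHandlers(
  withRestarts(
    sqrt(2),  # any call other than SCTransform completes fine here
    returnError = function(e) structure(e, class = "error")
  ),
  error = function(e) invokeRestart("returnError", conditionMessage(e))
)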
I've added a comment on the linked Seurat issue.
I'm closing this here, please reopen if you think there is anything we should change from the clustermq side.