EWCE
generate_celltype_data struggling with a large matrix: "too many elements (>= 2^31) selected along dimension 1 of array"
1. Bug description
When running generate_celltype_data() on a gene-by-cell matrix of 34807 x 786896, I get a large-array error. My input object (exp) is a sparse DelayedMatrix object of type "double".
Console output
100 core(s) assigned as workers (156 reserved).
Converting to sparse matrix.
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'drop': too many elements (>= 2^31) selected along dimension 1 of array
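The "too many elements (>= 2^31)" message reflects a 32-bit limit on how many elements some array operations can address at once: this matrix has roughly 2.7e10 elements, far more than 2^31 - 1 (about 2.1e9). A quick sanity check in base R:

```r
# Element count of the reported matrix vs. the 32-bit integer limit
n_genes <- 34807
n_cells <- 786896
n_elements <- n_genes * n_cells  # ~2.74e10

n_elements > 2^31  # TRUE: the matrix exceeds the limit named in the error
```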
Session info
R version 4.3.0 (2023-04-21)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS/LAPACK: /home/drihome/MRAineFairbrotherBrowne/miniconda3/envs/R/lib/libopenblasp-r0.3.23.so; LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] scuttle_1.10.1 DelayedMatrixStats_1.22.1
[3] HDF5Array_1.28.1 rhdf5_2.44.0
[5] DelayedArray_0.26.6 S4Arrays_1.0.4
[7] Matrix_1.6-1 DESeq2_1.40.2
[9] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
[11] Biobase_2.60.0 GenomicRanges_1.52.0
[13] GenomeInfoDb_1.36.1 IRanges_2.34.1
[15] S4Vectors_0.38.1 MatrixGenerics_1.12.2
[17] matrixStats_1.0.0 lubridate_1.9.2
[19] forcats_1.0.0 stringr_1.5.0
[21] dplyr_1.1.3 purrr_1.0.1
[23] readr_2.1.4 tidyr_1.3.0
[25] tibble_3.2.1 ggplot2_3.4.3
[27] tidyverse_2.0.0 ewceData_1.8.0
[29] ExperimentHub_2.8.1 AnnotationHub_3.8.0
[31] BiocFileCache_2.8.0 dbplyr_2.3.3
[33] BiocGenerics_0.46.0 EWCE_1.8.2
[35] RNOmni_1.0.1.2
loaded via a namespace (and not attached):
[1] jsonlite_1.8.7 magrittr_2.0.3
[3] fs_1.6.3 zlibbioc_1.46.0
[5] vctrs_0.6.3 memoise_2.0.1
[7] RCurl_1.98-1.12 ggtree_3.8.2
[9] rstatix_0.7.2 htmltools_0.5.5
[11] curl_5.0.1 broom_1.0.5
[13] Rhdf5lib_1.22.0 gridGraphics_0.5-1
[15] basilisk_1.12.1 htmlwidgets_1.6.2
[17] HGNChelper_0.8.1 plyr_1.8.8
[19] plotly_4.10.3 cachem_1.0.8
[21] mime_0.12 lifecycle_1.0.3
[23] pkgconfig_2.0.3 R6_2.5.1
[25] fastmap_1.1.1 GenomeInfoDbData_1.2.10
[27] shiny_1.7.4.1 digest_0.6.33
[29] aplot_0.2.2 colorspace_2.1-0
[31] patchwork_1.1.2 AnnotationDbi_1.62.2
[33] rprojroot_2.0.3 grr_0.9.5
[35] RSQLite_2.3.1 ggpubr_0.6.0
[37] beachmat_2.16.0 filelock_1.0.2
[39] timechange_0.2.0 fansi_1.0.4
[41] httr_1.4.6 abind_1.4-5
[43] compiler_4.3.0 here_1.0.1
[45] withr_2.5.0 bit64_4.0.5
[47] backports_1.4.1 BiocParallel_1.34.2
[49] orthogene_1.6.1 carData_3.0-5
[51] DBI_1.1.3 homologene_1.4.68.19.3.27
[53] ggsignif_0.6.4 rappdirs_0.3.3
[55] tools_4.3.0 ape_5.7-1
[57] interactiveDisplayBase_1.38.0 httpuv_1.6.11
[59] glue_1.6.2 rhdf5filters_1.12.1
[61] nlme_3.1-163 promises_1.2.0.1
[63] grid_4.3.0 anndata_0.7.5.6
[65] reshape2_1.4.4 generics_0.1.3
[67] gtable_0.3.4 tzdb_0.4.0
[69] hms_1.1.3 data.table_1.14.8
[71] car_3.1-2 utf8_1.2.3
[73] XVector_0.40.0 BiocVersion_3.17.1
[75] pillar_1.9.0 yulab.utils_0.1.0
[77] babelgene_22.9 limma_3.56.2
[79] later_1.3.1 treeio_1.24.3
[81] lattice_0.21-8 bit_4.0.5
[83] tidyselect_1.2.0 locfit_1.5-9.8
[85] Biostrings_2.68.1 stringi_1.7.12
[87] lazyeval_0.2.2 ggfun_0.1.3
[89] yaml_2.3.7 codetools_0.2-19
[91] BiocManager_1.30.21.1 ggplotify_0.1.2
[93] cli_3.6.1 xtable_1.8-4
[95] reticulate_1.30 munsell_0.5.0
[97] zellkonverter_1.10.1 Rcpp_1.0.11
[99] dir.expiry_1.8.0 gprofiler2_0.2.2
[101] png_0.1-8 ellipsis_0.3.2
[103] assertthat_0.2.1 blob_1.2.4
[105] basilisk.utils_1.12.1 sparseMatrixStats_1.12.2
[107] bitops_1.0-7 viridisLite_0.4.2
[109] tidytree_0.4.5 scales_1.2.1
[111] crayon_1.5.2 rlang_1.1.1
[113] KEGGREST_1.40.0
Hey! Yep, definitely a size issue. Are you running the function with as_DelayedArray = TRUE? By default it is FALSE, so this might help. It's hard to debug exactly where you are running into the size issue without having access to the data.
Alan.
@ainefairbrother it would also be very helpful to know what kind of machine you're running this on (storage, memory, is it an computing cluster? etc.). Could you give us more details on this?
Something that can help is to reduce the number of cores being used. Due to how R often deals with parallelisation (copying the entire environment) more cores isn't always better as it can lead to an explosion of memory usage. This will take some troubleshooting on your end to figure out what your machine can handle.
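Combining the two suggestions above might look something like the sketch below (the no_cores and as_DelayedArray arguments are as named in the EWCE documentation; exp, annotLevels, and the group name stand in for your own objects):

```r
library(EWCE)

# Sketch: keep the data as a DelayedArray and start with few workers,
# since each extra core can copy a large chunk of the environment.
ctd_files <- generate_celltype_data(
  exp = exp,                  # your 34807 x 786896 matrix
  annotLevels = annotLevels,  # your list of annotation vectors
  groupName = "myDataset",    # hypothetical name
  as_DelayedArray = TRUE,
  no_cores = 4                # increase only if memory allows
)
```

If this still blows up, halving no_cores (down to 1) is a cheap next experiment before digging further.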
This is a server with ~133TB storage, 1TB RAM
Architecture and CPU info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Linux kernel version: 5.15.0-73-generic
Noted, thanks. I have tried setting as_DelayedArray to TRUE and reducing the number of cores. I get a different error:
5 core(s) assigned as workers (251 reserved).
Error in .from_Array_to_matrix(x, ...) : unused argument ("matrix")
You will need to create and share a subset of your data that replicates the issue if you want us to debug this; the function works with as_DelayedArray = TRUE on other datasets:
> # Load the single cell data
> cortex_mrna <- ewceData::cortex_mrna()
see ?ewceData and browseVignettes('ewceData') for documentation
loading from cache
> # Use only a subset to keep the example quick
> expData <- cortex_mrna$exp#[1:100, ]
> l1 <- cortex_mrna$annot$level1class
> l2 <- cortex_mrna$annot$level2class
> annotLevels <- list(l1 = l1, l2 = l2)
> fNames_ALLCELLS <- EWCE::generate_celltype_data(
+ exp = expData,
+ annotLevels = annotLevels,
+ groupName = "allKImouse",
+ as_DelayedArray = TRUE
+ )
1 core(s) assigned as workers (10 reserved).
Converting to sparse matrix.
Converting to DelayedArray.
+ Calculating normalized mean expression.
Converting to sparse matrix.
Converting to sparse matrix.
+ Calculating normalized specificity.
Converting to sparse matrix.
Converting to sparse matrix.
Converting to sparse matrix.
Converting to sparse matrix.
+ Saving results ==> /tmp/Rtmpc3QW1s/ctd_allKImouse.rda
Subsetting on the number of genes and cells should be relatively quick to do.
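One way to build such a subset is to sample genes and cells, subsetting the annotation vectors by the same cell indices so annotLevels stays aligned with the columns of exp. A self-contained sketch with a dummy sparse matrix standing in for the real data (the object names and sizes are illustrative):

```r
library(Matrix)

# Dummy stand-ins for the real exp matrix and annotations (sketch)
exp <- rsparsematrix(1000, 5000, density = 0.01)
annotLevels <- list(l1 = sample(c("A", "B"), 5000, replace = TRUE))

set.seed(42)
keep_genes <- sample(nrow(exp), 200)
keep_cells <- sample(ncol(exp), 500)

# Subset the matrix, then the annotations with the SAME cell indices
exp_sub <- exp[keep_genes, keep_cells]
annotLevels_sub <- lapply(annotLevels, function(a) a[keep_cells])
```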
It looks like this is coming from somewhere within DelayedArray, but can't quite pinpoint where or why yet.
Also, @ainefairbrother, can you confirm that your exp DelayedArray object is backed at the time you're running the function? I.e. saved to disk, with each chunk read into memory only as needed. That will ensure you're fully benefitting from the DelayedArray capabilities.
This can be done using this set of functions: https://rdrr.io/bioc/HDF5Array/man/saveHDF5SummarizedExperiment.html
See here for more info: https://www.bioconductor.org/packages/release/bioc/vignettes/DelayedArray/inst/doc/02-Implementing_a_backend.html
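Following the linked HDF5Array documentation, making the data file-backed might look roughly like this (a sketch; se, exp, and the directory name are placeholders for your own objects):

```r
library(HDF5Array)
library(SummarizedExperiment)

# Sketch: wrap the matrix, write it to disk, and reload a file-backed copy
se <- SummarizedExperiment(assays = list(counts = exp))
se_backed <- saveHDF5SummarizedExperiment(se, dir = "my_h5_dir", replace = TRUE)

# The reloaded assay is an HDF5-backed DelayedMatrix; chunks are read on demand
exp_backed <- assay(se_backed)
is(exp_backed, "DelayedMatrix")
```

For a bare matrix (no SummarizedExperiment wrapper), HDF5Array::writeHDF5Array() offers a similar route.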
Hi Brian - yes, can confirm that it's backed at the time of running the function.