scoreMatrixBin allocates 415 GB of RAM (1000 bins, 55,000 regions)

Open balwierz opened this issue 5 years ago • 2 comments

I am analysing some low-resolution data and need 1000 bins of 1 kb each (covering 1 Mb in total) per region. My estimate for the size of such an object is roughly 1000 * 55000 * sizeof(double) = 419.6 MB, which is close to the size reported by R below (421 Mb). However, during the ScoreMatrixBin call sm = ScoreMatrixBin(target=track, windows=windows, bin.num=1000, strand.aware=TRUE, weight.col="score"), 415 GB of memory is allocated, and it is not released after gc(). I am not sure whether this is a genomation issue or an R issue.

                       Type      Size PrettySize   Rows Columns
sm              ScoreMatrix 441432328     421 Mb  54741    1000
track               GRanges   1712096     1.6 Mb 106205      NA


  gc()
           used  (Mb) gc trigger     (Mb)  max used     (Mb)
Ncells  8096903 432.5  1.177e+07    628.4 1.177e+07    628.4
Vcells 71511568 545.6  7.377e+10 562797.2 5.571e+10 425044.0
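
For reference, the figures above can be sanity-checked with a few lines of base R. This is only a back-of-the-envelope sketch: it assumes, as documented in ?gc, that each Vcell is 8 bytes and that the "(Mb)" columns are in units of 2^20 bytes. The variable names are made up for illustration.

```r
# Expected size of a 55,000 x 1000 matrix of doubles (8 bytes each), in MiB.
bytes_per_double <- 8
mib <- 2^20
expected_mib <- 1000 * 55000 * bytes_per_double / mib

# Peak usage taken from the "max used" Vcells column of the gc() output above.
peak_vcells <- 5.571e10
peak_mib <- peak_vcells * bytes_per_double / mib

# How many times larger the peak was than the expected result.
ratio <- peak_mib / expected_mib
cat(sprintf("expected: %.1f MiB, peak: %.0f MiB, ratio: %.0fx\n",
            expected_mib, peak_mib, ratio))
```

By this estimate the peak allocation is roughly a thousand times larger than the matrix that is finally returned.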

  sessionInfo()
Error in system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE) : 
  cannot popen '/usr/bin/which 'uname' 2>/dev/null', probable reason 'Cannot allocate memory'

in shell

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                       
 2570 user      20   0  416,3g 415,4g  59208 S   0,0  82,5  15:06.29 rsession  

After R restart:

  sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux buster/sid

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C               LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8    
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8    LC_PAPER=en_DK.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] genomation_1.14.0   BiocParallel_1.16.6

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1                  lattice_0.20-38             prettyunits_1.0.2           Rsamtools_1.34.1           
 [5] Biostrings_2.50.2           assertthat_0.2.1            digest_0.6.18               gridBase_0.4-7             
 [9] R6_2.4.0                    GenomeInfoDb_1.18.2         plyr_1.8.4                  stats4_3.5.2               
[13] RSQLite_2.1.1               httr_1.4.0                  ggplot2_3.1.1               pillar_1.4.0               
[17] zlibbioc_1.28.0             rlang_0.3.4                 GenomicFeatures_1.34.8      progress_1.2.2             
[21] lazyeval_0.2.2              rstudioapi_0.10             data.table_1.12.2           blob_1.1.1                 
[25] S4Vectors_0.20.1            Matrix_1.2-17               readr_1.3.1                 stringr_1.4.0              
[29] RCurl_1.95-4.12             bit_1.1-14                  biomaRt_2.38.0              munsell_0.5.0              
[33] DelayedArray_0.8.0          compiler_3.5.2              rtracklayer_1.42.2          pkgconfig_2.0.2            
[37] BiocGenerics_0.28.0         tidyselect_0.2.5            SummarizedExperiment_1.12.0 tibble_2.1.1               
[41] GenomeInfoDbData_1.2.0      IRanges_2.16.0              matrixStats_0.54.0          XML_3.98-1.19              
[45] crayon_1.3.4                dplyr_0.8.1                 GenomicAlignments_1.18.1    bitops_1.0-6               
[49] gtable_0.3.0                DBI_1.0.0                   magrittr_1.5                scales_1.0.0               
[53] KernSmooth_2.23-15          stringi_1.4.3               impute_1.56.0               reshape2_1.4.3             
[57] XVector_0.22.0              tools_3.5.2                 bit64_0.9-7                 BSgenome_1.50.0            
[61] Biobase_2.42.0              glue_1.3.1                  seqPattern_1.7.0            purrr_0.3.2                
[65] hms_0.4.2                   plotrix_3.7-5               parallel_3.5.2              AnnotationDbi_1.44.0       
[69] colorspace_1.4-1            GenomicRanges_1.34.0        memoise_1.1.0     

balwierz · May 18 '19 16:05

Could you send a reproducible example? It doesn't have to be the full analysis; it just has to reproduce the memory problem. We will have to do some memory profiling to see where the problem is.

al2na · May 18 '19 17:05

library(genomation)
library(BSgenome.Mmusculus.UCSC.mm9)
library("TxDb.Mmusculus.UCSC.mm9.knownGene")

# Tile the mm9 genome into 25 kb bins and give each bin a random score
track = unlist(GenomicRanges::tileGenome(tilewidth=25000, seqlengths=seqlengths(Mmusculus)))
track$score = rnorm(length(track))
# 1 Mb windows centred on each TSS, each split into 1000 bins of 1 kb
sm = ScoreMatrixBin(target=track, windows=promoters(TxDb.Mmusculus.UCSC.mm9.knownGene, upstream=500000, downstream=500000), bin.num=1000, strand.aware=TRUE, weight.col="score")

You might want to scale the problem down if you are not running on a machine with 512 GB+ of RAM.
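
One way to scale it down is sketched below. It is not the original data: it assumes a single synthetic 100 Mb chromosome and 50 non-overlapping 1 Mb windows with random strands (all names and sizes here are hypothetical), and it has not been checked that it triggers the same blow-up. It resets the gc() counters before the call so that the "max used" column reflects the peak during ScoreMatrixBin only.

```r
library(GenomicRanges)
library(genomation)

set.seed(1)

# Hypothetical toy genome: one 100 Mb chromosome tiled into 25 kb scored bins.
chr_len <- 1e8
track <- unlist(tileGenome(seqlengths = c(chr1 = chr_len), tilewidth = 25000))
track$score <- rnorm(length(track))

# 50 non-overlapping 1 Mb windows with random strands (stand-ins for promoters).
starts <- seq(1, chr_len - 1e6, by = 2e6)
windows <- GRanges("chr1",
                   IRanges(start = starts, width = 1e6),
                   strand = sample(c("+", "-"), length(starts), replace = TRUE))

# Reset the "max used" counters, run the call, then read the peak back.
invisible(gc(reset = TRUE))
sm <- ScoreMatrixBin(target = track, windows = windows, bin.num = 1000,
                     strand.aware = TRUE, weight.col = "score")
mem <- gc()

# The last column of gc() is "max used" in Mb; sum over Ncells and Vcells.
peak_mb <- sum(mem[, ncol(mem)])
result_mb <- as.numeric(object.size(sm)) / 2^20
cat(sprintf("result: %.2f Mb, peak since reset: %.1f Mb\n", result_mb, peak_mb))
```

Increasing the number of windows step by step and watching how the peak grows relative to the result size should show whether the overhead scales with the number of regions, the number of bins, or both.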

balwierz · May 18 '19 17:05