methylKit icon indicating copy to clipboard operation
methylKit copied to clipboard

Unite myobj with hundreds of samples on cluster (killed error)

Open lz870718 opened this issue 4 years ago • 15 comments

Hi Dr. Akalin,

Thank you for providing such great tool. We just encountered some errors when unite myobj with hundreds of samples. I even ran it with the maximum memory allowed for individual use. Still can't work it out.

Is there a way I can first unite a couple sample into one methylBaseDB, then merge it later with other united methylBaseDBs to create a meta methylBaseDB for downstream analysis. Or is there any other way to get around with it.

PS: I run unite function with smaller sample size (5-20) without any problem.

I really appreciate your time and help!

Zhi

lz870718 avatar Jul 17 '19 14:07 lz870718

Hi Zhi,

Thank you for using the package! I have a hacky solution to your problem, that could be worth integrating it into methylKit.

library(methylKit)

## specify example files
file.list=list( system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
                system.file("extdata", "test2.myCpG.txt", package = "methylKit"),
                system.file("extdata", "control1.myCpG.txt", package = "methylKit"),
                system.file("extdata", "control2.myCpG.txt", package = "methylKit") )

## read into methylRawDB
mydblist = suppressMessages(methRead( file.list,
                                      sample.id=list("t1","t2","c1","c2"),assembly="hg18",
                                      pipeline="amp",treatment=c(1,1,0,0),dbtype = "tabix",dbdir="methylDB"))

## make larger object
mydblist <- methylRawListDB(mydblist[[1]],mydblist[[2]],mydblist[[3]],mydblist[[4]],
                       mydblist[[1]],mydblist[[2]],mydblist[[3]],mydblist[[4]],
                       mydblist[[1]],mydblist[[2]],mydblist[[3]],mydblist[[4]],
                       mydblist[[1]],mydblist[[2]],mydblist[[3]],mydblist[[4]],
                       treatment = c(rep(0,8),rep(1,8)))

## use profvis to monitor memory usage
library(profvis)

profvis::profvis({
  ## unite whole object
  suppressMessages(methidhDB <- unite(mydblist))

  ## unite splits
  suppressMessages(methidhDB1 <- unite(methylRawListDB(mydblist[1:10],treatment = getTreatment(mydblist)[1:10])))
  suppressMessages(methidhDB2 <- unite(methylRawListDB(mydblist[11:16],treatment = getTreatment(mydblist)[11:16])))
  
  ## merge splits afterwards
  dbpath <- gsub(".tbi","",methylKit::mergeTabix(tabixList = c(getDBPath(methidhDB1),getDBPath(methidhDB2)),dir = "methylDB/",filename = "newMethylBaseDB"))
  
  ## load merged splits
  mybasedb <- readMethylBaseDB(dbpath = dbpath,dbtype = mydblist[[1]]@dbtype,
                               sample.ids = getSampleID(mydblist),assembly = mydblist[[1]]@assembly,
                               context = mydblist[[1]]@context,resolution = mydblist[[1]]@resolution,
                               treatment = mydblist@treatment,destranded = FALSE)
  
  ## check for equality
  all.equal(mybasedb,methidhDB)
  identical(getData(mybasedb),getData(methidhDB))
})

Please note that I did not do extensive testing, I only checked this with our example data, but I would really like to hear whether this solves your problem.

Best, Alex

alexg9010 avatar Jul 25 '19 10:07 alexg9010

Hi Alex,

Thank you so much for your reply. I just tried the solution provided. However, I encounter below error message

Error: 'mergeTabix' is not an exported object from 'namespace:methylKit'

Could you help me have a look at it and see if I made any mistake? Thanks!

PS My sessionInfo() is :

R version 3.5.1 (2018-07-02) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux

Matrix products: default BLAS: /gpfs/share/apps/R/3.5.1/lib64/R/lib/libRblas.so LAPACK: /gpfs/share/apps/R/3.5.1/lib64/R/lib/libRlapack.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] methylKit_1.8.1 GenomicRanges_1.34.0 GenomeInfoDb_1.18.2 [4] IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0

loaded via a namespace (and not attached): [1] SummarizedExperiment_1.12.0 qvalue_2.14.1
[3] gtools_3.8.1 tidyselect_0.2.5
[5] reshape2_1.4.3 purrr_0.3.2
[7] splines_3.5.1 lattice_0.20-38
[9] colorspace_1.4-1 rtracklayer_1.42.2
[11] XML_3.98-1.20 rlang_0.3.4
[13] R.oo_1.22.0 pillar_1.4.1
[15] glue_1.3.1 R.utils_2.8.0
[17] BiocParallel_1.16.6 fastseg_1.28.0
[19] matrixStats_0.54.0 GenomeInfoDbData_1.2.0
[21] plyr_1.8.4 stringr_1.4.0
[23] zlibbioc_1.28.0 Biostrings_2.50.2
[25] munsell_0.5.0 gtable_0.3.0
[27] R.methodsS3_1.7.1 coda_0.19-2
[29] Biobase_2.42.0 Rcpp_1.0.1
[31] scales_1.0.0 limma_3.38.3
[33] DelayedArray_0.8.0 XVector_0.22.0
[35] Rsamtools_1.34.1 ggplot2_3.1.1
[37] stringi_1.4.3 dplyr_0.8.1
[39] numDeriv_2016.8-1.1 grid_3.5.1
[41] bitops_1.0-6 tools_3.5.1
[43] bbmle_1.0.20 magrittr_1.5
[45] lazyeval_0.2.2 RCurl_1.95-4.12
[47] tibble_2.1.3 crayon_1.3.4
[49] pkgconfig_2.0.2 Matrix_1.2-17
[51] MASS_7.3-51.4 data.table_1.12.2
[53] assertthat_0.2.1 emdbook_1.3.11
[55] R6_2.4.0 mclust_5.4.3
[57] GenomicAlignments_1.18.1 compiler_3.5.1

lz870718 avatar Jul 27 '19 03:07 lz870718

Hi Zhi,

Can you try to call it with am additional colon? methylKit:::mergeTabix

alexg9010 avatar Jul 27 '19 09:07 alexg9010

Hi Alex,

The methylKit::mergeTabix step works now with additional ':'. I now got a new error as below. Could you help me with it?

Error in readMethylBaseDB(dbpath = dbpath, dbtype = methylbaseDB1@dbtype, : could not find function "readMethylBaseDB"

PS the library is loaded.

lz870718 avatar Jul 27 '19 18:07 lz870718

Sorry, please use methylKit:::readMethylBaseDB.

These functions are not exported to the user, but are used internally by us developers. This is why you have to access them with methylKit:::.

alexg9010 avatar Jul 27 '19 19:07 alexg9010

I got it working with two batches of my data. I will try it out now with the full data set and see how it goes. Will keep you posted. Thanks again for your help.

lz870718 avatar Jul 27 '19 19:07 lz870718

Hi Alex,

When I try running it with all data. I got following error:

Error in merge.data.table(res, tmp, by = c("V1", "V2", "V3", "V4"), all = all) : x has some duplicated column name(s): V8.x,V9.x,V10.x,V11.x,V12.x,V13.x,V14.x,V15.x,V16.x,V17.x,V18.x,V19.x,V20.x,V21.x,V22.x,V23.x,V24.x,V25.x,V26.x,V27.x,V28.x,V29.x,V30.x,V31.x,V32.x,V33.x,V34.x,V35.x,V36.x,V37.x,V38.x,V39.x,V40.x,V41.x,V42.x,V43.x,V44.x,V45.x,V46.x,V47.x,V48.x,V49.x,V50.x,V51.x,V52.x,V53.x,V54.x,V55.x,V56.x,V57.x,V58.x,V59.x,V60.x,V61.x,V62.x,V63.x,V64.x,V65.x,V66.x,V67.x,V68.x,V69.x,V70.x,V71.x,V72.x,V73.x,V74.x,V75.x,V76.x,V77.x,V78.x,V79.x,V80.x,V81.x,V82.x,V83.x,V84.x,V85.x,V86.x,V87.x,V88.x,V89.x,V90.x,V91.x,V92.x,V93.x,V94.x,V95.x,V96.x,V97.x,V98.x,V99.x,V100.x,V101.x,V102.x,V103.x,V104.x,V105.x,V106.x,V107.x,V108.x,V109.x,V110.x,V111.x,V112.x,V113.x,V114.x,V115.x,V116.x,V117.x,V118.x,V119.x,V120.x,V121.x,V122.x,V123.x,V124.x,V125.x,V126.x,V127.x,V128.x,V129.x,V130.x,V131.x,V132.x,V133.x,V134.x,V135.x,V136.x,V137.x,V138.x,V139.x,V140.x,V141.x,V142.x,V143.x,V144.x,V145.x,V146.x,V147.x,V148.x,V8.y,V9.y,V10.y,V11.y,V12.y,V13.y,V14.y,V15.y,V16.y,V17.y,V Calls: gsub ... lapply -> FUN -> merge -> merge -> merge.data.table

Not sure if it is something has been reported before.

lz870718 avatar Jul 28 '19 03:07 lz870718

Hi Alex,

I worked around previous error message by merging no more than 4 methylbaseDB objects once a time. Then jointly merged the merged methylbaseDB objects later. Thank you for your help!

Zhi

lz870718 avatar Jul 29 '19 21:07 lz870718

Sorry.. Just got a new error while trying to check the dimension of the merged object:

dim(getData(batch_all)) Error in paste(tabixRes[[1]], collapse = "\n") : result would exceed 2^31-1 bytes

lz870718 avatar Jul 29 '19 22:07 lz870718

Hi Zhi, Can you try to do this chromosome by chromosome ? What happens when you use chr1 and chr2 only, do you still get an error ? We would like to debug this as methylDB shouldn't have big problems with memory however we don't have a smallest possible dataset that recreates the error. If we know what parameters number of samples and number of CpGs per sample creates this error we can simulate a dataset or use your example.

Best, Altuna

On Tue, Jul 30, 2019 at 12:20 AM lz870718 [email protected] wrote:

Sorry.. Just got a new error while trying to check the dimension of the merged object:

dim(getData(batch_all)) Error in paste(tabixRes[[1]], collapse = "\n") : result would exceed 2^31-1 bytes

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/158?email_source=notifications&email_token=AAE32EIZUINEMU354ZGB27DQB5ULRA5CNFSM4IEQ7PLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3CFSNY#issuecomment-516184375, or mute the thread https://github.com/notifications/unsubscribe-auth/AAE32EORZMIPX4NFLFGZXILQB5ULRANCNFSM4IEQ7PLA .

al2na avatar Jul 30 '19 02:07 al2na

Hi Dr. Akalin,

I tried with a subset of samples (n=200) and CpG bases (n=4,749,070) with no problem. When I extended the function with all samples (n=500) and CpG bases (n>10,000,000), the error came up.

Would you recommend me, in my case, to use tiling windows analysis to bring down the number of rows in the methylbaseDB object? I am not sure if/how this will affect differential/association analysis downstream, thus your advice is greatly appreciated.

PS: I've already excluded CHH and CHG sites from my analysis.

Best, Zhi

lz870718 avatar Jul 30 '19 03:07 lz870718

yes, you can try tiling if it decreases the number of rows. I would try 100 or 500 bp tiles

Best, Altuna

On Tue, Jul 30, 2019 at 5:03 AM lz870718 [email protected] wrote:

Hi Dr. Akalin,

I tried with a subset of samples (n=200) and CpG bases (n=4,749,070) with no problem. When I extended the function with all samples (n=500) and CpG bases (n>10,000,000), the error came up.

Would you recommend me, in my case, to use tiling windows analysis to bring down the number of rows in the methylbaseDB object? I am not sure if/how this will affect differential/association analysis downstream, thus your advice is greatly appreciated.

PS: I've already excluded CHH and CHG sites from my analysis.

Best, Zhi

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/158?email_source=notifications&email_token=AAE32EIS237WN5DCTKPTBATQB6VOLA5CNFSM4IEQ7PLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3CTLDQ#issuecomment-516240782, or mute the thread https://github.com/notifications/unsubscribe-auth/AAE32EM63ZV4IPAXTQI6OQ3QB6VOLANCNFSM4IEQ7PLA .

al2na avatar Jul 30 '19 06:07 al2na

Hi Zhi,

to get the dimension of your tabix file you could try to use this for now:

df <- fread.gzipped(getDBPath(batch_all))
dim(df)

It is possible that you are still able to work with batch_all as a methylBaseDB object, if you decrease the chunk.size parameter in every function call that you make. Also tileMethylCounts works chromosome by chromosome so it should work for you.

Best, Alex

alexg9010 avatar Jul 30 '19 08:07 alexg9010

Hi Alex and Altuna,

Thank you for your help on this matter. I am trying to annotate the whole big methylBaseDB object with annotateWithGeneParts() function. But I got this error message again as below.

Error in paste(tabixRes[[1]], collapse = "\n") : result would exceed 2^31-1 bytes Calls: annotateWithGeneParts ... getTabixByChunk -> tabix2df -> fread -> paste0 -> paste Execution halted

I was thinking about subset the methylBaseDB object by different chromosomes, but didn't find a good way to do so. Any suggestions would be appreciated.

Zhi

lz870718 avatar Apr 16 '20 22:04 lz870718

Hi @lz870718 ,

Sorry for the late reply. I do not know whether this is still required, but we have an internal function called applyTbxByChr defined here that applies a function on a tabix file per chromosome. See here (https://github.com/al2na/methylKit/blob/9620cd686a14483f2158c75805c952d066c7b7f8/R/methylDBFunctions.R#L1667) and here (https://github.com/al2na/methylKit/blob/9620cd686a14483f2158c75805c952d066c7b7f8/R/methylDBFunctions.R#L2305) for examples.

Best, Alex

alexg9010 avatar Apr 11 '22 09:04 alexg9010