methylKit
methylKit copied to clipboard
Unite myobj with hundreds of samples on cluster (killed error)
Hi Dr. Akalin,
Thank you for providing such great tool. We just encountered some errors when unite myobj with hundreds of samples. I even ran it with the maximum memory allowed for individual use. Still can't work it out.
Is there a way I can first unite a couple sample into one methylBaseDB, then merge it later with other united methylBaseDBs to create a meta methylBaseDB for downstream analysis. Or is there any other way to get around with it.
PS: I run unite function with smaller sample size (5-20) without any problem.
I really appreciate your time and help!
Zhi
Hi Zhi,
Thank you for using the package! I have a hacky solution to your problem, that could be worth integrating it into methylKit.
library(methylKit)
## specify example files
file.list=list( system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
system.file("extdata", "test2.myCpG.txt", package = "methylKit"),
system.file("extdata", "control1.myCpG.txt", package = "methylKit"),
system.file("extdata", "control2.myCpG.txt", package = "methylKit") )
## read into methylRawDB
mydblist = suppressMessages(methRead( file.list,
sample.id=list("t1","t2","c1","c2"),assembly="hg18",
pipeline="amp",treatment=c(1,1,0,0),dbtype = "tabix",dbdir="methylDB"))
## make larger object
mydblist <- methylRawListDB(mydblist[[1]],mydblist[[2]],mydblist[[3]],mydblist[[4]],
mydblist[[1]],mydblist[[2]],mydblist[[3]],mydblist[[4]],
mydblist[[1]],mydblist[[2]],mydblist[[3]],mydblist[[4]],
mydblist[[1]],mydblist[[2]],mydblist[[3]],mydblist[[4]],
treatment = c(rep(0,8),rep(1,8)))
## use profvis to monitor memory usage
library(profvis)
profvis::profvis({
## unite whole object
suppressMessages(methidhDB <- unite(mydblist))
## unite splits
suppressMessages(methidhDB1 <- unite(methylRawListDB(mydblist[1:10],treatment = getTreatment(mydblist)[1:10])))
suppressMessages(methidhDB2 <- unite(methylRawListDB(mydblist[11:16],treatment = getTreatment(mydblist)[11:16])))
## merge splits afterwards
dbpath <- gsub(".tbi","",methylKit::mergeTabix(tabixList = c(getDBPath(methidhDB1),getDBPath(methidhDB2)),dir = "methylDB/",filename = "newMethylBaseDB"))
## load merged splits
mybasedb <- readMethylBaseDB(dbpath = dbpath,dbtype = mydblist[[1]]@dbtype,
sample.ids = getSampleID(mydblist),assembly = mydblist[[1]]@assembly,
context = mydblist[[1]]@context,resolution = mydblist[[1]]@resolution,
treatment = mydblist@treatment,destranded = FALSE)
## check for equality
all.equal(mybasedb,methidhDB)
identical(getData(mybasedb),getData(methidhDB))
})
Please note that I did not do extensive testing, I only checked this with our example data, but I would really like to hear whether this solves your problem.
Best, Alex
Hi Alex,
Thank you so much for your reply. I just tried the solution provided. However, I encounter below error message
Error: 'mergeTabix' is not an exported object from 'namespace:methylKit'
Could you help me have a look at it and see if I made any mistake? Thanks!
PS My sessionInfo() is :
R version 3.5.1 (2018-07-02) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux
Matrix products: default BLAS: /gpfs/share/apps/R/3.5.1/lib64/R/lib/libRblas.so LAPACK: /gpfs/share/apps/R/3.5.1/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base
other attached packages: [1] methylKit_1.8.1 GenomicRanges_1.34.0 GenomeInfoDb_1.18.2 [4] IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.12.0 qvalue_2.14.1
[3] gtools_3.8.1 tidyselect_0.2.5
[5] reshape2_1.4.3 purrr_0.3.2
[7] splines_3.5.1 lattice_0.20-38
[9] colorspace_1.4-1 rtracklayer_1.42.2
[11] XML_3.98-1.20 rlang_0.3.4
[13] R.oo_1.22.0 pillar_1.4.1
[15] glue_1.3.1 R.utils_2.8.0
[17] BiocParallel_1.16.6 fastseg_1.28.0
[19] matrixStats_0.54.0 GenomeInfoDbData_1.2.0
[21] plyr_1.8.4 stringr_1.4.0
[23] zlibbioc_1.28.0 Biostrings_2.50.2
[25] munsell_0.5.0 gtable_0.3.0
[27] R.methodsS3_1.7.1 coda_0.19-2
[29] Biobase_2.42.0 Rcpp_1.0.1
[31] scales_1.0.0 limma_3.38.3
[33] DelayedArray_0.8.0 XVector_0.22.0
[35] Rsamtools_1.34.1 ggplot2_3.1.1
[37] stringi_1.4.3 dplyr_0.8.1
[39] numDeriv_2016.8-1.1 grid_3.5.1
[41] bitops_1.0-6 tools_3.5.1
[43] bbmle_1.0.20 magrittr_1.5
[45] lazyeval_0.2.2 RCurl_1.95-4.12
[47] tibble_2.1.3 crayon_1.3.4
[49] pkgconfig_2.0.2 Matrix_1.2-17
[51] MASS_7.3-51.4 data.table_1.12.2
[53] assertthat_0.2.1 emdbook_1.3.11
[55] R6_2.4.0 mclust_5.4.3
[57] GenomicAlignments_1.18.1 compiler_3.5.1
Hi Zhi,
Can you try to call it with am additional colon?
methylKit:::mergeTabix
Hi Alex,
The methylKit::mergeTabix step works now with additional ':'. I now got a new error as below. Could you help me with it?
Error in readMethylBaseDB(dbpath = dbpath, dbtype = methylbaseDB1@dbtype, : could not find function "readMethylBaseDB"
PS the library is loaded.
Sorry, please use methylKit:::readMethylBaseDB
.
These functions are not exported to the user, but are used internally by us developers. This is why you have to access them with methylKit:::
.
I got it working with two batches of my data. I will try it out now with the full data set and see how it goes. Will keep you posted. Thanks again for your help.
Hi Alex,
When I try running it with all data. I got following error:
Error in merge.data.table(res, tmp, by = c("V1", "V2", "V3", "V4"), all = all) : x has some duplicated column name(s): V8.x,V9.x,V10.x,V11.x,V12.x,V13.x,V14.x,V15.x,V16.x,V17.x,V18.x,V19.x,V20.x,V21.x,V22.x,V23.x,V24.x,V25.x,V26.x,V27.x,V28.x,V29.x,V30.x,V31.x,V32.x,V33.x,V34.x,V35.x,V36.x,V37.x,V38.x,V39.x,V40.x,V41.x,V42.x,V43.x,V44.x,V45.x,V46.x,V47.x,V48.x,V49.x,V50.x,V51.x,V52.x,V53.x,V54.x,V55.x,V56.x,V57.x,V58.x,V59.x,V60.x,V61.x,V62.x,V63.x,V64.x,V65.x,V66.x,V67.x,V68.x,V69.x,V70.x,V71.x,V72.x,V73.x,V74.x,V75.x,V76.x,V77.x,V78.x,V79.x,V80.x,V81.x,V82.x,V83.x,V84.x,V85.x,V86.x,V87.x,V88.x,V89.x,V90.x,V91.x,V92.x,V93.x,V94.x,V95.x,V96.x,V97.x,V98.x,V99.x,V100.x,V101.x,V102.x,V103.x,V104.x,V105.x,V106.x,V107.x,V108.x,V109.x,V110.x,V111.x,V112.x,V113.x,V114.x,V115.x,V116.x,V117.x,V118.x,V119.x,V120.x,V121.x,V122.x,V123.x,V124.x,V125.x,V126.x,V127.x,V128.x,V129.x,V130.x,V131.x,V132.x,V133.x,V134.x,V135.x,V136.x,V137.x,V138.x,V139.x,V140.x,V141.x,V142.x,V143.x,V144.x,V145.x,V146.x,V147.x,V148.x,V8.y,V9.y,V10.y,V11.y,V12.y,V13.y,V14.y,V15.y,V16.y,V17.y,V Calls: gsub ... lapply -> FUN -> merge -> merge -> merge.data.table
Not sure if it is something has been reported before.
Hi Alex,
I worked around previous error message by merging no more than 4 methylbaseDB objects once a time. Then jointly merged the merged methylbaseDB objects later. Thank you for your help!
Zhi
Sorry.. Just got a new error while trying to check the dimension of the merged object:
dim(getData(batch_all)) Error in paste(tabixRes[[1]], collapse = "\n") : result would exceed 2^31-1 bytes
Hi Zhi, Can you try to do this chromosome by chromosome ? What happens when you use chr1 and chr2 only, do you still get an error ? We would like to debug this as methylDB shouldn't have big problems with memory however we don't have a smallest possible dataset that recreates the error. If we know what parameters number of samples and number of CpGs per sample creates this error we can simulate a dataset or use your example.
Best, Altuna
On Tue, Jul 30, 2019 at 12:20 AM lz870718 [email protected] wrote:
Sorry.. Just got a new error while trying to check the dimension of the merged object:
dim(getData(batch_all)) Error in paste(tabixRes[[1]], collapse = "\n") : result would exceed 2^31-1 bytes
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/158?email_source=notifications&email_token=AAE32EIZUINEMU354ZGB27DQB5ULRA5CNFSM4IEQ7PLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3CFSNY#issuecomment-516184375, or mute the thread https://github.com/notifications/unsubscribe-auth/AAE32EORZMIPX4NFLFGZXILQB5ULRANCNFSM4IEQ7PLA .
Hi Dr. Akalin,
I tried with a subset of samples (n=200) and CpG bases (n=4,749,070) with no problem. When I extended the function with all samples (n=500) and CpG bases (n>10,000,000), the error came up.
Would you recommend me, in my case, to use tiling windows analysis to bring down the number of rows in the methylbaseDB object? I am not sure if/how this will affect differential/association analysis downstream, thus your advice is greatly appreciated.
PS: I've already excluded CHH and CHG sites from my analysis.
Best, Zhi
yes, you can try tiling if it decreases the number of rows. I would try 100 or 500 bp tiles
Best, Altuna
On Tue, Jul 30, 2019 at 5:03 AM lz870718 [email protected] wrote:
Hi Dr. Akalin,
I tried with a subset of samples (n=200) and CpG bases (n=4,749,070) with no problem. When I extended the function with all samples (n=500) and CpG bases (n>10,000,000), the error came up.
Would you recommend me, in my case, to use tiling windows analysis to bring down the number of rows in the methylbaseDB object? I am not sure if/how this will affect differential/association analysis downstream, thus your advice is greatly appreciated.
PS: I've already excluded CHH and CHG sites from my analysis.
Best, Zhi
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/158?email_source=notifications&email_token=AAE32EIS237WN5DCTKPTBATQB6VOLA5CNFSM4IEQ7PLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3CTLDQ#issuecomment-516240782, or mute the thread https://github.com/notifications/unsubscribe-auth/AAE32EM63ZV4IPAXTQI6OQ3QB6VOLANCNFSM4IEQ7PLA .
Hi Zhi,
to get the dimension of your tabix file you could try to use this for now:
df <- fread.gzipped(getDBPath(batch_all))
dim(df)
It is possible that you are still able to work with batch_all as a methylBaseDB object, if you decrease the chunk.size parameter in every function call that you make. Also tileMethylCounts works chromosome by chromosome so it should work for you.
Best, Alex
Hi Alex and Altuna,
Thank you for your help on this matter. I am trying to annotate the whole big methylBaseDB object with annotateWithGeneParts() function. But I got this error message again as below.
Error in paste(tabixRes[[1]], collapse = "\n") : result would exceed 2^31-1 bytes Calls: annotateWithGeneParts ... getTabixByChunk -> tabix2df -> fread -> paste0 -> paste Execution halted
I was thinking about subset the methylBaseDB object by different chromosomes, but didn't find a good way to do so. Any suggestions would be appreciated.
Zhi
Hi @lz870718 ,
Sorry for the late reply. I do not know whether this is still required, but we have an internal function called applyTbxByChr
defined here that applies a function on a tabix file per chromosome.
See here (https://github.com/al2na/methylKit/blob/9620cd686a14483f2158c75805c952d066c7b7f8/R/methylDBFunctions.R#L1667) and here (https://github.com/al2na/methylKit/blob/9620cd686a14483f2158c75805c952d066c7b7f8/R/methylDBFunctions.R#L2305) for examples.
Best, Alex