methylKit
add cores arg to methSeg
Hi guys,
I added parallel execution to methSeg() for both methylDB objects and non-tabix objects.
I separated the code for running fastseg and mclust into two auxiliary functions, .run.fastseg and .run.mclust. The parallelization happens in the fastseg step, and the results of the individual fastseg runs are concatenated. For that, I added "GRanges" as a return.type option in applyTbxByChr, so that the GRanges can be concatenated.
I wrote some tests, but I think I should probably add more of them.
Let me know what you think about it.
Kasia
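To illustrate the split, segment and concatenate pattern described in the PR description, here is a minimal sketch in base R. It is not the code from this pull request: `toy_segment` and `segment_by_chr` are hypothetical names, the toy segmenter is a simple threshold rule rather than fastseg, and plain data frames stand in for GRanges.

```r
library(parallel)

# Hypothetical stand-in for a per-chromosome segmenter (NOT fastseg):
# consecutive positions on the same side of `cutoff` form one segment.
toy_segment <- function(df, cutoff = 0.5) {
  runs   <- rle(df$meth > cutoff)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  data.frame(chr   = df$chr[starts],
             start = df$pos[starts],
             end   = df$pos[ends],
             high  = runs$values,
             stringsAsFactors = FALSE)
}

# Segment each chromosome independently, then concatenate the pieces.
segment_by_chr <- function(df, cores = 1) {
  # mclapply relies on forking, which is not available on Windows
  if (.Platform$OS.type == "windows") cores <- 1
  pieces <- split(df, df$chr)
  segs   <- mclapply(pieces, toy_segment, mc.cores = cores)
  out    <- do.call(rbind, segs)
  rownames(out) <- NULL
  out
}

toy <- data.frame(
  chr  = rep(c("chr1", "chr2"), each = 6),
  pos  = rep(seq(100, 600, by = 100), times = 2),
  meth = c(0.9, 0.8, 0.1, 0.2, 0.7, 0.9,
           0.1, 0.1, 0.1, 0.9, 0.9, 0.2),
  stringsAsFactors = FALSE
)

res1 <- segment_by_chr(toy, cores = 1)
res2 <- segment_by_chr(toy, cores = 2)
print(res1)
```

Because each chromosome is processed on its own, the result should not depend on the number of cores; only the wall-clock time can change.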
I improved the code according to your suggestions, except for @al2na's suggestion about the if clause: https://github.com/al2na/methylKit/pull/120#discussion-diff-199020615R31
Could we also comment the code wherever possible? Please think about the people who will maintain this in the future, or your future selves. Things that seem trivial now are not going to be trivial after 3 months of not looking at the code.
@al2na I added more comments, I hope it's better now.
There is something wrong when join.neighbours=TRUE and initialize.on.subset!=1; I am checking it.
I checked whether using methylRawDB with multiple cores is faster than using a methylRaw object, on example data with ~350K Cs (two chromosomes), and methylRaw is faster. I don't know why. Maybe it depends on the size of the input; I will check that.
library(rbenchmark)
b <- benchmark(
  methylRaw.cores.1   = methSeg(obj.methylraw, diagnostic.plot = FALSE, join.neighbours = FALSE),
  methylRaw.cores.2   = methSeg(obj.methylraw, diagnostic.plot = FALSE, join.neighbours = FALSE, cores = 2),
  methylRawDB.cores.1 = methSeg(obj, diagnostic.plot = FALSE, join.neighbours = FALSE),
  methylRawDB.cores.2 = methSeg(obj, diagnostic.plot = FALSE, join.neighbours = FALSE, cores = 2),
  replications = 5,
  columns = c('test', 'replications', 'elapsed'))
> print(b)
                 test replications elapsed
1   methylRaw.cores.1            5  39.026
2   methylRaw.cores.2            5  38.495
3 methylRawDB.cores.1            5  46.146
4 methylRawDB.cores.2            5  45.640
Please check datasets that have multiple chromosomes, let's say at least 5, and also compare memory consumption.
I checked it using 5 chromosomes and it's not better.
> myRaw
methylRaw object with 3784497 rows
--------------
    chr   start     end strand coverage numCs numTs
1 chr21 9411552 9411552      +       45    12    33
2 chr21 9411553 9411553      -       70    27    43
3 chr21 9411784 9411784      +       31     4    27
4 chr21 9411785 9411785      -       46    12    34
5 chr21 9412099 9412099      +       26    15    11
6 chr21 9412100 9412100      -       35    16    19
--------------
sample.id: id
assembly: assembly
context: CpG
resolution: base
library(rbenchmark)
b <- benchmark(
  methylRaw.cores.1   = methSeg(myRaw, diagnostic.plot = FALSE, join.neighbours = FALSE),
  methylRaw.cores.5   = methSeg(myRaw, diagnostic.plot = FALSE, join.neighbours = FALSE, mc.cores = 5),
  methylRawDB.cores.1 = methSeg(mymethylRawDB, diagnostic.plot = FALSE, join.neighbours = FALSE),
  methylRawDB.cores.5 = methSeg(mymethylRawDB, diagnostic.plot = FALSE, join.neighbours = FALSE, mc.cores = 5),
  replications = 3,
  columns = c('test', 'replications', 'elapsed'))
> print(b)
                 test replications elapsed
1   methylRaw.cores.1            3 257.613
2   methylRaw.cores.5            3 259.420
3 methylRawDB.cores.1            3 295.970
4 methylRawDB.cores.5            3 297.785
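To make the comparison easier to read, the elapsed times from the two benchmark tables can be turned into speedup ratios (single-core time divided by multi-core time). This is only arithmetic on the numbers reported in this thread; the column names are made up for illustration.

```r
# Elapsed times (seconds) copied from the two benchmark tables in this thread
elapsed <- data.frame(
  dataset = rep(c("2 chromosomes", "5 chromosomes"), each = 2),
  object  = rep(c("methylRaw", "methylRawDB"), times = 2),
  single  = c(39.026, 46.146, 257.613, 295.970),
  multi   = c(38.495, 45.640, 259.420, 297.785),
  stringsAsFactors = FALSE
)

# speedup > 1: multiple cores were faster; speedup < 1: they were slower
elapsed$speedup <- elapsed$single / elapsed$multi
print(elapsed)
```

A ratio close to 1 in every row means the extra cores made essentially no difference to the total run time.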
Thanks @alexg9010 for the suggestion to use profvis, but it didn't work for me; I got an error that I didn't know what to do with. I used profmem instead, and it showed that memory usage with multiple cores is lower than with a single core.
methylRaw.cores.1 = 47888 bytes
methylRaw.cores.5 = 39656 bytes
methylRawDB.cores.1 = 151380128 bytes
methylRawDB.cores.5 = 121104112 bytes
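As a quick check, the relative memory reduction can be computed from the four profmem totals reported above. Again, this is just arithmetic on the figures from this thread; the variable names are illustrative.

```r
# Total bytes reported by profmem, copied from the lines above
bytes <- c(methylRaw.cores.1   = 47888,
           methylRaw.cores.5   = 39656,
           methylRawDB.cores.1 = 151380128,
           methylRawDB.cores.5 = 121104112)

# Fraction of memory saved when going from 1 core to 5 cores
reduction <- c(
  methylRaw   = 1 - bytes[["methylRaw.cores.5"]]   / bytes[["methylRaw.cores.1"]],
  methylRawDB = 1 - bytes[["methylRawDB.cores.5"]] / bytes[["methylRawDB.cores.1"]]
)
print(round(100 * reduction, 1))  # percent
```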
@al2na @alexg9010 I didn't manage to show that this method is faster. Should we close this pull request?