add cores arg to methSeg

Open katwre opened this issue 7 years ago • 8 comments

Hi guys, I added a parallel methSeg() that works with methylDB objects and with non-tabix objects. I separated the code for running fastseg and mclust into two auxiliary functions, .run.fastseg and .run.mclust. The parallelization happens in the fastseg step, and the results of the individual fastseg runs are concatenated. For that, I added "GRanges" as a return.type in applyTbxByChr, so that GRanges can be concatenated. I wrote some tests, but I think I should probably add more. Let me know what you think. Kasia
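
The per-chromosome scheme described above (segment each chromosome separately, then concatenate) might look roughly like the following. This is only a sketch and not the code in this pull request: `segment_chr` and `segment_by_chr` are hypothetical names, a simple base-R run-length step stands in for fastseg, and a plain data.frame stands in for GRanges.

```r
library(parallel)

# Hypothetical stand-in for the per-chromosome segmentation step
# (the pull request runs fastseg here). It groups consecutive positions
# that share the same rounded methylation level into one segment.
segment_chr <- function(df) {
  df <- df[order(df$start), ]
  runs <- rle(round(df$meth, 1))
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  data.frame(chr = df$chr[1],
             start = df$start[starts],
             end = df$start[ends],
             seg.mean = runs$values,
             num.mark = runs$lengths)
}

# Run the segmentation once per chromosome and concatenate the pieces.
# mclapply() relies on forking, so on Windows only mc.cores = 1 works.
segment_by_chr <- function(df, cores = 1) {
  per_chr <- split(df, df$chr)
  segs <- mclapply(per_chr, segment_chr, mc.cores = cores)
  out <- do.call(rbind, segs)
  rownames(out) <- NULL
  out
}

# Toy input: two chromosomes with six positions each (made-up values).
toy <- data.frame(chr = rep(c("chr1", "chr2"), each = 6),
                  start = rep(seq(100, 600, by = 100), times = 2),
                  meth = c(0.9, 0.9, 0.9, 0.1, 0.1, 0.1,
                           0.5, 0.5, 0.1, 0.1, 0.1, 0.9))
res <- segment_by_chr(toy, cores = 1)
print(res)
```

With this layout the work can be split across at most as many workers as there are chromosomes, and the wall-clock time is bounded below by the slowest chromosome, so any gain depends on how evenly the input is spread over chromosomes.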

katwre avatar Jun 28 '18 16:06 katwre

I improved the code according to your suggestions, except for @al2na's suggestion about the if clause: https://github.com/al2na/methylKit/pull/120#discussion-diff-199020615R31

katwre avatar Jun 29 '18 13:06 katwre

Could we also comment the code wherever possible? Please think about the people who will maintain this in the future, or about your future selves. Things that seem trivial now are not going to be trivial after 3 months of not looking at the code.

al2na avatar Jun 29 '18 14:06 al2na

@al2na I added more comments; I hope it's better now.

katwre avatar Jul 02 '18 10:07 katwre

There is something wrong when join.neighbours=TRUE and initialize.on.subset!=1. I am checking it.

katwre avatar Jul 02 '18 14:07 katwre

I checked whether methylRawDB with multiple cores is faster than a methylRaw object, using example data with ~350K Cs (two chromosomes), and methylRaw is faster. I don't know why. Maybe it depends on the size of the input; I will check that.

library(rbenchmark)

# Time methSeg() on the in-memory (methylRaw) and tabix-backed (methylRawDB)
# objects, each with 1 and 2 cores
b <- benchmark(methylRaw.cores.1   = methSeg(obj.methylraw, diagnostic.plot = FALSE, join.neighbours = FALSE),
               methylRaw.cores.2   = methSeg(obj.methylraw, diagnostic.plot = FALSE, join.neighbours = FALSE, cores = 2),
               methylRawDB.cores.1 = methSeg(obj, diagnostic.plot = FALSE, join.neighbours = FALSE),
               methylRawDB.cores.2 = methSeg(obj, diagnostic.plot = FALSE, join.neighbours = FALSE, cores = 2),
               replications = 5,
               columns = c('test', 'replications', 'elapsed'))
> print(b)
                test replications elapsed
1   methylRaw.cores.1            5  39.026
2   methylRaw.cores.2            5  38.495
3 methylRawDB.cores.1            5  46.146
4 methylRawDB.cores.2            5  45.640

katwre avatar Jul 04 '18 08:07 katwre

Please check datasets that have multiple chromosomes, let's say at least 5 chromosomes, and also compare memory consumption.

al2na avatar Jul 04 '18 09:07 al2na

I checked it using 5 chromosomes and it's not better.

> myRaw
methylRaw object with 3784497 rows
--------------
  chr   start     end strand coverage numCs numTs
1 chr21 9411552 9411552      +       45    12    33
2 chr21 9411553 9411553      -       70    27    43
3 chr21 9411784 9411784      +       31     4    27
4 chr21 9411785 9411785      -       46    12    34
5 chr21 9412099 9412099      +       26    15    11
6 chr21 9412100 9412100      -       35    16    19
--------------
  sample.id: id 
assembly: assembly 
context: CpG 
resolution: base 

library(rbenchmark)

# Same comparison on the 5-chromosome data set, with 1 and 5 cores
b <- benchmark(methylRaw.cores.1   = methSeg(myRaw, diagnostic.plot = FALSE, join.neighbours = FALSE),
               methylRaw.cores.5   = methSeg(myRaw, diagnostic.plot = FALSE, join.neighbours = FALSE, mc.cores = 5),
               methylRawDB.cores.1 = methSeg(mymethylRawDB, diagnostic.plot = FALSE, join.neighbours = FALSE),
               methylRawDB.cores.5 = methSeg(mymethylRawDB, diagnostic.plot = FALSE, join.neighbours = FALSE, mc.cores = 5),
               replications = 3,
               columns = c('test', 'replications', 'elapsed'))
> print(b)
                 test replications elapsed
1   methylRaw.cores.1            3 257.613
2   methylRaw.cores.5            3 259.420
3 methylRawDB.cores.1            3 295.970
4 methylRawDB.cores.5            3 297.785

Thanks @alexg9010 for the suggestion to use profvis, but it didn't work for me: I got an error that I didn't know what to do with. I used profmem instead, and it showed that memory usage with multiple cores is smaller than with a single core.

methylRaw.cores.1 = 47888 bytes 
methylRaw.cores.5 = 39656 bytes
methylRawDB.cores.1 = 151380128 bytes
methylRawDB.cores.5 = 121104112 bytes

katwre avatar Jul 16 '18 09:07 katwre

@al2na @alexg9010 I didn't manage to show that this method is faster. Should we close this pull request?

katwre avatar Aug 13 '18 14:08 katwre