Reduce memory usage of facets
Hi,
In my laboratory we use facets on many whole-genome human datasets. On this data facets has huge memory usage, approximately 150 GiB.
The purpose of this PR is to reduce facets memory usage. To do this, I replaced some classic R data.frame objects with tidyverse tibbles, and I use tidyverse pipe syntax to perform some operations on these tibbles.
With all these changes, memory usage is divided by 2.
On my test dataset the results are the same between my PR and version v0.6.1, but I may have missed something.
I'm not a good R developer and may have included some stupid mistakes, so if you would rather just take the idea of my change and rewrite it, please do.
Thanks
Can you give me some breakdown of where this memory explosion occurs? My back-of-the-envelope calculation says
R:> x = rnorm(12e6) # one locus every 250 bases across 3000 Megabase
R:> format(object.size(x), units="Mb")
[1] "91.6 Mb"
The jointseg data frame has 16 columns, but even that wouldn't translate to 150 GiB of memory use.
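As a rough extension of that estimate, the sketch below scales the single-vector figure to 16 columns without allocating any data. It assumes every column is stored as an 8-byte double, which is an assumption made for the arithmetic and not something checked against the facets code; the variable names are illustrative.

```r
# Rough upper-bound estimate; assumes all 16 columns are 8-byte doubles.
n_loci          <- 12e6   # one locus every 250 bases across 3000 Megabase
n_cols          <- 16     # columns in the jointseg data frame
bytes_per_value <- 8      # size of an R double

total_bytes <- n_loci * n_cols * bytes_per_value
total_gib   <- total_bytes / 1024^3

cat(sprintf("Estimated size: %.2f GiB\n", total_gib))
```

Under those assumptions the data frame comes to roughly 1.4 GiB, about two orders of magnitude below the reported 150 GiB.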
Have you tried using the readSnpMatrixDT.R in path/facets/extRfns/ to read in the data?
Thanks
With v0.6.1 the memory peak occurs during file reading; using readSnpMatrixDT.R, as my change does, solves this issue.
But I assume another peak occurs during preProcSample, more specifically in procSnps (some duplication, column creation, calls to Fortran code, and filtering that is not done in place).
With v0.6.1 and readSnpMatrixDT.R memory usage is 85 GiB; my version uses 70 GiB.
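One way to check where a peak actually occurs, instead of assuming it, is to reset R's memory high-water mark before a step and read it back afterwards. The sketch below uses only base R's gc(); peak_mb is a hypothetical helper written for this illustration and is not part of facets. It only sees memory managed by R, so allocations made directly inside compiled code may not be counted.

```r
# Hypothetical helper: peak R-managed memory (in Mb, as reported by gc())
# reached while evaluating an expression.
peak_mb <- function(expr) {
  invisible(gc(reset = TRUE))   # reset the "max used" statistics
  force(expr)                   # evaluate the step being measured
  g <- gc()
  # The Mb figure sits in the column right after "max used".
  mb_col <- which(colnames(g) == "max used") + 1
  sum(g[, mb_col])              # Ncells + Vcells
}

# Stand-in workload; in practice this would wrap one facets step at a time.
small <- peak_mb(rnorm(1e5))
large <- peak_mb(rnorm(5e6))
cat(sprintf("small: %.1f Mb, large: %.1f Mb\n", small, large))
```

Wrapping file reading and preProcSample separately in such a helper would give one figure per step, which is the kind of breakdown asked for above.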
Can you tell me how big the pileup matrix is, i.e. how many loci? And how many end up in jointseg? Thanks.
The pileup matrix contains 546,700,164 loci.
To count the loci in jointseg I looked at $jointseg in the output produced by procSample, which gives 5,583,831 loci.
Given that the whole genome is around 3 Gigabase, the pileup seems to have a locus every 6 bases. That is a lot of redundant data as they will be highly serially correlated. You can DM me if you want to talk about this further.
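The arithmetic behind that spacing can be reproduced from the numbers quoted in this thread. This is only a sketch: the 3 Gigabase genome size is the approximation used above, and the per-column size assumes an 8-byte double per locus, which is an assumption and not a statement about how facets stores the pileup.

```r
genome_bases  <- 3e9         # approximate genome size used above
pileup_loci   <- 546700164   # loci in the pileup matrix
jointseg_loci <- 5583831     # loci that end up in jointseg

spacing       <- genome_bases / pileup_loci    # bases per pileup locus
kept_fraction <- jointseg_loci / pileup_loci   # share of loci kept
gib_per_col   <- pileup_loci * 8 / 1024^3      # assumes 8-byte doubles

cat(sprintf("One locus every %.1f bases\n", spacing))
cat(sprintf("%.1f%% of pileup loci reach jointseg\n", 100 * kept_fraction))
cat(sprintf("About %.1f GiB per numeric column\n", gib_per_col))
```

Under those assumptions each full-length numeric column costs about 4 GiB, so every duplicated or temporary column over the whole pileup adds noticeably to the peak, while only about 1% of the loci reach jointseg.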
I will look into how your code can be used to reduce the memory use of procSnps.
Thanks