Runtime for large datasets.

Open sdsilva10 opened this issue 3 years ago • 4 comments

Hi,

I am trying to generate cohort metrics for QC via peddy. My sample size is about 187,000. I have provided the gzip-compressed VCF and the fam file (PLINK format) for these samples as input. When I run the command for the QC plots, all sample IDs are listed and the terminal prints "ped_check". However, there is no progress beyond this stage, and the process keeps running past the 24-hour mark.

I ran this on an HPC node: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, RAM: 180 GB.

Is there a limitation on the input sample size?

sdsilva10, Sep 29 '21 12:09

187 thousand samples!? That is too big for peddy. You might try somalier on batches of ~20 thousand at a time.
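As a rough illustration of the batching suggested above, here is a minimal Python sketch that reads sample IDs from a PLINK .fam file and splits them into batches of about 20 thousand. The function names, the file path, and the choice to split by plain file order are assumptions made for this example; they are not part of peddy or somalier.

```python
from typing import List


def read_fam_sample_ids(fam_path: str) -> List[str]:
    """Return the within-family sample IDs (column 2) of a PLINK .fam file."""
    sample_ids = []
    with open(fam_path) as handle:
        for line in handle:
            fields = line.split()  # .fam files are whitespace-delimited
            if len(fields) >= 2:   # skip blank or malformed lines
                sample_ids.append(fields[1])
    return sample_ids


def make_batches(sample_ids: List[str], batch_size: int = 20_000) -> List[List[str]]:
    """Split the IDs into consecutive batches of at most batch_size (hypothetical helper)."""
    return [
        sample_ids[start:start + batch_size]
        for start in range(0, len(sample_ids), batch_size)
    ]


# Synthetic IDs standing in for a real cohort of 187,000 samples.
demo_ids = [f"S{i}" for i in range(187_000)]
demo_batches = make_batches(demo_ids)
print(len(demo_batches), len(demo_batches[-1]))
```

Note that splitting by file order can place related samples in different batches, and pairs that span two batches are never compared.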

brentp, Sep 29 '21 13:09

Ok, I'll give that a try. Thank you.

sdsilva10, Oct 01 '21 06:10

Is there any procedure where I can merge the intermediate files of the sample subsets so as to generate results for the whole sample set?

sdsilva10, Oct 01 '21 09:10

do you mean for peddy? no.

I would just use somalier for 20K at a time. In order to compare all pairwise combinations for your samples, you'd need to do (187K × 187K) 34,969,000,000 pairwise comparisons and hold multiple matrices with that many entries. You might be able to do all samples at once with somalier on a machine with 1TB of memory.
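The arithmetic behind that estimate can be checked with a short Python sketch. The 4 bytes per matrix entry is an assumption chosen for illustration; the actual storage used by peddy or somalier may differ, and several such matrices would be needed at once.

```python
def pairwise_counts(n_samples: int) -> tuple:
    """Return (entries in a full n x n matrix, unique unordered pairs)."""
    full_matrix_entries = n_samples * n_samples
    unique_pairs = n_samples * (n_samples - 1) // 2
    return full_matrix_entries, unique_pairs


def matrix_gigabytes(n_samples: int, bytes_per_entry: int = 4) -> float:
    """Size in GB (10**9 bytes) of one dense n x n matrix.

    bytes_per_entry=4 is an assumed value, not taken from either tool.
    """
    return n_samples * n_samples * bytes_per_entry / 1e9


for n in (187_000, 20_000):
    entries, pairs = pairwise_counts(n)
    print(f"{n:>7} samples: {entries:,} entries, {pairs:,} unique pairs, "
          f"{matrix_gigabytes(n):.1f} GB per matrix")
```

Under that assumption, a single dense matrix for all 187,000 samples takes about 140 GB, while a 20,000-sample batch takes about 1.6 GB per matrix.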

brentp, Oct 01 '21 10:10