peddy
Runtime for large datasets.
Hi,
I am trying to generate some cohort metrics for QC steps with peddy. My sample size is about 187,000. I provided the gzip-compressed VCF and the fam (PLINK format) file for these samples as input. When I run the command for the QC plots, all sample IDs are listed and the terminal output "ped_check" appears. However, there is no progress beyond this stage, and the process is still running after more than 24 hours.
I ran this on an HPC node: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, RAM: 180 GB.
Is there a limitation on the input sample size?
187 thousand samples!? That is too big for peddy. You might try somalier on batches of ~20 thousand at a time.
Ok, I'll give that a try. Thank you.
Is there any procedure where I can merge the intermediate files of the sample subsets so as to generate results for the whole sample set?
Do you mean for peddy? No.
I would just use somalier on 20K samples at a time. To compare all pairwise combinations for your samples, you'd need 187K × 187K = 34,969,000,000 pairwise comparisons (about half that if each unordered pair is counted only once), and multiple matrices with that many entries. You might be able to run all samples at once with somalier on a machine with 1 TB of memory.
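
For anyone who wants to redo this arithmetic for their own cohort, here is a rough sketch. It does not call peddy or somalier. The function name `pairwise_stats` is made up for illustration, and the 4 bytes per matrix entry is only an assumption (neither tool's actual memory layout is stated in this thread), so treat the memory figure as an order-of-magnitude guide.

```python
import math


def pairwise_stats(n_samples, bytes_per_entry=4):
    """Return (full matrix entries, unique unordered pairs, GiB for one full matrix).

    bytes_per_entry=4 is an assumed entry size (e.g. a 32-bit value),
    not something taken from peddy or somalier.
    """
    full_entries = n_samples * n_samples
    unique_pairs = n_samples * (n_samples - 1) // 2
    gib_per_matrix = full_entries * bytes_per_entry / 2**30
    return full_entries, unique_pairs, gib_per_matrix


n = 187_000
full_entries, unique_pairs, gib_per_matrix = pairwise_stats(n)
print(f"full matrix entries: {full_entries:,}")
print(f"unique sample pairs: {unique_pairs:,}")
print(f"one full matrix at 4 bytes/entry: {gib_per_matrix:.1f} GiB")

# Batches of ~20K samples, as suggested above.
batch_size = 20_000
n_batches = math.ceil(n / batch_size)
batch_entries, _, batch_gib = pairwise_stats(batch_size)
print(f"{n_batches} batches, {batch_entries:,} entries per batch matrix "
      f"({batch_gib:.2f} GiB each)")
```

Under that assumption a single full matrix already takes a large share of a 180 GB node, and several such matrices would not fit, which is consistent with the 1 TB suggestion. Note that batching only compares samples within the same batch, so pairs that fall in different batches are never compared.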