
Add support for plink2 binary format dosage data (pgen)

Open sritchie73 opened this issue 3 years ago • 3 comments

E.g. via a snp_readPgen() function analogous to snp_readBed().

The plink2 binary format (https://www.cog-genomics.org/plink/2.0/input#pgen) has several advantages over the plink1 binary format (bed/bim/fam):

  • Data are stored as probabilistic dosages (floating-point values between 0 and 2) rather than hard-call genotypes.
  • There are fewer missing values than in hard-call genotype data.
  • It's more compact than the plink1 binary format (e.g. the full UKB dataset takes 2.4 TB in pgen format versus 11 TB in bed format).
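
To illustrate the first two points, here is a minimal sketch of how hard-calling a dosage can introduce missing values. The `hard_call` helper and the 0.1 threshold are assumptions made for illustration (the threshold is meant to mirror plink2's `--hard-call-threshold` default as I understand it); this is not bigsnpr or plink2 code.

```python
def hard_call(dosage, threshold=0.1):
    """Hypothetical helper: map a dosage in [0, 2] to a hard call.

    Returns 0, 1, or 2 when the dosage lies within `threshold` of that
    integer, and None (missing) otherwise.
    """
    nearest = round(dosage)
    if abs(dosage - nearest) <= threshold:
        return int(nearest)
    return None  # too uncertain to call, so the genotype becomes missing


# Made-up example dosages for five samples at one variant
dosages = [0.02, 0.95, 1.50, 1.98, 0.30]
calls = [hard_call(d) for d in dosages]
print(calls)  # [0, 1, None, 2, None]
```

The dosage vector has no missing entries, while two of the five hard calls are missing under this threshold.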

Disadvantages:

  • The plink2 binary format is still a draft specification, so it may be subject to change. That said, I've been working with pgen files for a few years now, through numerous plink2 updates, and there have been no such changes in that time, so I gather the format is relatively mature.

sritchie73 avatar Aug 09 '21 13:08 sritchie73

Please see https://github.com/privefl/bigsnpr/issues/176#issuecomment-791629700.

privefl avatar Aug 09 '21 13:08 privefl

If anyone is willing to help implement this, please discuss here.

privefl avatar Sep 02 '21 07:09 privefl

In the meantime, you can have a look at the last point of https://privefl.github.io/bigsnpr-extdoc/inputs-and-formats.html#getting-FBM for a workaround.

privefl avatar Nov 21 '22 20:11 privefl