bigsnpr
Add support for plink2 binary format dosage data (pgen)
E.g. via a snp_readPgen() function analogous to snp_readBed().
The plink2 binary format (https://www.cog-genomics.org/plink/2.0/input#pgen) has several advantages over the plink1 binary format (bed/bim/fam):
- Data can be stored as probabilistic dosages (i.e. floating-point values between 0 and 2) rather than hard-call genotypes
- There's less missing data than with hard-call genotypes
- It's more compact than the plink1 binary format (e.g. the full UKB dataset takes 2.4 TB in pgen format and 11 TB in bed format).
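
To make the first two points concrete, here is a minimal sketch in plain Python (not bigsnpr or plink2 code) of how a dosage relates to a hard call. The function names `dosage` and `hard_call` are hypothetical, and the 0.1 threshold is only an illustrative assumption about how far a dosage may sit from an integer before the hard call is set to missing.

```python
def dosage(p_het, p_hom_alt):
    """Expected alternate-allele count given genotype probabilities."""
    return p_het + 2.0 * p_hom_alt


def hard_call(d, threshold=0.1):
    """Round a dosage to 0/1/2, or return None (missing) when the dosage
    is further than `threshold` from the nearest integer.
    The threshold value is an assumption chosen for illustration."""
    nearest = round(d)
    return nearest if abs(d - nearest) <= threshold else None


# Hypothetical (P(het), P(hom alt)) pairs for three samples at one variant.
probs = [(0.02, 0.97), (0.55, 0.05), (0.01, 0.0)]
for p_het, p_hom_alt in probs:
    d = dosage(p_het, p_hom_alt)
    print(f"dosage={d:.2f} hard_call={hard_call(d)}")
```

An uncertain genotype such as the second sample keeps an informative dosage but becomes missing once converted to a hard call, which is why dosage data tend to have fewer missing values.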
Disadvantages:
- The plink2 binary format is still a draft specification, so it may be subject to change. That being said, I've been working with pgen files for a few years now, through numerous updates to plink2, and there have been no such changes in that time, so I gather the format is relatively mature.
Please see https://github.com/privefl/bigsnpr/issues/176#issuecomment-791629700.
If anyone is willing to help implement this, please discuss here.
In the meantime, you can have a look at the last point of https://privefl.github.io/bigsnpr-extdoc/inputs-and-formats.html#getting-FBM for a workaround.