Summary statistics IO and methods
The MRC IEU at Bristol has a specification for storing GWAS summary statistics in a VCF file.
While I certainly have mixed feelings about using VCF files as a container format, they have done the hard work of providing tens of thousands of GWAS summary statistics VCFs at the OpenGWAS project.
There are more details in
- The MRC IEU OpenGWAS data infrastructure (2020)
- The variant call format provides efficient and robust storage of GWAS summary statistics (2021)
It would be great to figure out how to map the data in these GWAS VCF files to the sgkit data model and to write some methods on top of them.
Some additional resources on other approaches to file formats for summary stats:
- https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics
- https://github.com/jinghuazhao/SUMSTATS
- https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format
https://github.com/MRCIEU/pygwasvcf is Python code to parse GWAS-VCF files, but unfortunately it's built with pysam rather than cyvcf2.
So, looking at a few example GWAS-VCF files, they're just putting per-variant sumstats into the SAMPLE fields. It appears some files use the INFO field for variant-specific metadata like minor allele frequency that we might want to pick up as well, but otherwise, I don't think parsing is going to be too challenging.
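To make the layout concrete, here is a minimal sketch of pulling the per-variant sumstats out of one GWAS-VCF data line using only the standard library. It assumes a single ALT allele and a single sample column per row, and that the FORMAT keys follow the GWAS-VCF spec as I understand it (`ES` effect size, `SE` standard error, `LP` -log10 p-value, `AF` allele frequency, `ID` variant identifier). The example record and the function names are hypothetical, not taken from a real file or from pygwasvcf.

```python
# Hypothetical GWAS-VCF data line; the values are made up for illustration.
EXAMPLE_LINE = (
    "1\t49298\trs10399793\tT\tC\t.\tPASS\tAF=0.6\t"
    "ES:SE:LP:AF:ID\t0.0012:0.0035:0.13:0.62:rs10399793"
)


def parse_gwas_vcf_line(line):
    """Split one VCF data line into variant fields, INFO, and per-variant stats.

    Assumes exactly one sample column (the study) and one ALT allele.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, _qual, _filter, info, fmt = fields[:9]
    sample = fields[9]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True

    # FORMAT names the ':'-separated values stored in the sample column.
    stats = {}
    for key, value in zip(fmt.split(":"), sample.split(":")):
        if value == ".":
            stats[key] = None  # missing value
        elif key == "ID":
            stats[key] = value  # identifier, keep as a string
        else:
            stats[key] = float(value)

    return {
        "contig": chrom,
        "position": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt,
        "info": info_dict,
        "stats": stats,
    }


def lp_to_p_value(lp):
    """Convert a -log10 p-value (assumed meaning of LP) back to a p-value."""
    return 10.0 ** (-lp)


record = parse_gwas_vcf_line(EXAMPLE_LINE)
```

For real files a proper VCF reader such as cyvcf2 would handle headers, multiple samples, and compression; the point here is only that the sumstats live in the FORMAT/sample pair and any variant-level metadata lives in INFO.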
The hard part for us is figuring out if we want to define a blessed data model for sumstats and start adding operations that operate upon it.
There is the HumanBase tool from Olga Troyanskaya's lab, which runs a NetWAS for you if you provide it with sumstats. The docs describe the three formats it accepts: VEGAS, FORGE, and PLINK. I don't know anything more about them, but they may be worth considering.
Interesting, NetWAS seems to operate on per-gene summary statistics, rather than per-variant. It would be interesting to hear from the Bristol team if they've considered computing per-gene summary statistics as part of their OpenGWAS work.
Another entry in the sumstats library and formats space:
- Paper: MungeSumstats: A Bioconductor package for the standardisation and quality control of many GWAS summary statistics (2021)
- Docs: https://neurogenomics.github.io/MungeSumstats/
- Library of sumstats to download: https://github.com/mikegloudemans/gwas-download
- Analysis of sumstats formats: https://al-murphy.github.io/SumstatFormats/
New standard for summary statistics: https://ebispot.github.io/gwas-blog/new-standard-for-gwas-summary-statistics