
Summary statistics IO and methods

Open hammer opened this issue 3 years ago • 7 comments

The MRC IEU at Bristol has a specification for storing GWAS summary statistics in a VCF file.

While I certainly have mixed feelings about using VCF files as a container format, they have done the hard work of providing tens of thousands of GWAS summary statistics VCFs at the OpenGWAS project.

There are more details in

It would be great to figure out how to map the data in these GWAS VCF files to the sgkit data model and to write some methods on top of them.

hammer, Jan 17 '21 14:01

Some additional resources on other approaches to file formats for summary stats

  • https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics
  • https://github.com/jinghuazhao/SUMSTATS
  • https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format

hammer, Jan 28 '21 17:01

https://github.com/MRCIEU/pygwasvcf is Python code to parse GWAS-VCF files but it's built with pysam rather than cyvcf2, unfortunately.

hammer, Feb 18 '21 20:02

So, looking at a few example GWAS-VCF files, they're just putting per-variant sumstats into the SAMPLE fields. It appears some files use the INFO field for variant-specific metadata like minor allele frequency that we might want to pick up as well, but otherwise, I don't think parsing is going to be too challenging.

The hard part for us is figuring out if we want to define a blessed data model for sumstats and start adding operations that operate upon it.
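To make the parsing point above concrete, here is a minimal standard-library sketch of pulling per-variant sumstats out of one GWAS-VCF data line and collecting them into columns. It rests on several assumptions: the FORMAT keys `ES` (effect size), `SE` (standard error), and `LP` (-log10 p-value) reflect my reading of the GWAS-VCF specification and should be checked against it; the example line and its values are made up; and the output names (`beta`, `se`, `neg_log10_p`, `info_af`) are hypothetical, not an sgkit convention. It also assumes one study (one sample column) per file.

```python
import math

# Made-up GWAS-VCF data line: 8 fixed columns, FORMAT, then one sample column.
EXAMPLE_LINE = (
    "1\t49298\trs10399793\tT\tC\t.\tPASS\tAF=0.6\t"
    "ES:SE:LP:AF:ID\t0.0015:0.0021:0.32:0.62:rs10399793"
)


def _as_float(value):
    """Convert a VCF value to float, mapping missing ('.' or absent) to NaN."""
    return float(value) if value not in (None, "", ".") else math.nan


def parse_gwas_vcf_line(line):
    """Parse one tab-separated data line into a flat per-variant record."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, _qual, _filter, info, fmt = fields[:9]
    sample = fields[9]  # assumption: a single study per file

    # INFO is a ';'-separated list of KEY=VALUE pairs (or '.' when empty).
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value

    # FORMAT names the ':'-separated values held in the sample column.
    stats = dict(zip(fmt.split(":"), sample.split(":")))

    return {
        "contig": chrom,
        "position": int(pos),
        "id": vid,
        "alleles": (ref, alt),
        "info_af": _as_float(info_dict.get("AF")),
        "beta": _as_float(stats.get("ES")),
        "se": _as_float(stats.get("SE")),
        "neg_log10_p": _as_float(stats.get("LP")),
    }


def to_columns(records):
    """Pivot a non-empty list of records into a dict of equal-length columns."""
    return {key: [rec[key] for rec in records] for key in records[0]}


record = parse_gwas_vcf_line(EXAMPLE_LINE)
columns = to_columns([record])
```

The columnar dict is only a stand-in for whatever variant-indexed arrays a sumstats data model would define; a real reader would more likely go through an existing VCF parser than split lines by hand.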

hammer, Feb 18 '21 21:02

There is this HumanBase tool from Olga Troyanskaya's lab that runs a NetWAS for you if you provide it with sumstats. The docs describe the three formats it accepts: VEGAS, FORGE, and PLINK. I don't know anything more about them, but they may be worth considering.

eric-czech, Apr 22 '21 15:04

Interesting, NetWAS seems to operate on per-gene summary statistics, rather than per-variant. It would be interesting to hear from the Bristol team if they've considered computing per-gene summary statistics as part of their OpenGWAS work.

hammer, Apr 22 '21 15:04

Another entry in the sumstats library and formats space:

hammer, Jun 22 '21 15:06

A new standard for GWAS summary statistics: https://ebispot.github.io/gwas-blog/new-standard-for-gwas-summary-statistics

hammer, Jul 21 '22 02:07