
about missing annotations for rare variants

Open szhan1000 opened this issue 1 year ago • 3 comments

Dear Omer,

I am planning to apply PolyFun + SuSiE with our own functional annotations (~10 different types) predicted by a deep learning model. However, we only have predictions for common SNPs (specifically the ~10 million annotation SNPs used by standard S-LDSC), not for rare variants. Should I change the v.2.2.UKB baseline-LF model to include only common SNPs? Or is there any way to include functional annotations that are only available for certain MAF classes (e.g. common SNPs)? The GWAS summary statistics I am trying to fine-map are from schizophrenia PGC3, which includes only common SNPs (MAF>0.01).

Any advice will be appreciated!

Thanks, Shizhong

szhan1000 avatar Oct 11 '22 19:10 szhan1000

Hi @szhan1000,

This is a good question, but what you're asking is more conceptual than technical. If you apply fine-mapping using only common SNPs, there's a good chance that you'll get wrong results because the true causal SNPs may not be in the data. I'm afraid I can't help with this issue: fine-mapping ideally needs access to rare SNPs.

Technically, you need to decide how you want to treat rare SNPs. You can either exclude them (and face the consequences I mentioned above), or include them and assume that none of them belongs to any of your predicted annotations. Neither option is perfect, so you need to weigh the pros and cons of each approach.
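The two options can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not PolyFun's own code: the field names `SNP` and `MAF`, the annotation name `dl_annot_1`, the toy values, and both helper functions are hypothetical, and real annotation files would need to follow the format PolyFun expects.

```python
# Hypothetical sketch: field names, annotation name, and data are assumptions,
# not part of PolyFun's API or file format.
MAF_THRESHOLD = 0.01


def exclude_rare(snps, maf_threshold=MAF_THRESHOLD):
    """Option 1: keep only SNPs whose MAF exceeds the threshold."""
    return [s for s in snps if s["MAF"] > maf_threshold]


def zero_fill_annotations(snps, annotations, annot_names):
    """Option 2: keep every SNP; SNPs without predictions get 0 for each annotation."""
    rows = []
    for s in snps:
        predicted = annotations.get(s["SNP"], {})
        row = dict(s)  # copy so the input records are not modified
        for name in annot_names:
            row[name] = predicted.get(name, 0.0)
        rows.append(row)
    return rows


# Toy data: rs2 is rare and has no deep-learning prediction.
snps = [
    {"SNP": "rs1", "MAF": 0.25},
    {"SNP": "rs2", "MAF": 0.004},
    {"SNP": "rs3", "MAF": 0.12},
]
annotations = {"rs1": {"dl_annot_1": 0.8}, "rs3": {"dl_annot_1": 0.1}}
annot_names = ["dl_annot_1"]

common_only = exclude_rare(snps)
zero_filled = zero_fill_annotations(snps, annotations, annot_names)
```

With option 1 the rare SNP disappears from the analysis entirely; with option 2 it stays in but is treated as lying outside every predicted annotation, which is itself an assumption about the data.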

I hope this helps, please let me know if not!

omerwe avatar Oct 12 '22 06:10 omerwe

Hi Omer, thank you for your quick response! I have two thoughts: 1) Most GWAS summary statistics, especially those from meta-analyses conducted by large consortia such as PGC, include only common SNPs (MAF>0.01). I think the hypothesis is that most causal variants are still common SNPs, so it would be nice if PolyFun could be adapted to only common SNPs. 2) I am actually not sure about this point, it is just a vague feeling, but LD score regression is based on the LD between common SNPs (e.g., HapMap SNPs) and SNPs within annotation regions. I wonder if rare SNPs within annotation regions contribute very little to the LD score, and thus will not contribute much to the heritability estimation?

szhan1000 avatar Oct 12 '22 14:10 szhan1000

Hi @szhan1000,

  1. I'm not sure that most causal SNPs are common. You might want to look at some papers (e.g. Gazal et al. 2018 Nat Genet, Zeng et al. 2018 Nat Genet, Schoech et al. 2019 Nat Commun, Wainschtein et al. 2022 Nat Genet). These papers argue that up to 50% of genetic heritability may be driven by low-frequency SNPs.

In any case, whether that's true or not, you're welcome to apply PolyFun to only common SNPs. You will get a result. Whether this result is reliable or not is up for debate, but this is a conceptual issue rather than a technical one.

  2. LD score regression is based on whole-genome sequencing data that takes rare SNPs into account. The papers I listed above include more details. As I mentioned above, these papers indicate that low-frequency SNPs causally explain between 10% and 50% of the heritability of various traits.

Sorry I can't be more helpful. I realize there's nothing you can do about it because you don't have access to low-frequency SNPs, but I would keep these caveats in mind.

Best,

Omer

omerwe avatar Oct 12 '22 18:10 omerwe