
Connecting GWAS database

Open healthbitai opened this issue 4 years ago • 3 comments

Hi Daniel,

My name is Nilesh Dharajiya, MD, and I am a molecular pathologist by training. I came across het.io recently and am very impressed by it. I have been working with Neo4j for the last 3 years and like the prospect of graph databases in medical science. I saw that Hetionet combines data from 29 databases, which do not include a GWAS database.

However, your abstract "Heterogeneous network link prediction prioritizes disease-associated genes" shows 698 associations extracted from the GWAS Catalog.

Have you ever tried merging GWAS data into the het.io graph db? I am especially interested in detecting relationships from clinical symptoms to diseases to genes, and in calculating polygenic risk scores from GWAS data, so that starting from a symptom I can go all the way to the genes involved in the pathogenesis. Is this possible? Do you know anyone who has done this?

Looking forward to hearing from you.

Best regards,

Nilesh

healthbitai, Mar 26 '20 22:03

However, your abstract "Heterogeneous network link prediction prioritizes disease-associated genes" shows 698 associations extracted from the GWAS Catalog.

For clarification, there are two main studies we've conducted regarding edge prediction on hetnets. From https://het.io/about/#cite:

[image]

Hetionet v1.0 was created as part of Project Rephetio, i.e. "Systematic integration of biomedical knowledge prioritizes drugs for repurposing". So you can mostly ignore "Heterogeneous network link prediction prioritizes disease-associated genes" and just focus on Hetionet v1.0.

I saw that Hetionet combines data from 29 databases, which do not include a GWAS database.

GWAS data does make it into Hetionet. Copying from the methods:

Disease--associates--Gene edges were extracted from the GWAS Catalog [130], DISEASES [131,132], DisGeNET [133,134], and DOAF [135,136]. The GWAS Catalog compiles disease--SNP associations from published GWAS [137]. We aggregated overlapping loci associated with each disease and identified the mode reported gene for each high confidence locus [138,139]. DISEASES integrates evidence of association from text mining, curated catalogs, and experimental data [140]. Associations from DISEASES with integrated scores ≥ 2 were included after removing the contribution of DistiLD. DisGeNET integrates evidence from over 10 sources and reports a single score for each association [141,142]. Associations with scores ≥ 0.06 were included. DOAF mines Entrez Gene GeneRIFs (textual annotations of gene function) for disease mentions [143]. Associations with 3 or more supporting GeneRIFs were included.
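As a rough illustration of the inclusion thresholds quoted above, here is a minimal Python sketch. It only encodes the three cutoffs stated in the methods text (DISEASES ≥ 2, DisGeNET ≥ 0.06, DOAF ≥ 3 supporting GeneRIFs). The record layout, the function names, and the toy records are assumptions made for illustration and are not taken from the Hetionet codebase. The sketch does not model the removal of the DistiLD contribution or the GWAS Catalog locus processing.

```python
# Minimum score (or GeneRIF count, for DOAF) required to keep an association,
# as stated in the quoted methods text.
THRESHOLDS = {"DISEASES": 2.0, "DisGeNET": 0.06, "DOAF": 3}


def passes_threshold(source, score):
    """Return True if an association from `source` meets its cutoff."""
    return score >= THRESHOLDS[source]


def filter_associations(records):
    """Keep only the records whose score meets the cutoff for their source.

    Each record is a dict with hypothetical keys:
    'disease', 'gene', 'source', 'score'.
    """
    return [r for r in records if passes_threshold(r["source"], r["score"])]


# Toy records with placeholder identifiers (not real Hetionet data).
toy_records = [
    {"disease": "D1", "gene": "G1", "source": "DISEASES", "score": 2.0},
    {"disease": "D1", "gene": "G2", "source": "DisGeNET", "score": 0.05},
    {"disease": "D2", "gene": "G3", "source": "DOAF", "score": 3},
    {"disease": "D2", "gene": "G4", "source": "DOAF", "score": 2},
]

kept = filter_associations(toy_records)
print([r["gene"] for r in kept])
```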

The most important supplemental discussion on how we processed the GWAS catalog is at https://doi.org/10.15363/thinklab.d80.

I am especially interested in detecting relationships from clinical symptoms to diseases to genes

Cool. One place to start would be putting a symptom into https://het.io/search/ and then subsetting to genes for the target node:

[Screenshot from 2020-03-27 09-57-38]

This is a more manual, exploratory approach. There are also more automated, high-throughput approaches you could take with some scripting / programming.
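One possible scripted route is to query a Hetionet Neo4j database with Cypher. The sketch below only builds a parameterized query for symptom to disease to gene paths and does not contact any server. It assumes the node labels `Symptom`, `Disease`, and `Gene`, the relationship types `PRESENTS_DpS` and `ASSOCIATES_DaG`, and a `name` property on nodes. Check these against the schema of the instance you use. The function name is hypothetical.

```python
def symptom_to_gene_query(symptom_name, limit=25):
    """Return (cypher, params) for Symptom - Disease - Gene paths.

    Assumed schema: labels Symptom/Disease/Gene, relationship types
    PRESENTS_DpS and ASSOCIATES_DaG, and a `name` property on nodes.
    """
    cypher = (
        "MATCH (s:Symptom)-[:PRESENTS_DpS]-(d:Disease)-[:ASSOCIATES_DaG]-(g:Gene)\n"
        "WHERE s.name = $symptom\n"
        "RETURN g.name AS gene,\n"
        "       count(DISTINCT d) AS n_diseases,\n"
        "       collect(DISTINCT d.name) AS diseases\n"
        "ORDER BY n_diseases DESC\n"
        "LIMIT $limit"
    )
    # The symptom is passed as a query parameter rather than pasted into the
    # query text, so names containing quotes cannot break the query.
    params = {"symptom": symptom_name, "limit": int(limit)}
    return cypher, params


cypher, params = symptom_to_gene_query("Fever", limit=10)
print(cypher)
print(params)
```

The resulting query and parameters could then be passed to a Neo4j driver session or adapted for the Neo4j Browser. Genes are ordered by how many diseases presenting the symptom they are associated with, which is just one simple choice of ranking.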

dhimmel, Mar 27 '20 13:03

Thank you very much, Daniel. I would like to learn more about automated high-throughput approaches for querying the database. Can you point me to them?

Also, in the search, I entered breast cancer as the source node, but the target nodes do not show common predisposition genes such as BRCA1 and BRCA2. Why is that?

Best,

Nilesh

healthbitai, Mar 27 '20 17:03

I entered breast cancer as the source node, but the target nodes do not show common predisposition genes such as BRCA1 and BRCA2. Why is that?

The approach does find many types of paths that occur more than expected by chance between breast cancer and BRCA1 and BRCA2. So I think the question is more why BRCA1 and BRCA2 don't show up as the top results for breast cancer:

[Screenshot from 2020-03-28 09-00-05]

One reason is that the metric we're ranking by is simplistic. The search result ranking is by number of significant types of paths. It does not take into account how significant those types of paths are. That being said, the top result of MYC has 19 significant types of paths (metapaths), while BRCA1 has 12... so it's not that far from the top.
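To make the limitation concrete, here is a small Python sketch contrasting a count-based ranking with one that also weighs how significant each path type is. It is not the het.io implementation. The p-value cutoff `alpha`, the summed `-log10(p)` score, the function names, and the toy p-values for the placeholder targets `GENE_A` and `GENE_B` are all assumptions chosen purely for illustration.

```python
import math


def rank_by_count(pvals_by_target, alpha=0.05):
    """Rank targets by the number of path types with p < alpha."""
    counts = {
        target: sum(p < alpha for p in pvals)
        for target, pvals in pvals_by_target.items()
    }
    # Sort by descending count, breaking ties alphabetically.
    return sorted(counts, key=lambda t: (-counts[t], t))


def rank_by_strength(pvals_by_target, alpha=0.05):
    """Rank targets by summed -log10(p) over path types with p < alpha.

    Assumes every p-value is strictly greater than zero.
    """
    scores = {
        target: sum(-math.log10(p) for p in pvals if p < alpha)
        for target, pvals in pvals_by_target.items()
    }
    return sorted(scores, key=lambda t: (-scores[t], t))


# Hypothetical toy p-values, one per path type (not real Hetionet results).
toy = {
    "GENE_A": [0.04, 0.03, 0.02],  # three barely significant path types
    "GENE_B": [1e-6, 1e-8],        # two strongly significant path types
}

print(rank_by_count(toy))     # GENE_A first: it has more significant path types
print(rank_by_strength(toy))  # GENE_B first: its path types are far stronger
```

With these toy inputs the two rankings disagree, which is the kind of effect that can push a well-known gene below the top spot when only the count is used.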

Will make another comment to address the rest of your questions.

dhimmel, Mar 28 '20 13:03