Disease_gene_prioritization_GCN
PGCN: Disease gene prioritization by disease and gene embedding through GCN
Disease gene prioritization is a fundamental step towards molecular diagnosis and treatment of diseases. This problem is highly challenging because knowledge of genes, diseases and, even more so, their associations is very limited and noisy. Despite the development of computational methods for disease gene prioritization, the performance of the existing methods is limited by manually-crafted features, network topology, or pre-defined rules of data fusion. Here we propose a novel graph convolutional network-based disease gene prioritization method, PGCN, through the systematic embedding of the heterogeneous network made by genes and diseases, as well as their individual features. The embedding learning model and the association prediction model are trained together in an end-to-end manner. We compared PGCN with five state-of-the-art methods on the Online Mendelian Inheritance in Man (OMIM) dataset, by challenging them on recovering missing associations, and on discovering associations for novel genes and/or diseases that are not seen during training. The results show significant improvements of PGCN over the existing methods. We further demonstrate that our embedding has biological meaning and can capture functional groups of genes.
For more details, please refer to the paper.
@article{li2019pgcn,
title={PGCN: Disease gene prioritization by disease and gene embedding through graph convolutional neural networks},
author={Li, Yu and Kuwahara, Hiroyuki and Yang, Peng and Song, Le and Gao, Xin},
journal={bioRxiv},
pages={532226},
year={2019},
publisher={Cold Spring Harbor Laboratory}
}
Overview
Tested environment
- Centos 7
- Python 3.6.7
Install requirements
All the required packages are listed in requirements.txt. They can be installed with the following command.
pip install -r requirements.txt
(It is better to create a virtual environment with conda and install the packages inside that environment.)
Download the data
Due to the file size limit on GitHub, we store the data on Google Drive. Please download the data here: data.
Run the code
One can run the code using the following command after configuring the environment and downloading the data.
python main_prioritization.py
Result
The prediction matrix file can be downloaded here: result.
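As a rough illustration of how a prediction matrix can be turned into a prioritized gene list, the sketch below ranks genes for one disease by their predicted score. It assumes, without confirmation from this repository, that rows correspond to genes and columns to diseases. The function name `top_k_genes`, the toy matrix, and the gene names are hypothetical and only serve the example.

```python
def top_k_genes(scores, gene_names, disease_index, k=3):
    """Return the k highest-scoring (gene, score) pairs for one disease.

    Assumption: scores[g][d] is the predicted association score between
    gene g and disease d (rows = genes, columns = diseases).
    """
    # Extract the column of scores that belongs to the chosen disease.
    column = [row[disease_index] for row in scores]
    # Sort genes from the highest to the lowest predicted score.
    ranked = sorted(zip(gene_names, column), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical toy data: 3 genes x 2 diseases.
toy_scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.4],
]
toy_genes = ["GENE_A", "GENE_B", "GENE_C"]

top_for_first_disease = top_k_genes(toy_scores, toy_genes, disease_index=0, k=2)
```

If the downloaded matrix uses the opposite orientation, the column extraction would need to be replaced by a row lookup.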
Here is the embedding clustering result. For more explanation, please refer to the manuscript.
More explanation
To calculate BEDROC, we provide the function from the skchem package for reference. For a more accurate calculation, one can output the predictions and use R packages to do the calculation.
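For readers unfamiliar with the metric, here is a minimal pure-Python sketch of BEDROC, written from the standard definition (robust initial enhancement, RIE, rescaled between its minimum and maximum). It is not the function shipped in this repository or in skchem. The name `bedroc_score` and the default `alpha=20.0` are assumptions made for illustration, and ties in the scores are simply broken by input order.

```python
import math


def bedroc_score(y_true, y_score, alpha=20.0):
    """BEDROC for binary labels and real-valued scores (illustrative sketch).

    y_true: sequence of 0/1 labels (1 = true association).
    y_score: sequence of predicted scores, higher means ranked earlier.
    alpha: early-recognition weight; larger values emphasize the top ranks.
    """
    n_total = len(y_true)
    n_pos = sum(1 for y in y_true if y)
    if n_pos == 0 or n_pos == n_total:
        raise ValueError("BEDROC needs at least one positive and one negative.")

    # Rank items by descending score; rank 1 is the best-scored item.
    order = sorted(range(n_total), key=lambda i: -y_score[i])
    sum_exp = sum(
        math.exp(-alpha * rank / n_total)
        for rank, i in enumerate(order, start=1)
        if y_true[i]
    )

    ratio = n_pos / n_total
    # Expected exponential sum for uniformly random ranks, scaled by the ratio.
    random_sum = ratio * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    rie = sum_exp / random_sum

    # RIE values for the best and the worst possible rankings.
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))

    # Rescale so the worst ranking maps to 0 and the best to 1.
    return (rie - rie_min) / (rie_max - rie_min)


# Toy example: 2 true associations among 10 candidates.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
example_value = bedroc_score(labels, scores)
```

Because the result depends on `alpha` and on how ties are handled, values from this sketch may differ slightly from those produced by other packages.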
Credits
We would like to thank the SNAP group for open-sourcing the decagon code: decagon.
This tool is for academic purposes and research use only. Any commercial use is subject to authorization from King Abdullah University of Science and Technology “KAUST”. Please contact us at [email protected].