multi-drug-embedding
Method for drug repurposing from knowledge graphs and literature
Drug repurposing through joint learning on knowledge graphs and literature
Here, we developed a novel method that combines information in literature and structured databases, and applies feature learning to generate vector space embeddings. We apply our method to the identification of drug targets and indications for known drugs based on heterogeneous information about drugs, target proteins, and diseases. We demonstrate that our method is able to combine complementary information from both structured databases and from literature.
Below are the steps of the drug repurposing pipeline.
Requirements
- python 2.7.6
- numpy
- keras
- Boost libraries, for running the multithreaded random walk implementation
Running
- Build the graph as described in link
- The output graph is in the data folder in this repository
- Before generating the corpus, remove the has-target edges for drug-target interaction prediction, and the has-indication edges for drug indication prediction:
python remove_relation_links.py
- Generate the knowledge graph corpus from the edgelist, after removing the edges, by running
./deepwalk ../data/edgelist_WalkingRDFOWL_has_indication_free.txt ../data/corpus_WalkingRDFOWL_has_indication_free.txt
- Run word2vec on the generated corpus
python word2vec_gensim.py
- Normalize the knowledge graph entities with the PubMed abstracts corpus by running
python normalize_text.py
- Use the normalized PubMed abstracts corpus from the normalization step with Word2Vec to create independent PubMed abstracts embeddings.
- Combine the normalized PubMed abstracts corpus with the knowledge graph corpus, similar to the following:
cat ../data/corpus_WalkingRDFOWL_has_indication_free.txt ../data/medline_abstracts_mapped_drugsrepo.txt > ../data/combined_corpus.txt
- Run Word2Vec on the combined corpus.
- Run Ind_ann_graph_common.py and the other scripts to train the artificial neural networks with the different embeddings from the knowledge graph and PubMed abstracts, which are available in the data folder.
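The edge-removal and corpus-generation steps above can be sketched in a few lines of Python. This is only an illustration, not the repository's remove_relation_links.py or its Boost-based deepwalk binary: it assumes a hypothetical edgelist of `source target relation` triples, and the function names `remove_relation` and `random_walks`, their parameters, and the toy graph are all made up for the example.

```python
import random
from collections import defaultdict


def remove_relation(edges, relation):
    """Drop every edge whose label equals `relation` (e.g. has-indication)."""
    return [(s, t, r) for (s, t, r) in edges if r != relation]


def random_walks(edges, walks_per_node=2, walk_length=4, seed=0):
    """Generate uniform random walks; each walk is one corpus 'sentence'.

    Edge labels are emitted between nodes so that relations also become
    tokens (an assumption about the corpus layout, not taken from the repo).
    """
    rng = random.Random(seed)
    neighbours = defaultdict(list)
    for s, t, r in edges:
        neighbours[s].append((r, t))

    walks = []
    for start in sorted(neighbours):
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_length):
                options = neighbours.get(node)
                if not options:
                    break  # dead end: node has no outgoing edges
                r, node = rng.choice(options)
                walk.extend([r, node])
            walks.append(walk)
    return walks


# Hypothetical toy graph; identifiers are placeholders, not real data.
edges = [
    ("drugA", "diseaseX", "has-indication"),
    ("drugA", "geneG", "has-target"),
    ("geneG", "diseaseX", "associated-with"),
    ("diseaseX", "geneG", "associated-with"),
]
filtered = remove_relation(edges, "has-indication")
corpus = random_walks(filtered)
for walk in corpus:
    print(" ".join(walk))
```

Each printed line plays the role of one sentence in the knowledge graph corpus that Word2Vec is later trained on. Because the has-indication edges are removed first, no walk can leak the relation that is to be predicted.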
Data
Knowledge graph and literature
The PubMed abstracts used in this project were downloaded from [PubTator](ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/). The normalization script can be used to normalize the knowledge graph and literature. The normalized corpus used in this study is available upon request.
The knowledge graph edgelist is edgelist_WalkingRDFOWL.txt and the mapping to knowledge graph nodes is mapping_WalkingRDFOWL.txt.
Embeddings
- embeddings_WalkingRDFOWL_has_indication_free.txt: knowledge graph embeddings for predicting drug indications
- embeddings_WalkingRDFOWL_has_targets_free.txt: knowledge graph embeddings for predicting drug targets
- drugs_text_embeddings.txt, diseases_text_embeddings.txt and genes_text_embeddings.txt: Medline abstracts embeddings
- drugs_embeddings_combined_has_indication.txt, diseases_embeddings_combined_has_indication.txt and genes_embeddings_combined_has_indication.txt: embeddings jointly trained on the knowledge graph and Medline abstracts
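To illustrate how embeddings from the two sources might be joined for a single entity (the "Concatenated embeddings" setting in the sample results), here is a minimal sketch. The file layout (one entity ID followed by space-separated floats per line) is an assumption, as are the helper names `parse_embeddings` and `concatenate`; the values are placeholders.

```python
def parse_embeddings(lines):
    """Parse 'entity_id v1 v2 ...' lines into {entity_id: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def concatenate(graph_vectors, text_vectors):
    """Concatenate graph and text vectors for entities present in both."""
    shared = sorted(set(graph_vectors) & set(text_vectors))
    return {e: graph_vectors[e] + text_vectors[e] for e in shared}


# Placeholder content standing in for lines read from the embedding files.
graph_lines = ["drugA 0.1 0.2", "drugB 0.3 0.4"]
text_lines = ["drugA 0.5 0.6 0.7", "drugC 0.8 0.9 1.0"]

combined = concatenate(parse_embeddings(graph_lines), parse_embeddings(text_lines))
print(combined)
```

Only entities that appear in both sources receive a concatenated vector in this sketch; how the actual scripts treat entities missing from one source is not specified here.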
Evaluations and Mapping
All generated embeddings and the mapping data used to normalize literature information to the knowledge graph are available as Python dictionaries in the data folder. The evaluation sets for drug indications (drugs2ind_doid.dict) and drug targets (drugs2tars_stitch.dict) are available as well.
The drug indications are from the SIDER database and the drug targets are from the STITCH database. Chemical aliases from STITCH were used to convert drug mentions in text to STITCH IDs; this mapping is available in chemical_map.dict.
The Disease Ontology was used to extract the MESH to DOID mapping in mesh2doid.dict and the OMIM to DOID mapping in omim2doid.dict.
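The mapping dictionaries are what allow entity mentions in the abstracts to be rewritten as knowledge graph identifiers, so that the two corpora can share tokens. A minimal sketch of that idea follows. The token format, the placeholder IDs, and the function `normalize_tokens` are hypothetical and are not taken from normalize_text.py.

```python
def normalize_tokens(tokens, *mappings):
    """Replace each token found in any mapping with its mapped identifier.

    Mappings are tried in order; unmapped tokens are kept unchanged.
    """
    normalized = []
    for token in tokens:
        for mapping in mappings:
            if token in mapping:
                token = mapping[token]
                break
        normalized.append(token)
    return normalized


# Placeholder dictionaries shaped like the MESH-to-DOID and chemical maps.
mesh2doid = {"MESH:EXAMPLE1": "DOID:EXAMPLE1"}
chemical_map = {"exampledrug": "CIDEXAMPLE"}

sentence = "exampledrug is used to treat MESH:EXAMPLE1".split()
print(" ".join(normalize_tokens(sentence, mesh2doid, chemical_map)))
```

After this kind of rewriting, a disease mentioned in an abstract and the same disease reached by a random walk would map to one token, which is what makes training on a combined corpus meaningful.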
Predictions
We make drug indication predictions for approved drugs from SIDER available in predicted_indications_approved_processed.tsv in the data folder.
Each row lists the drug ID and drug name, the indication's Disease Ontology ID and name, and the prediction score. The full list of tested drugs and the predicted ranks for indications and targets are included as indications_ranked_graph.txt, indications_ranked_concat_embeddings.txt, indications_ranked_concat_corpus.txt, etc.
In these files, the first field is the drug's PubChem ID, followed by the diseases and their ranks.
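A rank in these files can be read as the position of a disease when all candidates for a given drug are sorted by prediction score. The sketch below shows one way to compute such a rank. The scores are placeholders and `rank_of` is a hypothetical helper, not a function from this repository; the tie-handling convention is likewise an assumption.

```python
def rank_of(target, scores):
    """Return the 1-based rank of `target` when sorted by descending score.

    Candidates scoring strictly higher than the target push it down;
    ties do not (an assumed convention).
    """
    target_score = scores[target]
    return 1 + sum(1 for s in scores.values() if s > target_score)


# Placeholder prediction scores for one drug over candidate diseases.
scores = {"diseaseX": 0.91, "diseaseY": 0.42, "diseaseZ": 0.77}

for disease in sorted(scores, key=scores.get, reverse=True):
    print(disease, rank_of(disease, scores))
```

A rank of 1 means the known indication received the highest score among all candidate diseases for that drug.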
For the complete data including the mapping files, embeddings and normalized PubMed corpus, please download from here
Sample results
The tables below illustrate a few examples of the method's ability to combine complementary information from the knowledge graph and the literature, which results in improved prediction ranks for drug indications and targets.
Drug | Indication | Knowledge graph | Pubmed abstracts | Concatenated embeddings | Concatenated corpora |
---|---|---|---|---|---|
CID00002678 (Cetirizine) | allergic hypersensitivity disease (DOID:1205) | ranked 34 | ranked 4 | ranked 1 | ranked 10 |
CID05464096 (Ramiprilat) | cerebrovascular disease (DOID:6713) | ranked 76 | ranked 1 | ranked 1 | ranked 3 |
CID00002786 (Clindamycin) | impetigo (DOID:8504) | ranked 16 | ranked 11 | ranked 1 | ranked 1 |
CID00002658 (Cefuroxime) | pneumonia (DOID:552) | ranked 46 | ranked 7 | ranked 3 | ranked 1 |
CID00004091 (Metformin) | diabetes mellitus (DOID:9351) | ranked 3 | ranked 6 | ranked 1 | ranked 3 |
CID00003310 (Etoposide) | leukemia (DOID:1240) | ranked 177 | ranked 3 | ranked 11 | ranked 1 |

Drug | Target (Entrez gene ID) | Knowledge graph | Pubmed abstracts | Concatenated embeddings | Concatenated corpora |
---|---|---|---|---|---|
CID00004048 (Megestrol acetate) | 2908 | ranked 13 | ranked 10 | ranked 6 | ranked 4 |
CID00004934 (Propantheline) | 1131 | ranked 91 | ranked 13 | ranked 1 | ranked 1 |
CID00003155 (Dothiepin) | 1129 | ranked 62 | ranked 26 | ranked 19 | ranked 1 |
CID00004666 (Paclitaxel) | 7157 | ranked 5 | ranked 3 | ranked 5 | ranked 2 |
CID00003640 (Cortisol) | 1551 | ranked 13 | ranked 20 | ranked 3 | ranked 10 |
CID00004594 (Omeprazole) | 1544 | ranked 53 | ranked 18 | ranked 7 | ranked 2 |
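As a quick check on the indication table above, the ranks can be compared programmatically. The snippet below copies the indication ranks from that table and counts how often the better of the two combined representations beats or matches the better of the two single-source ones. The variable names are illustrative only.

```python
# Ranks from the drug indication table, in column order:
# knowledge graph, PubMed abstracts, concatenated embeddings, concatenated corpora.
indication_ranks = {
    "Cetirizine": (34, 4, 1, 10),
    "Ramiprilat": (76, 1, 1, 3),
    "Clindamycin": (16, 11, 1, 1),
    "Cefuroxime": (46, 7, 3, 1),
    "Metformin": (3, 6, 1, 3),
    "Etoposide": (177, 3, 11, 1),
}

improved, tied = [], []
for drug, (graph, text, concat_emb, concat_corpus) in indication_ranks.items():
    best_single = min(graph, text)          # best rank from one source alone
    best_combined = min(concat_emb, concat_corpus)  # best rank from a combination
    if best_combined < best_single:
        improved.append(drug)
    elif best_combined == best_single:
        tied.append(drug)

print("combined better:", improved)
print("combined equal:", tied)
```

For these six selected examples, a combined representation is never worse than the best single source. This holds for the examples shown only and says nothing about overall performance.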