SynonymNet
Entity Synonym Discovery via Multipiece Bilateral Context Matching (IJCAI'20) https://arxiv.org/abs/1901.00056
This project provides source code and data for SynonymNet, a model that detects entity synonyms via multipiece bilateral context matching.
Details about SynonymNet can be found in the paper linked above, and the implementation is based on the TensorFlow library.
Quick Links
- Installation
- Usage
- Data
- Reference
Installation
For training, a GPU is recommended to accelerate training.
Tensorflow
The code is based on TensorFlow 1.5 and also runs on TensorFlow 1.15.0. You can find installation instructions here.
Dependencies
The code is written in Python 3.7. Its dependencies are summarized in the file requirements.txt:
tensorflow_gpu==1.15.0
numpy==1.14.0
pandas==0.25.1
gensim==3.8.1
scikit_learn==0.21.2
You can install these dependencies like this:
pip3 install -r requirements.txt
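As a quick sanity check that the pinned versions are in place, the snippet below (an illustrative helper, not part of the repository) prints the installed versions and whether TensorFlow can see a GPU:

```python
# check_env.py -- illustrative environment check, not part of the repository.
import tensorflow as tf
import gensim
import sklearn

print("TensorFlow:", tf.__version__)                 # expected: 1.15.0
print("GPU available:", tf.test.is_gpu_available())  # True if CUDA is set up
print("gensim:", gensim.__version__)                 # expected: 3.8.1
print("scikit-learn:", sklearn.__version__)          # expected: 0.21.2
```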
Usage
- Run the model on the Wikipedia+Freebase dataset with the siamese architecture and the default hyperparameter settings:
  cd src
  python3 train_siamese.py --dataset=wiki
- For all available hyperparameter settings, use:
  python3 train_siamese.py -h
- Run the model on the Wikipedia+Freebase dataset with the triplet architecture and the default hyperparameter settings:
  cd src
  python3 train_triplet.py --dataset=wiki
Data
Format
Each dataset is a folder under the ./input_data folder, where each sub-folder indicates a train/val/test split:
./input_data
└── wiki
    ├── train
    │   ├── siamese_contexts.txt
    │   └── triple_contexts.txt
    ├── valid
    │   ├── siamese_contexts.txt
    │   └── triple_contexts.txt
    ├── test
    │   ├── knn-siamese_contexts.txt
    │   ├── knn_triple_contexts.txt
    │   ├── siamese_contexts.txt
    │   └── triple_contexts.txt
    ├── skipgram-vec200-mincount5-win5.bin
    ├── fasttext-vec200-mincount5-win5.bin
    └── in_vocab (built during training)
In each sub-folder,

- siamese_contexts.txt contains entities and contexts for the siamese architecture. Each line has five columns, separated by \t: entity_a \t entity_b \t context_a1@@context_a2@@...@@context_an \t context_b1@@context_b2@@...@@context_bn \t label. entity_a and entity_b indicate two entities, e.g. u.s._government||m.01bqks|| and united_states||m.01bqks||. The next two columns give the contexts of the two entities: context_a1@@context_a2@@...@@context_an indicates n pieces of contexts in which entity_a is mentioned, with @@ used to separate contexts. label is a binary value indicating synonymity (see the parsing sketch after this list).
- triple_contexts.txt contains entities and contexts for the triplet architecture. Each line has six columns, separated by \t: entity_a \t entity_pos \t entity_neg \t context_a1@@context_a2@@...@@context_an \t context_pos_1@@context_pos_2@@...@@context_pos_p \t context_neg_1@@context_neg_2@@...@@context_neg_q, where entity_a denotes one entity, entity_pos denotes a synonym of entity_a, and entity_neg is a negative sample for entity_a.
- *-vec200-mincount5-win5.bin is a binary file that stores the pre-trained word embeddings trained on the corpus of the dataset.
- in_vocab is a vocabulary file generated automatically during training.
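To make the column layout concrete, here is a minimal reading sketch for both file formats; the function names and dictionary keys are illustrative, not part of the repository's loaders:

```python
# parse_contexts.py -- illustrative readers for the two context file formats
# (hypothetical helpers; the repository's own data loaders may differ).

def parse_siamese_line(line):
    """entity_a \t entity_b \t contexts_a \t contexts_b \t label"""
    entity_a, entity_b, ctx_a, ctx_b, label = line.rstrip("\n").split("\t")
    return {
        "entity_a": entity_a,             # e.g. u.s._government||m.01bqks||
        "entity_b": entity_b,             # e.g. united_states||m.01bqks||
        "contexts_a": ctx_a.split("@@"),  # n context pieces mentioning entity_a
        "contexts_b": ctx_b.split("@@"),  # n context pieces mentioning entity_b
        "label": int(label),              # 1 = synonyms, 0 = not
    }

def parse_triplet_line(line):
    """entity_a \t entity_pos \t entity_neg \t contexts_a \t contexts_pos \t contexts_neg"""
    entity_a, entity_pos, entity_neg, ctx_a, ctx_p, ctx_n = line.rstrip("\n").split("\t")
    return {
        "entity_a": entity_a,
        "entity_pos": entity_pos,         # a synonym of entity_a
        "entity_neg": entity_neg,         # a negative sample for entity_a
        "contexts_a": ctx_a.split("@@"),
        "contexts_pos": ctx_p.split("@@"),
        "contexts_neg": ctx_n.split("@@"),
    }
```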
Download
Pre-trained word vectors and datasets can be downloaded here:
| Dataset | Link |
|---|---|
| Wikipedia + Freebase | https://drive.google.com/open?id=1uX4KU6ws9xIIJjfpH2He-Yl5sPLYV0ws |
| PubMed + UMLS | https://drive.google.com/open?id=1cWHhXVd_Pb4N3EFdpvn4Clk6HVeWKVfF |
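Once downloaded, the pre-trained vectors can be inspected with gensim. This sketch assumes the .bin files use the word2vec binary format, which the README does not state explicitly; adjust the loading call if they are stored as native gensim models:

```python
from gensim.models import KeyedVectors

# Assumption: word2vec binary format; swap in gensim's native loader if needed.
vectors = KeyedVectors.load_word2vec_format(
    "input_data/wiki/skipgram-vec200-mincount5-win5.bin", binary=True)

print(vectors.vector_size)  # expected: 200, per "vec200" in the file name
# Entity tokens follow the format used in the context files; whether a given
# token is in the embedding vocabulary is an assumption.
print(vectors.most_similar("united_states||m.01bqks||", topn=5))
```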
Work on your own data
Prepare and organize your dataset in a folder according to the format above, put it under ./input_data/, and use --dataset=foldername during training. For example, if your dataset is in ./input_data/mydata, use the flag --dataset=mydata for train_triplet.py. Your dataset should be separated into three sub-folders for training, validation, and testing, which are named train, valid, and test by the default settings of train_triplet.py and train_siamese.py. A scaffolding sketch is shown below.
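The sketch below scaffolds such a folder with one toy siamese line (the entity names and contexts are illustrative only); remember to also place a pre-trained embedding .bin in the dataset folder, per the tree above:

```python
# make_dataset.py -- illustrative scaffold for a custom dataset; toy contents only.
import os

root = "input_data/mydata"
for split in ("train", "valid", "test"):
    os.makedirs(os.path.join(root, split), exist_ok=True)

# One toy siamese line: entity_a \t entity_b \t contexts_a \t contexts_b \t label.
line = "\t".join([
    "u.s._government||m.01bqks||",
    "united_states||m.01bqks||",
    "the u.s._government said ...@@a u.s._government report ...",
    "the united_states is ...@@the united_states announced ...",
    "1",  # 1 = the two entities are synonyms
])
with open(os.path.join(root, "train", "siamese_contexts.txt"), "w") as f:
    f.write(line + "\n")

# Then train with: python3 train_siamese.py --dataset=mydata
```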
Reference
@inproceedings{zhang2020entity,
title={Entity Synonym Discovery via Multipiece Bilateral Context Matching},
author={Zhang, Chenwei and Li, Yaliang and Du, Nan and Fan, Wei and Yu, Philip S},
booktitle={Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI)},
year={2020}
}