KBQARelationLearning
Code for our EMNLP 2021 paper - Large-Scale Relation Learning for Question Answering over Knowledge Bases with Pre-trained Language Models
Requirements
Install the requirements specified in requirements.txt (e.g., with `pip install -r requirements.txt`):
torch==1.4.0
transformers==4.3.3
tqdm==4.58.0
numpy==1.19.2
tensorboardX==2.1
matplotlib==3.3.4
Data preprocessing
Preparing WebQSP Dataset
We obtain and process the WebQSP dataset using the scripts released by GraftNet. The scripts first download the original WebQSP dataset as well as the entity links from STAGG, then compute relation and question embeddings from GloVe embeddings and run the edge-weighted Personalized PageRank (PPR) algorithm to retrieve a Freebase subgraph for each question. The maximum number of retrieved entities is set to 500.
Our preprocessing scripts are in the `graftnet_preprocessing` folder (we modified the original scripts to make them runnable under Python 3).
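For reference, the retrieval step amounts to a personalized PageRank iteration over the weighted KB graph. The sketch below is an illustrative reimplementation, not the GraftNet code; the function name, restart probability, iteration count, and the way edge weights are derived are all assumptions:

```python
import numpy as np

def edge_weighted_ppr(weighted_adj, seed_entities, restart_prob=0.2,
                      n_iters=30, max_entities=500):
    """Edge-weighted personalized PageRank over an entity graph.

    weighted_adj: (N, N) array; entry [i, j] weights the edge i -> j,
        e.g. by the similarity between the edge's relation embedding
        and the question embedding (both averaged GloVe vectors).
    seed_entities: indices of the entities linked in the question.
    Returns indices of the top-scoring entities (at most max_entities).
    """
    weighted_adj = np.asarray(weighted_adj, dtype=float)
    n = weighted_adj.shape[0]

    # Row-normalize so each row is a distribution over outgoing edges.
    row_sums = weighted_adj.sum(axis=1, keepdims=True)
    trans = np.divide(weighted_adj, row_sums,
                      out=np.zeros_like(weighted_adj), where=row_sums > 0)

    # Restart mass is concentrated on the seed entities.
    restart = np.zeros(n)
    restart[seed_entities] = 1.0 / len(seed_entities)

    scores = restart.copy()
    for _ in range(n_iters):
        scores = restart_prob * restart + (1 - restart_prob) * (trans.T @ scores)

    top = np.argsort(-scores)[:max_entities]
    return top[scores[top] > 0]
```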
Preprocessing WebQSP for BERT-based KBQA
To solve the KBQA task with BERT, we preprocess the WebQSP dataset, casting it as a question-context matching task: the model is trained to predict whether a given candidate entity is an answer to the question (a sketch of the resulting instance format follows the list below). The preprocessing includes:
- To make use of the textual information of entities and relations in KB facts, we need to find the entity name for a given Freebase mid (e.g., `<fb:m.03rk0>`). We first download the Freebase data dump to `data/freebase/freebase-rdf-latest.gz`, then use the script `data/freebase/generate_mid_to_name_mapping.py` to obtain the mapping `data/freebase/mid2name.json`.
- Then we use the script `data/webqsp_name_expanded/process.py` to convert the WebQSP dataset into the question-context matching form.
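Conceptually, each resulting instance pairs the question with a textual context built from a candidate entity's name and its linearized KB facts, labeled by whether the candidate is a gold answer. The following sketch is illustrative only; the field names and the fact linearization are our assumptions, not the exact output of `data/webqsp_name_expanded/process.py`:

```python
import json

def build_matching_example(question, candidate_mid, facts, answers, mid2name):
    """Cast one (question, candidate entity) pair as question-context matching.

    facts: KB triples (head_mid, relation, tail_mid) involving the candidate.
    mid2name: Freebase mid -> entity name (data/freebase/mid2name.json).
    """
    def name(mid):
        # Fall back to the special token when the mid has no known name.
        return mid2name.get(mid, "[unknown entity]")

    # Linearize the candidate's facts into a textual context.
    context_parts = [name(candidate_mid)]
    for head, relation, tail in facts:
        context_parts.append(f"{name(head)} {relation} {name(tail)}")

    return {
        "question": question,
        "context": " ; ".join(context_parts),
        "label": int(candidate_mid in answers),  # 1 iff candidate is an answer
    }

# Hypothetical usage:
# mid2name = json.load(open("data/freebase/mid2name.json"))
# ex = build_matching_example(question, "m.03rk0", facts, answers, mid2name)
```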
Preparing Dataset for Relation Learning
We first download the WebRED dataset to the `data/webred/WebRED` folder and the FewRel dataset to the `data/webred_fewrel_matching/fewrel/` folder.
For the Relation Extraction (RE) task, we use the script `data/webred/preprocess.py`; the processed datasets are in the `data/webred` folder.
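In the RE task, the model reads a sentence with its two entity mentions marked and predicts the relation expressed between them. The sketch below shows one plausible way to mark the spans with the special tokens listed under Training & Evaluating; the field names and span convention are our assumptions, not the exact format produced by `data/webred/preprocess.py`:

```python
def build_re_example(sentence_tokens, head_span, tail_span, relation):
    """Mark head/tail entity spans so BERT can classify their relation.

    head_span / tail_span: (start, end) token indices, end-exclusive,
    assumed non-overlapping.
    """
    tokens = list(sentence_tokens)
    marked = [
        (head_span, "[start of head entity]", "[end of head entity]"),
        (tail_span, "[start of tail entity]", "[end of tail entity]"),
    ]
    # Insert markers from right to left so earlier indices stay valid.
    for (start, end), open_tok, close_tok in sorted(
            marked, key=lambda m: m[0][0], reverse=True):
        tokens.insert(end, close_tok)
        tokens.insert(start, open_tok)
    return {"text": " ".join(tokens), "label": relation}

# build_re_example("Marie Curie was born in Warsaw .".split(),
#                  (0, 2), (5, 6), "place_of_birth")
```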
For the Relation Matching (RM) task, we use the script `data/webred_matching/preprocess.py`; the processed datasets are in the `data/webred_matching` folder. To further make use of the FewRel dataset, we use the script `data/webred_fewrel_matching/fewrel/process.py` to generate an RM dataset that uses both WebRED and FewRel as data sources, placed in the `data/webred_fewrel_matching` folder.
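RM is a sentence-pair task: decide whether two sentences express the same relation. The pair construction below is a sketch under that assumption; the sampling strategy, ratios, and field names are illustrative, not the exact behavior of the preprocessing scripts:

```python
import random

def build_rm_pairs(examples_by_relation, pairs_per_relation=10, seed=0):
    """Build sentence pairs labeled by whether they share a relation.

    examples_by_relation: {relation: [sentence, ...]}, e.g. pooled
    from WebRED and FewRel.
    """
    rng = random.Random(seed)
    relations = list(examples_by_relation)
    pairs = []
    for rel in relations:
        sents = examples_by_relation[rel]
        if len(sents) < 2 or len(relations) < 2:
            continue
        for _ in range(pairs_per_relation):
            # Positive pair: two sentences expressing the same relation.
            s1, s2 = rng.sample(sents, 2)
            pairs.append({"sent1": s1, "sent2": s2, "label": 1})
            # Negative pair: partner sentence from a different relation.
            other = rng.choice([r for r in relations if r != rel])
            s3 = rng.choice(examples_by_relation[other])
            pairs.append({"sent1": s1, "sent2": s3, "label": 0})
    return pairs
```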
For the Relation Reasoning (RR) task, we adapted the script from BERTRL into `data/freebase_pretraining_bertrl/load_data_webqsp.py`; the generated dataset will be in the `data/freebase_pretraining_bertrl/webqsp` folder.
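In the BERTRL-style RR formulation, the model judges whether a target triple holds given the KB paths connecting its head and tail, linearized into text. Below is a minimal sketch of such a linearization, assuming entity names have already been resolved; the exact format emitted by `load_data_webqsp.py` may differ:

```python
def build_rr_example(triple, paths, label):
    """Linearize a target triple and its supporting KB paths.

    triple: (head_name, relation, tail_name).
    paths: each path is a list of (entity, relation, entity) hops
        connecting head to tail in the retrieved subgraph.
    label: 1 if the target triple holds in the KB, else 0.
    """
    head, relation, tail = triple
    target = f"{head} {relation} {tail}"
    # Join the evidence paths with the [path separator] special token
    # (see the vocab.txt modifications below).
    evidence = " [path separator] ".join(
        " ".join(f"{h} {r} {t}" for h, r, t in path) for path in paths
    )
    return {"text_a": target, "text_b": evidence, "label": label}
```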
Training & Evaluating
Run the scripts in the `scripts` folder.
Note: we create special tokens by modifying the `vocab.txt` of the pre-trained BERT as follows:
[unused0] -> [unknown entity]
[unused1] -> [path separator]
[unused2] -> [start of entity]
[unused3] -> [end of entity]
[unused4] -> [self]
[unused5] -> [start of head entity]
[unused6] -> [end of head entity]
[unused7] -> [start of tail entity]
[unused8] -> [end of tail entity]
To reproduce the experimental results, you should likewise modify the `vocab.txt` of your pre-downloaded BERT (e.g., `./bert-base-uncased/vocab.txt`).
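If you want to script this substitution, here is a minimal sketch; it assumes the standard one-token-per-line `vocab.txt` layout and that the BERT files are already downloaded locally:

```python
# Map the unused slots in BERT's vocabulary to our special tokens
# (mirrors the table above).
SPECIAL_TOKENS = {
    "[unused0]": "[unknown entity]",
    "[unused1]": "[path separator]",
    "[unused2]": "[start of entity]",
    "[unused3]": "[end of entity]",
    "[unused4]": "[self]",
    "[unused5]": "[start of head entity]",
    "[unused6]": "[end of head entity]",
    "[unused7]": "[start of tail entity]",
    "[unused8]": "[end of tail entity]",
}

vocab_path = "./bert-base-uncased/vocab.txt"
with open(vocab_path, encoding="utf-8") as f:
    # vocab.txt has one token per line; line order fixes the token ids.
    tokens = [line.rstrip("\n") for line in f]
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(SPECIAL_TOKENS.get(tok, tok) for tok in tokens) + "\n")
```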
Citation
@inproceedings{yan2021large,
title={Large-Scale Relation Learning for Question Answering over Knowledge Bases with Pre-trained Language Models},
author={Yan, Yuanmeng and Li, Rumei and Wang, Sirui and Zhang, Hongzhi and Zan, Daoguang and Zhang, Fuzheng and Wu, Wei and Xu, Weiran},
booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
pages={3653--3660},
year={2021}
}