neural_name_tagging
Code for "Reliability-aware Dynamic Feature Composition for Name Tagging" (ACL2019)
Dynamic Feature Composition for Name Tagging
Code for our ACL2019 paper Reliability-aware Dynamic Feature Composition for Name Tagging.
Input Data Set Directory Structure
- <input_dir>
  - embed.vocab.tsv (embedding vocab file, 1st column: token, 2nd column: index)
  - embed.count.tsv (embedding token frequency file, 1st column: token, 2nd column: frequency)
  - bc
    - train.tsv (training set)
    - dev.tsv (development set)
    - test.tsv (test set)
    - token.vocab.tsv (token vocab file, 1st column: token, 2nd column: index)
    - char.vocab.tsv (character vocab file, 1st column: character, 2nd column: index)
    - label.vocab.tsv (label vocab file, 1st column: label, 2nd column: index)
  - bn
  - mz
  - nw
  - tc
  - wb
Note:
- Other subsets have train.tsv, dev.tsv, test.tsv, token.vocab.tsv, char.vocab.tsv, and label.vocab.tsv in their directories.
- In our experiments, we generated *.vocab.tsv from a merged data set of all subsets.
- In our experiments, we use CoNLL format files generated from OntoNotes 5.0 with Pradhan et al.'s scripts, which can be found at https://cemantix.org/data/ontonotes.html.
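As a quick sanity check before training, the layout described above can be verified and the two-column files loaded with a few lines of Python. This is a minimal sketch, not part of the repository: the helper names `load_two_column_tsv` and `missing_files` are hypothetical, and it assumes tab-separated files with the embedding files at the top level and one directory per subset.

```python
from pathlib import Path

# File names come from the listing above. The layout (embedding files at the
# top level, one directory per subset) is an assumption of this sketch.
TOP_LEVEL_FILES = ["embed.vocab.tsv", "embed.count.tsv"]
SUBSET_FILES = [
    "train.tsv", "dev.tsv", "test.tsv",
    "token.vocab.tsv", "char.vocab.tsv", "label.vocab.tsv",
]


def load_two_column_tsv(path, value_type=int):
    """Read a `<key>TAB<value>` file (vocab or frequency file) into a dict."""
    mapping = {}
    with open(str(path), encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue
            # Split on the last tab so the key is kept intact.
            key, value = line.rsplit("\t", 1)
            mapping[key] = value_type(value)
    return mapping


def missing_files(input_dir, subsets):
    """Return the expected files (relative paths) that do not exist."""
    root = Path(input_dir)
    expected = [root / name for name in TOP_LEVEL_FILES]
    expected += [root / subset / name
                 for subset in subsets for name in SUBSET_FILES]
    return [str(p.relative_to(root)) for p in expected if not p.is_file()]
```

For example, `missing_files("<input_dir>", ["bc", "bn", "mz", "nw", "tc", "wb"])` returns an empty list when every expected file is in place.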
Pre-processing
The following functions in proprocess.py can be used to create vocab and frequency files.
- build_all_vocabs takes a list of CoNLL format files as input and generates {token,char,label}.vocab.tsv in output_dir.
- build_embed_vocab takes a pre-trained embedding file as input and returns the embedding vocab.
- build_embed_token_count takes a pre-trained embedding file as input and generates an embedding token frequency file.
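To illustrate what building the token, character, and label vocabs involves, here is a rough sketch. It is not the repository's implementation: `read_conll_columns`, `build_vocabs`, and `to_tsv` are hypothetical names, and the sketch assumes whitespace-separated CoNLL lines with the token in the first column and the label in the last, plus `<PAD>` and `<UNK>` entries that the real files may or may not contain.

```python
from collections import Counter


def read_conll_columns(lines, token_col=0, label_col=-1):
    """Yield (token, label) pairs, skipping blank sentence separators.

    Assumption: whitespace-separated columns, token first, label last.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            continue
        cols = line.split()
        yield cols[token_col], cols[label_col]


def build_vocabs(lines, specials=("<PAD>", "<UNK>")):
    """Return (token_vocab, char_vocab, label_vocab) as item -> index dicts."""
    tokens, chars, labels = Counter(), Counter(), Counter()
    for token, label in read_conll_columns(lines):
        tokens[token] += 1
        chars.update(token)  # counts each character of the token
        labels[label] += 1

    def index(counter, prefix=()):
        # Special symbols first, then items in sorted order for determinism.
        items = list(prefix) + [x for x in sorted(counter) if x not in prefix]
        return {item: i for i, item in enumerate(items)}

    return index(tokens, specials), index(chars, specials), index(labels)


def to_tsv(vocab):
    """Serialize a vocab as `<item>TAB<index>` lines, ordered by index."""
    rows = sorted(vocab.items(), key=lambda kv: kv[1])
    return "".join("{}\t{}\n".format(item, idx) for item, idx in rows)
```

Passing the lines of all subsets to `build_vocabs` at once mirrors the merged-data-set setup mentioned in the note above.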
Train LSTM-CNN
python train_lstmcnn_all.py -d 0 -i <input_dir> -o <output_dir> -e <embedding_file> \
    --embed_vocab <embedding_vocab_file> --char_dim 50 --seed <random_seed>
This script trains a model for each subset (which can be specified with the --datasets argument) and reports within-subset (within-genre) and cross-subset (cross-genre) performance.
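The within-subset and cross-subset evaluation described above amounts to a source-by-target grid of scores. The sketch below shows that grid in isolation. It is an illustration under assumptions, not the script's actual logic: `train_fn` and `eval_fn` are hypothetical callables supplied by the caller, and details such as model selection on the development set are left out.

```python
def cross_subset_scores(subsets, train_fn, eval_fn):
    """Train one model per source subset and score it on every target subset.

    train_fn(source) -> model and eval_fn(model, target) -> score are
    placeholders for whatever training and evaluation code is used.
    """
    scores = {}
    for source in subsets:
        model = train_fn(source)
        for target in subsets:
            scores[(source, target)] = eval_fn(model, target)
    return scores


def split_scores(scores):
    """Separate within-subset (source == target) from cross-subset scores."""
    within = {s: v for (s, t), v in scores.items() if s == t}
    cross = {(s, t): v for (s, t), v in scores.items() if s != t}
    return within, cross
```

With six subsets this produces 6 within-genre scores and 30 cross-genre scores.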
Train LSTM-CNN with Dynamic Feature Composition
python train_lstmcnn_dfc_all.py -d 0 -i <input_dir> -o <output_dir> -e <embedding_file> \
    --embed_vocab <embedding_vocab_file> --embed_count <embedding_freq_file> --char_dim 50 --seed <random_seed>
Requirements
- Python 3.5+
- PyTorch 1.0
Resources
- We use the 100d case-sensitive word embedding in Pre-trained Word Embeddings
Reference
Lin, Y., Liu, L., Ji, H., Yu, D., Han, J. (2019) Reliability-aware Dynamic Feature Composition for Name Tagging. Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics.
@inproceedings{lin2019reliability,
title={Reliability-aware Dynamic Feature Composition for Name Tagging},
author={Lin, Ying and Liu, Liyuan and Ji, Heng and Yu, Dong and Han, Jiawei},
booktitle={Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics (ACL2019)},
year={2019}
}