neural_name_tagging
Code for "Reliability-aware Dynamic Feature Composition for Name Tagging" (ACL2019)
Dynamic Feature Composition for Name Tagging
Code for our ACL2019 paper Reliability-aware Dynamic Feature Composition for Name Tagging.
Input Data Set Directory Structure
- <input_dir>
  - embed.vocab.tsv (embedding vocab file; 1st column: token, 2nd column: index)
  - embed.count.tsv (embedding token frequency file; 1st column: token, 2nd column: frequency)
  - bc
    - train.tsv (training set)
    - dev.tsv (development set)
    - test.tsv (test set)
    - token.vocab.tsv (token vocab file; 1st column: token, 2nd column: index)
    - char.vocab.tsv (character vocab file; 1st column: character, 2nd column: index)
    - label.vocab.tsv (label vocab file; 1st column: label, 2nd column: index)
  - bn
  - mz
  - nw
  - tc
  - wb
 
Note:
- The other subsets (bn, mz, nw, tc, wb) contain the same files as bc in their directories: train.tsv, dev.tsv, test.tsv, token.vocab.tsv, char.vocab.tsv, and label.vocab.tsv.
- In our experiments, we generated the *.vocab.tsv files from a merged data set of all subsets.
- In our experiments, we use CoNLL format files generated from OntoNotes 5.0 with Pradhan et al.'s scripts, which can be found at https://cemantix.org/data/ontonotes.html.
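The vocab files described above are two-column files (item, then index). As a minimal sketch of how such a file could be read into a dictionary, assuming tab-separated columns and no header row; `load_vocab` is a hypothetical helper for illustration, not a function from this repository:

```python
def load_vocab(path):
    """Read a two-column TSV (item, integer) into a dict mapping item -> int."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Assumption: columns are separated by a single tab character
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vocab[parts[0]] = int(parts[1])
    return vocab
```

Under the same assumptions, the helper would also read embed.count.tsv, since its second column (frequency) is an integer as well.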
Pre-processing
The following functions in proprocess.py can be used to create vocab and frequency files.
- build_all_vocabs takes a list of CoNLL format files as input and generates {token,char,label}.vocab.tsv in output_dir.
- build_embed_vocab takes a pre-trained embedding file as input and returns the embedding vocab.
- build_embed_token_count takes a pre-trained embedding file as input and generates an embedding token frequency file.
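To illustrate what building the token, character, and label vocabs involves, here is a rough sketch. It is not the repository's implementation: `build_vocabs_sketch` and `write_vocab` are hypothetical names, and the sketch assumes one token per line, whitespace-separated columns, the token in the first column, the label in the last column, and blank lines between sentences. It also omits any special entries (such as padding or unknown-token symbols) that the actual functions may add.

```python
from collections import Counter


def build_vocabs_sketch(conll_paths, token_col=0, label_col=-1):
    """Collect token, character, and label vocabs from CoNLL-style files.

    The column positions are assumptions; adjust them to the actual file layout.
    """
    tokens, chars, labels = Counter(), Counter(), Counter()
    for path in conll_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.split()
                if not cols:
                    continue  # blank line marks a sentence boundary
                token = cols[token_col]
                tokens[token] += 1
                chars.update(token)  # counts each character of the token
                labels[cols[label_col]] += 1

    def index(counter):
        # Sort items so the item -> index mapping is deterministic
        return {item: i for i, item in enumerate(sorted(counter))}

    return index(tokens), index(chars), index(labels)


def write_vocab(vocab, path):
    """Write a vocab as a two-column TSV: item, then index."""
    with open(path, "w", encoding="utf-8") as f:
        for item, idx in vocab.items():
            f.write("{}\t{}\n".format(item, idx))
```

Passing the files of all subsets in one call mirrors the note above about generating the vocab files from a merged data set.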
Train LSTM-CNN
python train_lstmcnn_all.py -d 0 -i <input_dir> -o <output_dir> -e <embedding_file> \
  --embed_vocab <embedding_vocab_file> --char_dim 50 --seed <random_seed>
This script trains a model for each subset (the subsets can be specified with the --datasets argument) and reports within-subset (within-genre) and cross-subset (cross-genre) performance.
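The distinction between within-genre and cross-genre evaluation can be sketched as follows. This only illustrates the terminology, assuming every trained model is scored on the test set of every subset; it is not taken from the training script, and `evaluation_pairs` is a hypothetical name.

```python
from itertools import product

# Subset (genre) directory names from the input directory structure
SUBSETS = ["bc", "bn", "mz", "nw", "tc", "wb"]


def evaluation_pairs(subsets):
    """Yield (train_subset, test_subset, setting) for every combination."""
    for train, test in product(subsets, repeat=2):
        # Same subset for training and testing: within-genre; otherwise cross-genre
        setting = "within-genre" if train == test else "cross-genre"
        yield train, test, setting


for train, test, setting in evaluation_pairs(SUBSETS):
    print("{} -> {}: {}".format(train, test, setting))
```

With six subsets, this gives 6 within-genre pairs and 30 cross-genre pairs.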
Train LSTM-CNN with Dynamic Feature Composition
python train_lstmcnn_dfc_all.py -d 0 -i <input_dir> -o <output_dir> -e <embedding_file> \
  --embed_vocab <embedding_vocab_file> --embed_count <embedding_freq_file> --char_dim 50 --seed <random_seed>
Requirements
- Python 3.5+
- PyTorch 1.0
Resources
- We use the 100d case-sensitive word embedding in Pre-trained Word Embeddings
Reference
Lin, Y., Liu, L., Ji, H., Yu, D., Han, J. (2019) Reliability-aware Dynamic Feature Composition for Name Tagging. Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics.
@inproceedings{lin2019reliability,
  title={Reliability-aware Dynamic Feature Composition for Name Tagging},
  author={Lin, Ying and Liu, Liyuan and Ji, Heng and Yu, Dong and Han, Jiawei},
  booktitle={Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics (ACL2019)},
  year={2019}
}