emnlp2017-bilstm-cnn-crf
Train data description for NER training, how to train NER model
It looks like you are not using POS information when training the NER model. Can you share the metadata for each column in the dataset? There is also no description of how to train an NER model. Can you please update the README? Thanks a ton.
No, I didn't use the POS information. I would also recommend against using it, because at inference time you would need a POS tagger before you could detect named entities in a sentence. Further, adding POS information does not improve the performance of the classifier.
Training on CoNLL 2003 NER is rather straightforward. Due to copyright issues I sadly cannot share the dataset, but here are the steps:
- Convert the strange IOB encoding in the original CoNLL 2003 dataset to a BIO encoding. See issue #22 for why this is needed. If you need a script to convert from IOB to BIO, let me know.
- The files for CoNLL 2003 NER contain lines that start with -DOCSTART- => remove these lines; they are metadata from the dataset indicating that a new document starts.
- Training is similar to Train_Chunking.py. Only change the dataset description (a sketch of the full script follows after the configuration below):
datasets = {
    'conll2003_ner':
        {'columns': {0: 'tokens', 3: 'NER_BIO'},
         'label': 'NER_BIO',
         'evaluate': True,
         'commentSymbol': None}
}
Thanks.
Can you please share the script for converting IOB to BIO encoding?
Sure, here is the code. It assumes that train.txt, dev.txt and test.txt are in the current folder, each in IOB encoding. It then creates train.txt.bio, dev.txt.bio and test.txt.bio.
"""
Converts the IOB encoding from CoNLL 2003 to BIO encoding
"""
filenames = ['train.txt', 'dev.txt', 'test.txt']
for filename in filenames:
fOut = open(filename+'.bio', 'w')
fIn = open(filename, 'r')
for line in fIn:
if line.startswith('-DOCSTART-'):
lastChunk = 'O'
lastNER = 'O'
continue
if len(line.strip()) == 0:
lastChunk = 'O'
lastNER = 'O'
fOut.write("\n")
continue
splits = line.strip().split()
chunk = splits[2]
ner = splits[3]
if chunk[0] == 'I':
if chunk[1:] != lastChunk[1:]:
chunk = 'B'+chunk[1:]
if ner[0] == 'I':
if ner[1:] != lastNER[1:]:
ner = 'B'+ner[1:]
splits[2] = chunk
splits[3] = ner
fOut.write("\t".join(splits))
fOut.write("\n")
lastChunk = chunk
lastNER = ner
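As a sanity check, this is the effect of the conversion on a sentence like the first one in the training file (columns: token, POS, chunk, NER); only segment-initial I- tags change:

IOB (input):
EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

BIO (output):
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Note that "call" keeps I-NP because it continues the noun phrase started by "German", while every token that starts a new chunk or entity now carries a B- tag.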