emnlp2017-bilstm-cnn-crf

Train data description for NER training, how to train NER model

Open pramod2157 opened this issue 6 years ago • 4 comments

It looks like you are not using POS information when training the NER model. Can you share the metadata for each column in the dataset? There is also no description of how to train an NER model. Can you please update the README? Thanks a ton.

pramod2157 avatar Jul 25 '18 05:07 pramod2157

No, I didn't use the POS information. I would also recommend against using it, because at inference time you would then need a POS tagger before you could detect the named entities in a sentence. Further, adding POS information does not improve the performance of the classifier.

Training on CoNLL 2003 NER is rather straightforward. Due to copyright issues I sadly cannot share the dataset, but here are the steps:

  1. Convert the strange IOB encoding in the original CoNLL 2003 dataset to a BIO encoding. See issue #22 for why this is needed. If you need a script to convert from IOB to BIO, let me know.
  2. The files for CoNLL 2003 NER contain lines that start with -DOCSTART-. Remove these lines; they are metadata from the dataset indicating that a new document starts.
  3. Training works the same way as in Train_Chunking.py. Only the dataset description changes:
datasets = {
    'conll2003_ner':                            
        {'columns': {0:'tokens', 3:'NER_BIO'},   
         'label': 'NER_BIO',                     
         'evaluate': True,                   
         'commentSymbol': None}              
}

nreimers avatar Jul 25 '18 07:07 nreimers
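
On the column question: each token line in the CoNLL 2003 files has four whitespace-separated columns (token, POS tag, chunk tag, NER tag), so the mapping {0:'tokens', 3:'NER_BIO'} selects the first and the last column and skips POS and chunk tags. The sketch below only illustrates that selection on a made-up sample. `read_columns` is a hypothetical helper written for this illustration, not the reader used by the repository.

```python
# Made-up sample in the CoNLL 2003 layout: token, POS tag, chunk tag, NER tag.
# Sentences are separated by an empty line.
SAMPLE = """\
John NNP B-NP B-PER
lives VBZ B-VP O
in IN B-PP O
Berlin NNP B-NP B-LOC

"""

# Same column mapping as in the dataset description above.
COLUMNS = {0: 'tokens', 3: 'NER_BIO'}


def read_columns(text, columns):
    """Hypothetical helper: collect the selected columns sentence by sentence."""
    sentences = []
    current = {name: [] for name in columns.values()}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line closes the current sentence (if it has any tokens).
            if any(current.values()):
                sentences.append(current)
                current = {name: [] for name in columns.values()}
            continue
        fields = line.split()
        for idx, name in columns.items():
            current[name].append(fields[idx])
    # Keep a trailing sentence that is not followed by an empty line.
    if any(current.values()):
        sentences.append(current)
    return sentences


sentences = read_columns(SAMPLE, COLUMNS)
print(sentences[0]['tokens'])
print(sentences[0]['NER_BIO'])
```

The POS and chunk columns (indices 1 and 2) are simply never read, which matches the recommendation above not to depend on a POS tagger at inference time.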

Thanks.

pramod2157 avatar Jul 25 '18 12:07 pramod2157

Can you please share the script for converting IOB to BIO encoding?

pramod2157 avatar Jul 26 '18 08:07 pramod2157

Sure, here is the code. It assumes that the folder contains train.txt, dev.txt and test.txt in IOB encoding. It then creates train.txt.bio ...

"""
Converts the IOB encoding from CoNLL 2003 to BIO encoding
"""


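filenames = ['train.txt', 'dev.txt', 'test.txt']

for filename in filenames:
    # The context manager closes both files, even if a line fails to parse
    with open(filename, 'r') as fIn, open(filename + '.bio', 'w') as fOut:
        # Start outside of any chunk/entity, in case the file does not
        # begin with a -DOCSTART- line or an empty line
        lastChunk = 'O'
        lastNER = 'O'

        for line in fIn:
            # Document boundary: drop the line and reset the tag state
            if line.startswith('-DOCSTART-'):
                lastChunk = 'O'
                lastNER = 'O'
                continue

            # Sentence boundary: keep the empty line and reset the tag state
            if len(line.strip()) == 0:
                lastChunk = 'O'
                lastNER = 'O'
                fOut.write("\n")
                continue

            # Columns: token, POS tag, chunk tag, NER tag
            splits = line.strip().split()

            chunk = splits[2]
            ner = splits[3]

            # An I- tag whose type differs from the previous tag
            # starts a new span, so it becomes a B- tag
            if chunk[0] == 'I' and chunk[1:] != lastChunk[1:]:
                chunk = 'B' + chunk[1:]

            if ner[0] == 'I' and ner[1:] != lastNER[1:]:
                ner = 'B' + ner[1:]

            splits[2] = chunk
            splits[3] = ner

            fOut.write("\t".join(splits))
            fOut.write("\n")

            lastChunk = chunk
            lastNER = ner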

nreimers avatar Jul 26 '18 09:07 nreimers
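
To see what the conversion rule in the script above does to individual tags, here is a minimal sketch that applies the same comparison to a plain list of tags instead of files. The function name `iob_to_bio` and the example tag sequence are made up for this illustration. The rule itself is the one from the script: an I- tag becomes a B- tag whenever its type differs from the type of the previous tag.

```python
def iob_to_bio(tags):
    """Apply the script's rule to one sentence worth of tags (illustrative helper)."""
    bio = []
    last = 'O'  # a sentence starts outside of any span
    for tag in tags:
        # 'I-PER'[1:] is '-PER' and 'O'[1:] is '', so an I- tag that follows
        # 'O' or a span of another type is rewritten to a B- tag.
        if tag.startswith('I') and tag[1:] != last[1:]:
            tag = 'B' + tag[1:]
        bio.append(tag)
        last = tag
    return bio


# Made-up IOB tag sequence for one sentence.
example = ['I-PER', 'I-PER', 'O', 'I-LOC', 'B-LOC', 'I-ORG']
print(iob_to_bio(example))
# → ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-LOC', 'B-ORG']
```

Tags that already start with B- are left unchanged, so running the conversion on data that is already in BIO encoding should be harmless.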