POS-Tagging-BiLSTM-CRF

A TensorFlow 2/Keras implementation of POS tagging using a Bidirectional LSTM-CRF on the Penn Treebank corpus (WSJ)

Table of contents

  1. Introduction
  2. Training procedure
  3. Experiments
    • data
    • features
      • word embedding
    • results
  4. References
  5. Usage

Bidirectional LSTM-CRF model for Sequence Tagging

A TensorFlow 2/Keras implementation of the POS tagging task using a Bidirectional Long Short-Term Memory network (BiLSTM) with a Conditional Random Field (CRF) on top of the BiLSTM layer (at the inference layer) to predict the most likely POS tags. My work is not the first to apply a BiLSTM-CRF model to NLP sequence tagging benchmarks, but it comes close to state-of-the-art results on POS tagging, and the same architecture also applies to NER. Experiments on the Penn Treebank POS tagging corpus (approximately 1 million tokens from the Wall Street Journal) show that the model reaches 98.93% word-level accuracy, at or near the state of the art.
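
The architecture can be sketched roughly as below. This is a minimal illustration rather than the exact code in this repository; it assumes TensorFlow Addons for the CRF log-likelihood and Viterbi-decode ops, and VOCAB_SIZE is a placeholder.

    # Minimal sketch of a BiLSTM-CRF tagger (illustrative, not the repository's exact code).
    # Assumes tensorflow-addons (tfa) for the CRF ops; VOCAB_SIZE is a placeholder.
    import tensorflow as tf
    import tensorflow_addons as tfa

    EMBED_DIM, NUM_TAGS, VOCAB_SIZE = 100, 45, 20000

    class BiLSTMCRF(tf.keras.Model):
        def __init__(self):
            super().__init__()
            self.embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)
            self.bilstm = tf.keras.layers.Bidirectional(
                tf.keras.layers.LSTM(100, return_sequences=True, recurrent_dropout=0.01))
            self.dense = tf.keras.layers.Dense(NUM_TAGS)      # per-token tag scores (unary potentials)
            self.transitions = self.add_weight(               # CRF transition matrix, learned jointly
                name="transitions", shape=(NUM_TAGS, NUM_TAGS))

        def call(self, word_ids):
            x = self.embedding(word_ids)
            x = self.bilstm(x)
            return self.dense(x)

        def crf_loss(self, potentials, tags, lengths):
            # negative CRF log-likelihood over the whole tag sequence
            log_likelihood, _ = tfa.text.crf_log_likelihood(
                potentials, tags, lengths, self.transitions)
            return -tf.reduce_mean(log_likelihood)

        def decode(self, potentials, lengths):
            # Viterbi decoding of the most likely tag sequence
            tags, _ = tfa.text.crf_decode(potentials, self.transitions, lengths)
            return tags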

Training procedure

  • batch size: 256
  • learning rate: 0.001
  • number of epochs: 3
  • max length (the number of timesteps): 141
  • embedding size: 100
  • number of tags: 45
  • hidden BiLSTM layer: 1
  • hidden BiLSTM units: 100
  • recurrent dropout: 0.01
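
A hypothetical training setup that plugs in these hyperparameters (Adam at learning rate 0.001, batch size 256, 3 epochs) might look like the sketch below; train_dataset is a placeholder tf.data pipeline, not part of this repository.

    # Hypothetical training loop using the hyperparameters above; train_dataset is a
    # placeholder tf.data.Dataset yielding (word_ids, tag_ids, sequence_lengths).
    model = BiLSTMCRF()
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

    for epoch in range(3):
        for word_ids, tag_ids, lengths in train_dataset.batch(256):
            with tf.GradientTape() as tape:
                potentials = model(word_ids)
                loss = model.crf_loss(potentials, tag_ids, lengths)
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))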

Experiments

Data

I test the BiLSTM-CRF network on the Penn Treebank (POS tagging task). The table below shows the number of sentences, tokens, and labels in the training, validation, and test sets.

PTB POS

              sentence #   token #
  training        33,458   798,284
  validation       6,374   151,744
  test             1,346    32,853
  label #             45

Features

Word Embedding

For word representation, I used pretrained GloVe embeddings, in which each word corresponds to a 100-dimensional vector.
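
As an illustration (not the exact loading code in this repository), the GloVe vectors can be read into an embedding matrix indexed by the vocabulary; the file name and the word_index mapping below are assumptions.

    # Hypothetical sketch: build a 100-d embedding matrix from GloVe vectors.
    # The file path and the word_index (token -> integer id) dict are assumptions.
    import numpy as np

    embeddings_index = {}
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

    embedding_matrix = np.zeros((VOCAB_SIZE, 100))
    for word, idx in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[idx] = vector   # words without a GloVe vector stay zero-initialized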

Results

First, I set the batch size to 64; the model started overfitting at epoch 2. With a batch size of 128, overfitting appeared at epoch 3. Finally, with a batch size of 256, the model reached its highest word-level accuracy: 98.93%.

References

My implementation is based on the following paper:

Huang, Zhiheng, Wei Xu, and Kai Yu. "Bidirectional LSTM-CRF Models for Sequence Tagging." arXiv preprint arXiv:1508.01991 (2015).

Usage

Requirements

  • TensorFlow 2/Keras

  • Numpy

  • JSON

  • NLTK

  • argparse

      $ pip install -r requirements.txt
    

Training and Evaluating

    $ python train.py

Output:

    Viterbi accuracy: 98.93%
    

Graphs: accuracy and loss curves.

Testing

    $ python test.py --sent "My heart is always breaking for the ghosts that haunt this room."

Output:

    [('My', 'prp$'), ('heart', 'nn'), ('is', 'vbz'),
    ('always', 'rb'), ('breaking', 'vbg'), ('for', 'in'),
    ('the', 'dt'), ('ghosts', 'nns'), ('that', 'wdt'),
    ('haunt', 'vbp'), ('this', 'dt'),
    ('room', 'nn'), ('.', '.')]
    
    

Note: the standard dataset for POS tagging is the Wall Street Journal (WSJ) portion of the Penn Treebank, which uses 45 different POS tags. In the standard split, sections 0-18 are used for training, sections 19-21 for development, and sections 22-24 for testing, and models are evaluated by accuracy. I only have access to sections 2-21, so I took 16% of them as the development set, trained on the rest, and tested on section 24. This differs slightly from the standard split, but I believe the model would still match or nearly match state-of-the-art results on the POS tagging task. The dataset is not public; contact me via email if you want to use it.