crfchunking-with-wordrepresentations icon indicating copy to clipboard operation
crfchunking-with-wordrepresentations copied to clipboard

Train a CRF for syntactic chunking (CoNLL2000), and use word representations

CRF chunking with word representations

scripts and steps by Joseph Turian

A standard baseline in NLP is chunking (shallow parsing), which was the CoNLL 2000 shared task. Training a CRF is a standard approach to this task (Sha + Pereira 2003). In fact, many CRF implementations include instructions for reimplementing the Sha+Pereira chunker with identical features: crfsgd: http://leon.bottou.org/projects/sgd crf++: http://crfpp.sourceforge.net/ CRFsuite: http://www.chokkan.org/software/crfsuite/

We use CRFsuite because it makes it simple to modify the feature generation code, so one can easily add new features.

We have instructions and scripts for how we add word representations (Brown clusters and/or word embeddings) to the training.

INSTALLATION:

Download and install CRFsuite: http://www.chokkan.org/software/crfsuite/

You will need my common Python library: http://github.com/turian/common

Go into data/ and download the CoNLL train and test files: cd data/ wget http://www.cnts.ua.ac.be/conll2000/chunking/train.txt.gz wget http://www.cnts.ua.ac.be/conll2000/chunking/test.txt.gz gunzip *.gz

Download word representations: cd representations/

wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c100-freq1.txt
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c320-freq1.txt
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.txt
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c3200-freq1.txt

wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-1750000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d200.txt.gz
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2030000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d100.txt.gz
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2270000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.txt.gz
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2280000000.LEARNING_RATE%3d1e-08.EMBEDDING_LEARNING_RATE%3d1e-07.EMBEDDING_SIZE%3d25.txt.gz
ln -s model-1750000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=200.txt.gz cw-embeddings-200dim.txt.gz
ln -s model-2030000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.EMBEDDING_SIZE\=100.txt.gz cw-embeddings-100dim.txt.gz 
ln -s model-2270000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.txt.gz cw-embeddings-50dim.txt.gz
ln -s model-2280000000.LEARNING_RATE\=1e-08.EMBEDDING_LEARNING_RATE\=1e-07.EMBEDDING_SIZE\=25.txt.gz cw-embeddings-25dim.txt.gz

wget http://pylearn.org/turian/hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
wget http://pylearn.org/turian/hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
ln -s hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-100dim.txt.gz
ln -s hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-50dim.txt.gz 

BATCH EVALUATIONS

./scripts/train-and-evaluate.py -name baseline --dev --l2 2

WARNING: Everytime you change the --features parameter, you should also change the --name.

NOTE:

CRFsuite has benchmark results on the CoNLL shared task: http://www.chokkan.org/software/crfsuite/benchmark.html

However, I did not achievable achieve comparable F1 score on the CoNLL test set until I used the following parameters:

Dev F1 Test F1 params 94.04 93.63 l2=2 94.03 93.65 l2=3.2, possible_transitions=1 94.15 93.73 l2=3.2, possible_transitions=1, possible_states=1 94.16 93.79 SGD, l2=3.2, possible_transitions=1, possible_states=1

I chose the l2 penalty on the dev set, which was a subset of the training data. I then used this l2 penalty and trained over the entire training set.