emnlp2017-bilstm-cnn-crf

Wrong transitions in CRF when doing a sequence labeling task

Open SefaZeng opened this issue 6 years ago • 4 comments

I use ChainCRF.py as the CRF layer in my model for a sequence labeling task with OBIE tags, but I ran into a problem: there are some unexpected transitions in the predictions, like E to I, that don't show up in the training data. The Keras version is 2.2.2 and TensorFlow is 1.10.0. The code:

import numpy as np
from keras.preprocessing import text, sequence
from keras.layers import *
from keras.models import *
from keras.callbacks import EarlyStopping, ModelCheckpoint
from ChainCRF import ChainCRF
from keras import backend as K

def Bilstm_CNN_Crf(maxlen, nb_words, class_label_count, embedding_weights=None, is_train=True):
    word_input = Input(shape=(maxlen,), dtype='int32', name='word_input')
    word_emb = Embedding(nb_words + 1, output_dim=100,
                         input_length=maxlen,
                         embeddings_initializer='uniform',
                         name='word_emb')(word_input)
    # bilstm
    bilstm = Bidirectional(LSTM(64, return_sequences=True))(word_emb)
    bilstm_d = Dropout(0.1)(bilstm)

    # cnn
    half_window_size = 2
    padding_layer = ZeroPadding1D(padding=half_window_size)(word_emb)
    conv = Conv1D(filters=50, kernel_size=2 * half_window_size + 1,
                  padding='valid')(padding_layer)  # Keras 2 argument names (not nb_filter/filter_length)
    conv_d = Dropout(0.1)(conv)
    dense_conv = TimeDistributed(Dense(50))(conv_d)

    # merge
    rnn_cnn_merge = concatenate([bilstm_d, dense_conv])
    dense = TimeDistributed(Dense(class_label_count))(rnn_cnn_merge)

    # crf
    crf = ChainCRF(name='CRF_Layer')
    crf_output = crf(dense)

    # build model
    model = Model(inputs=[word_input], outputs=[crf_output])

    model.compile(loss=crf.loss, optimizer='adam', metrics=['accuracy'])

    # model.summary()

    return model

model = Bilstm_CNN_Crf(maxlen, nb_words, 5)
earlystop = EarlyStopping(monitor='val_acc', patience=2, verbose=1)
checkpoint = ModelCheckpoint('best_model.hdf5', monitor='val_acc', verbose=1,
                             save_best_only=True, period=1, save_weights_only=True)
# note: validation_data here is the training data itself
model.fit(x_train_1, y, epochs=epochs, batch_size=64, verbose=1,
          validation_data=(x_train_1, y), callbacks=[earlystop, checkpoint])
model.load_weights('best_model.hdf5')
pred_prob = model.predict(x_train_1)
pred = np.argmax(pred_prob, axis=2)

Is there something wrong with the model? Or is there some bad case that I didn't find in the data? Any help is appreciated! Thx!

SefaZeng avatar Oct 25 '18 05:10 SefaZeng

Hi @SefaZeng This issue also happens with my code: invalid transitions (e.g. O → I-PER) are produced by the BiLSTM-CRF model.

The issue is sadly not trivial and I don't know how to fix it.

The CRF is initialized with random probabilities for the transitions, i.e. O → I-PER can be as likely as O → B-PER. Of course, the CRF does not know anything about the encoding or about which transitions are allowed.

During training, these transition probabilities are updated, so the CRF learns that O → I-PER is unlikely. However, it converges rather slowly towards a probability of 0. This makes sense: how should the CRF distinguish between 'O → I-PER is not possible at all' and 'it is rare, but I haven't seen enough data'?

With more epochs, the number of invalid tags usually converges to a low number, or in my experiments even to zero.
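To watch this convergence, you can print the learned transition scores after training. A rough sketch (the helper is my own illustration; it assumes the transition matrix is stored in the layer attribute U, as in this repository's ChainCRF.py, and that the layer is named 'CRF_Layer' as in the snippet above):

from keras import backend as K

# Illustrative helper: print the learned transition scores so you can check
# how unlikely transitions such as O -> I-PER have become during training.
# `idx2tag` maps tag indices to tag strings.
def print_transitions(model, idx2tag):
    crf_layer = model.get_layer('CRF_Layer')
    U = K.get_value(crf_layer.U)  # (n_tags, n_tags) matrix of transition scores
    for i in range(U.shape[0]):
        for j in range(U.shape[1]):
            print('%s -> %s: %.3f' % (idx2tag[i], idx2tag[j], U[i, j]))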

As a solution, I use a post-processing step: the code checks whether the tags from the CRF form a valid BIO encoding. If it finds an invalid tag, it sets this tag to O.
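A minimal sketch of that post-processing step (fix_bio_tags is my own illustrative name, not the repository's actual code):

def fix_bio_tags(tags):
    # Replace tags that violate the BIO scheme with 'O'.
    # An I-X tag is only valid directly after B-X or I-X of the same type.
    fixed = []
    prev = 'O'
    for tag in tags:
        if tag.startswith('I-') and prev not in ('B-' + tag[2:], 'I-' + tag[2:]):
            tag = 'O'
        fixed.append(tag)
        prev = tag
    return fixed

print(fix_bio_tags(['O', 'I-PER', 'B-PER', 'I-PER']))
# ['O', 'O', 'B-PER', 'I-PER']  -- the invalid O -> I-PER is repaired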

nreimers avatar Oct 25 '18 08:10 nreimers


Can I set the initial states to zero to avoid this problem?

SefaZeng avatar Oct 25 '18 08:10 SefaZeng

@SefaZeng I think that could work; however, you would need to make sure you get the mapping right. Especially when the number of tags changes (e.g. you add B-LOC and I-LOC to your tagset), you must ensure that you set the zeros in the right places. Otherwise it can easily happen that B-LOC → I-LOC is initialized with a zero probability.

Further, the CRF is bidirectional, i.e. not only the previous label but also the next label determines which label is produced. This can make it rather complicated to initialize the CRF correctly.
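Note also that the transition parameters are unnormalized log-scores: a value of 0 contributes a factor of exp(0) = 1 and is therefore 'neutral', not 'impossible'. To truly forbid a transition, its score has to be a very large negative number. A rough sketch of such a constraint, applied after training and before decoding (mask_invalid_transitions and is_valid_transition are my own illustrative names; it again assumes the transition matrix is the layer's U weight):

from keras import backend as K

# Illustrative sketch: overwrite the scores of transitions that are invalid
# under your tag scheme with a large negative value, so that Viterbi decoding
# can never select them. `is_valid_transition(from_tag, to_tag)` is a function
# you would define for your scheme (BIO, OBIE, ...).
def mask_invalid_transitions(model, idx2tag, is_valid_transition):
    crf_layer = model.get_layer('CRF_Layer')
    U = K.get_value(crf_layer.U)
    for i in range(U.shape[0]):
        for j in range(U.shape[1]):
            if not is_valid_transition(idx2tag[i], idx2tag[j]):
                U[i, j] = -1e6  # exp(-1e6) is effectively a zero probability
    K.set_value(crf_layer.U, U)

Called after model.load_weights(...) and before model.predict(...), this rules the invalid paths out at decoding time instead of waiting for training to push their scores down. A complete version would also constrain b_start for tags that may not begin a sentence.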

nreimers avatar Oct 25 '18 08:10 nreimers

@nreimers Emmm.. I set the initializers of U, b_start, b_end and the initial state in viterbi_decode to zeros, but it doesn't work. Maybe post-processing is the only way. But I am still confused about why this happens. From a statistical point of view, if the invalid transitions never appear in the data, the probabilities, or the corresponding weights in the neural network, should be very low or even zero.

SefaZeng avatar Oct 25 '18 08:10 SefaZeng