
Chinese-sentence-pair-modeling

This repository contains the following models for sentence pair modeling: BiLSTM (max-pooling), BiGRU (element-wise product), BiLSTM (self-attention), ABCNN, RE2, ESIM, BiMPM, Siamese BERT, BERT, RoBERTa, XLNet, DistilBERT and ALBERT. All the code is based on PyTorch. Running the "ipynb" files in Google Colab is recommended, since it provides free GPU resources.

1. Datasets

I conduct experiments on 5 Chinese datasets: 3 paraphrase identification datasets and 2 natural language inference datasets. Tables below give a brief comparison of these datasets.

Note: the BQ Corpus dataset requires you to submit an application form, which can be downloaded from http://icrc.hitsz.edu.cn/Article/show/175.html . The CMNLI dataset is too large to be included here; you can download it from https://storage.googleapis.com/cluebenchmark/tasks/cmnli_public.zip . Because the categories of ChineseSTS are unbalanced (labels 1, 2, 3 and 4 account for only a small percentage), I drop these labels and convert the dataset into a binary classification task. In addition, the OCNLI and CMNLI datasets are preprocessed by removing the examples with missing labels.
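The two preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the code used in the notebooks. It assumes that each example is a (sentence1, sentence2, label) tuple, that ChineseSTS labels range from 0 to 5 so that only 0 and 5 remain after dropping 1 to 4, and that a missing NLI label is any value outside the set of valid labels. The function names are hypothetical.

```python
from typing import Hashable, List, Set, Tuple

Example = Tuple[str, str, Hashable]


def binarize_chinese_sts(rows: List[Example]) -> List[Tuple[str, str, int]]:
    """Drop the rare labels 1-4 and map the remaining labels 0/5 to 0/1.

    Assumption: ChineseSTS labels are integers from 0 to 5.
    """
    kept = []
    for s1, s2, label in rows:
        if label == 0:
            kept.append((s1, s2, 0))
        elif label == 5:
            kept.append((s1, s2, 1))
        # Any other label (1, 2, 3, 4) is skipped.
    return kept


def drop_missing_labels(rows: List[Example], valid_labels: Set[Hashable]) -> List[Example]:
    """Keep only the examples whose label is one of the valid labels."""
    return [row for row in rows if row[2] in valid_labels]


if __name__ == "__main__":
    sts = [("a", "b", 0), ("c", "d", 3), ("e", "f", 5)]
    print(binarize_chinese_sts(sts))

    # Assumption: the label strings below are placeholders for the real ones.
    nli = [("p1", "h1", "entailment"), ("p2", "h2", "-")]
    print(drop_missing_labels(nli, {"entailment", "neutral", "contradiction"}))
```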

2. Implementation Details

After analyzing the sentence length distributions of the 5 datasets, I set max_sequence_length for truncation to 64 on all of them so that the results are directly comparable. In addition, the hidden size is set to 200 in all models that use BiLSTM.
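As an illustration of what a fixed max_sequence_length of 64 means for a single sentence, the sketch below truncates longer sequences and right-pads shorter ones. The helper name and the choice of 0 as the padding id are assumptions made for this example.

```python
from typing import List

MAX_SEQUENCE_LENGTH = 64  # value stated in this section


def pad_or_truncate(token_ids: List[int],
                    max_len: int = MAX_SEQUENCE_LENGTH,
                    pad_id: int = 0) -> List[int]:
    """Return exactly max_len ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


if __name__ == "__main__":
    short = pad_or_truncate([5, 6, 7])
    long_ = pad_or_truncate(list(range(1, 101)))
    print(len(short), len(long_))
```

With every sequence at the same length, a batch can be stacked into a single tensor without further bookkeeping.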

For BiLSTM (max-pooling), BiGRU (element-wise product), BiLSTM (self-attention), ABCNN, RE2, ESIM and BiMPM, I apply character embedding and word embedding separately, tokenizing sentences into characters or words accordingly. The pre-trained character embedding matrix contains 300-dimensional character vectors trained on the Wikipedia_zh corpus (please download it from https://github.com/liuhuanyong/ChineseEmbedding/blob/master/model/token_vec_300.bin), while the word embedding matrix is composed of 300-dimensional word vectors trained on Baidu Encyclopedia (please download it from https://pan.baidu.com/s/1Rn7LtTH0n7SHyHPfjRHbkg).
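The character-level setting can be sketched as below: a sentence is split into single characters and each one is mapped to an index, with a shared index for characters that are not in the vocabulary. The special tokens and function names are assumptions for this example. The word-level setting would additionally need a Chinese word segmenter, which is not shown, and loading the pre-trained vectors is also left out.

```python
from typing import Dict, Iterable, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_char_vocab(sentences: Iterable[str]) -> Dict[str, int]:
    """Assign an index to every character seen in the training sentences."""
    vocab = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for ch in sentence:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode_chars(sentence: str, vocab: Dict[str, int]) -> List[int]:
    """Map each character to its index, or to UNK if it was never seen."""
    return [vocab.get(ch, vocab[UNK]) for ch in sentence]


if __name__ == "__main__":
    vocab = build_char_vocab(["今天天气好", "天气不错"])
    print(encode_chars("今天不冷", vocab))
```

Each index would then select one row of the 300-dimensional embedding matrix.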

For Siamese BERT, BERT, BERT-wwm, RoBERTa, XLNet, DistilBERT and ALBERT, the learning rate is the most important hyperparameter (an inappropriate choice may cause training to diverge), and it is generally chosen between 1e-5 and 1e-4. It should also be chosen together with the batch size: a larger batch size should be paired with a larger learning rate.
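One common way to couple the learning rate with the batch size is a linear scaling heuristic, sketched below. This is an assumption chosen for illustration and is not necessarily how the learning rates in this repository were selected. The reference point (2e-5 at batch size 32) is hypothetical, and the result is clamped to the 1e-5 to 1e-4 range mentioned above.

```python
def scaled_learning_rate(batch_size: int,
                         base_lr: float = 2e-5,      # hypothetical reference value
                         base_batch_size: int = 32,  # hypothetical reference value
                         min_lr: float = 1e-5,
                         max_lr: float = 1e-4) -> float:
    """Scale the learning rate linearly with batch size, then clamp it."""
    lr = base_lr * batch_size / base_batch_size
    return min(max(lr, min_lr), max_lr)


if __name__ == "__main__":
    for bs in (16, 32, 64, 128, 256):
        print(bs, scaled_learning_rate(bs))
```

The value returned here is only a starting point; the final choice should still be validated on the development set.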

3. Experiment results and analysis

The following table shows the test accuracy (%) of different models on 5 datasets:

Model	LCQMC	ChineseSTS	BQ Corpus	OCNLI	CMNLI	Avg.
BiLSTM (max-pooling)-char-pre	74.4	97.5	70.0	60.6	56.7	71.8
BiLSTM (max-pooling)-word-pre	75.2	98.0	68.0	58.0	56.9	71.2
BiLSTM (self-attention)-char-pre	85.0	96.8	79.8	58.5	63.6	76.7
BiLSTM (self-attention)-word-pre	83.7	94.4	79.3	57.8	64.2	75.9
ABCNN-char-pre	79.5	97.2	78.8	53.2	63.2	74.4
ABCNN-word-pre	81.3	97.9	74.4	54.1	59.8	73.5
RE2-char-pre	84.2	98.7	80.4	61.0	68.6	78.6
RE2-word-pre	84.5	98.6	80.1	57.2	65.1	77.1
ESIM-char-pre	83.6	99.0	81.2	64.8	74.0	80.5
ESIM-word-pre	84.0	98.9	81.7	61.3	72.6	79.7
BiMPM-char-pre	83.6	98.9	79.2	63.9	69.7	79.1
BiMPM-word-pre	83.7	98.8	80.3	59.9	69.6	78.5
Siamese BERT	84.8	97.7	83.5	66.8	72.5	81.1
BERT	87.8	98.9	84.2	73.8	80.5	85.0
BERT-wwm	87.4	99.2	84.5	73.8	80.6	85.1
RoBERTa	87.5	99.2	84.6	75.5	80.6	85.5
XLNet	87.4	99.1	84.1	73.6	80.7	85.0
ALBERT	87.4	99.5	82.2	68.1	74.8	82.4
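The Avg. column is the unweighted mean of the five per-dataset accuracies, rounded to one decimal place. The short sketch below recomputes it for two rows of the table; the helper name is hypothetical.

```python
from typing import Sequence


def average_accuracy(scores: Sequence[float]) -> float:
    """Unweighted mean of per-dataset accuracies, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


if __name__ == "__main__":
    # Order: LCQMC, ChineseSTS, BQ Corpus, OCNLI, CMNLI (values from the table).
    esim_char = [83.6, 99.0, 81.2, 64.8, 74.0]
    roberta = [87.5, 99.2, 84.6, 75.5, 80.6]
    print(average_accuracy(esim_char), average_accuracy(roberta))
```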

3.1 Char Embedding vs. Word Embedding

Note that the y-axis shows the accuracy averaged over the 5 test sets. We can see that character embedding performs better than word embedding. This may be because the word embedding matrix is much sparser than the character embedding matrix, so a large share of the word vectors is never updated during training. In addition, the out-of-vocabulary problem occurs more easily with word embedding, which also weakens its performance.
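The size of this gap can be read off the Avg. column of the results table. The sketch below computes the difference between the character-level and word-level averages for each model that appears with both settings in that table.

```python
# Average test accuracy (%) from the results table: (char-pre, word-pre).
AVG_ACCURACY = {
    "BiLSTM (max-pooling)": (71.8, 71.2),
    "BiLSTM (self-attention)": (76.7, 75.9),
    "ABCNN": (74.4, 73.5),
    "RE2": (78.6, 77.1),
    "ESIM": (80.5, 79.7),
    "BiMPM": (79.1, 78.5),
}

# Positive values mean that character embedding is better.
gaps = {name: round(char - word, 1) for name, (char, word) in AVG_ACCURACY.items()}
mean_gap = sum(gaps.values()) / len(gaps)

if __name__ == "__main__":
    for name, gap in gaps.items():
        print(f"{name}: {gap:+.1f}")
    print(f"mean gap: {mean_gap:.2f}")
```

Every model shows a positive gap, ranging from 0.6 to 1.5 points.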

3.2 Comparison of Average Test Accuracy on 5 Datasets

Here character embedding is used for BiLSTM (max-pooling), BiLSTM (self-attention), ABCNN, RE2, ESIM and BiMPM, and the accuracy is averaged over the 5 datasets. RoBERTa achieves the best performance among these models, and BERT-wwm is slightly better than BERT.

3.3 Comprehensive Evaluation of the Models

(P.S. the original papers can be accessed by clicking the hyperlinks)

Model	Accuracy(%)	Number of parameters (millions)	Average training speed (sentence pairs / second)	Average inference speed (sentence pairs / second)
BiLSTM (max-pooling)	71.8	16	1,351	6,250
BiLSTM (self-attention)	76.7	16	1,333	5,882
Siamese BERT	81.1	102	67	256
ABCNN	74.4	13	2,083	7,692
RE2	78.6	16	1,235	4,762
ESIM	80.5	17	1,818	8,333
BiMPM	79.1	13	500	1,099
BERT	85.0	102	149	476
BERT-wwm	85.1	102	147	476
RoBERTa	85.5	102	91	270
XLNet	85.0	117	105	278
ALBERT	82.4	12	91	270
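One possible way to read the table above is to look for models that are not beaten on both accuracy and inference speed by any other model. The sketch below computes that set from the table values. The choice of these two criteria is an assumption made for illustration and is not an analysis from the original experiments.

```python
# (accuracy %, inference speed in sentence pairs / second), from the table above.
MODELS = {
    "BiLSTM (max-pooling)": (71.8, 6250),
    "BiLSTM (self-attention)": (76.7, 5882),
    "Siamese BERT": (81.1, 256),
    "ABCNN": (74.4, 7692),
    "RE2": (78.6, 4762),
    "ESIM": (80.5, 8333),
    "BiMPM": (79.1, 1099),
    "BERT": (85.0, 476),
    "BERT-wwm": (85.1, 476),
    "RoBERTa": (85.5, 270),
    "XLNet": (85.0, 278),
    "ALBERT": (82.4, 270),
}


def dominates(a, b):
    """True if a is at least as good as b on both criteria and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(models):
    """Names of the models that no other model dominates, in table order."""
    return [
        name for name, stats in models.items()
        if not any(dominates(other, stats)
                   for other_name, other in models.items() if other_name != name)
    ]


if __name__ == "__main__":
    print(pareto_front(MODELS))
```

Under these two criteria, ESIM, BERT-wwm and RoBERTa remain: ESIM is the fastest, RoBERTa is the most accurate, and BERT-wwm sits between them.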
