
A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction


Metadata

  • Authors: Shamil Chollampatt and Hwee Tou Ng
  • Organization: National University of Singapore
  • Release Date: 2018 on arXiv
  • Link: https://arxiv.org/pdf/1801.08831.pdf


Summary

This paper is the first to successfully employ a fully convolutional Seq2Seq (Conv Seq2Seq) model (Gehring et al., 2017) for grammatical error correction (GEC), with several improvements: BPE tokenization, pre-training fastText embeddings on a large English corpus, pre-training an auxiliary N-gram language model as a re-scorer, and incorporating task-specific features. The results outperform a strong statistical machine translation baseline and all previous neural approaches on this task, on both the CoNLL-2014 test set (grammaticality) and the JFLEG dataset (fluency). Additionally, attention visualizations of the LSTM-based and Conv-based models show that the LSTM-based model tends to focus on matching source words, whereas the Conv-based model tends to focus on surrounding context words.


Highlights of Technical Detail

  • Pre-training of word embeddings: Initialize the word embeddings for the source and target words with pre-trained fastText embeddings.
  • Subword-level: BPE tokenization is performed on both the parallel corpus (for training the network) and the monolingual corpus (for training the embeddings).
  • Training: The network is trained with Nesterov’s Accelerated Gradient (NAG) using a simplified formulation of Nesterov’s momentum.
  • Decoding and Rescoring: Beam search with a log-linear feature scoring function; feature weights are tuned by minimum error rate training (MERT) on the dev set. Features are edit operation (EO) counts (i.e., the numbers of token-level substitutions, deletions, and insertions between the source and the hypothesis sentence) and 5-gram language model (LM) features (i.e., the LM score of the hypothesis, computed as the sum of the log-probabilities of its words, and the number of words in the hypothesis). A rescoring sketch follows this list.
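
Below is a minimal sketch of this re-scoring step. It assumes the beam candidates, their network scores, and their 5-gram LM log-probabilities are already available; the function names, the Levenshtein-based edit-operation count, and the feature-weight dictionary are illustrative assumptions, not the paper's actual code (which tunes the weights with MERT).

```python
def rescore(source_tokens, candidates, weights):
    """Re-rank beam-search candidates with a log-linear combination of features:
    network score, edit-operation count (EO), LM log-probability, and word count.
    `candidates` is a list of (tokens, model_score, lm_logprob) tuples and
    `weights` maps feature names to their (MERT-tuned) weights."""

    def edit_ops(src, hyp):
        # Token-level Levenshtein distance = substitutions + deletions + insertions.
        d = [[0] * (len(hyp) + 1) for _ in range(len(src) + 1)]
        for i in range(len(src) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(src) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if src[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution / match
        return d[len(src)][len(hyp)]

    def score(cand):
        tokens, model_score, lm_logprob = cand
        return (weights["model"] * model_score
                + weights["eo"] * edit_ops(source_tokens, tokens)
                + weights["lm"] * lm_logprob
                + weights["wc"] * len(tokens))

    # Return the candidates sorted from best to worst log-linear score.
    return sorted(candidates, key=score, reverse=True)
```

For example, `rescore(src, beam, {"model": 1.0, "eo": -0.2, "lm": 0.5, "wc": -0.1})` returns the beam re-ranked by the combined score (the weight values here are arbitrary placeholders).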

Experiment Settings

  • Parallel training data: Lang-8 and NUCLE (5.4K sentence pairs are held out to train the re-scorer), resulting in 1.3M sentence pairs for training the network.
  • Monolingual data for pre-training fastText word embeddings: Wikipedia (1.78B words)
  • Monolingual data for pre-training 5-gram language model: Common Crawl corpus (94B words)
  • Testing data:
    • CoNLL-2014 test set with F_{0.5} score computed using the MaxMatch scorer (see the formula after this list).
    • JFLEG with GLEU score.
  • Model is based on the publicly available PyTorch implementation, FairSeq:
    • Both source and target embeddings are 500-dimensional.
    • Each of source and target vocabularies consists of 30K most frequent BPE tokens from parallel data.
    • FastText embeddings are trained using the skip-gram model with a window size of 5.
    • Character N-grams of size 3 to 6 (both inclusive) are used.
    • The embeddings are not fixed during network training.
    • Both the encoder and decoder consist of 7 convolutional layers with window width 3 (see the sketch after this list).
    • The output of each encoder and decoder layer is 1024-dimensional.
    • Dropout: 0.2.
    • Trained on 3 NVIDIA Titan X GPUs with a batch size of 32 per GPU; validation is performed after every epoch concurrently on another NVIDIA Titan X GPU.
    • Learning rate: 0.25 with an annealing factor of 0.1 and a momentum value of 0.99.
    • Training a single model takes ~18 hours.
    • Beam size for decoding: 12.
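
For reference, the F_{0.5} metric mentioned above combines the precision P and recall R of the system edits (as computed by the MaxMatch scorer), weighting precision twice as much as recall:

```latex
F_{0.5} = \frac{(1 + 0.5^2)\, P \cdot R}{0.5^2 \cdot P + R}
```

And here is a minimal PyTorch sketch of one convolutional layer under the settings above (1024-dimensional layer outputs, window width 3, dropout 0.2), as referenced in the list. This is a simplified rendering with a GLU activation and a scaled residual connection, not the actual FairSeq implementation, which additionally handles source-side padding, causal convolutions in the decoder, and attention.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGLULayer(nn.Module):
    """One encoder-style layer in the ConvS2S family: a 1-D convolution with
    window width 3, a gated linear unit (GLU), and a scaled residual connection."""

    def __init__(self, dim=1024, kernel_width=3, dropout=0.2):
        super().__init__()
        # The convolution doubles the channels so the GLU can gate them back to `dim`.
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_width, padding=kernel_width // 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, seq_len, dim)
        residual = x
        h = self.dropout(x).transpose(1, 2)      # -> (batch, dim, seq_len)
        h = F.glu(self.conv(h), dim=1)           # GLU over the channel dimension
        h = h.transpose(1, 2)                    # -> (batch, seq_len, dim)
        return (h + residual) * math.sqrt(0.5)   # scaled residual, as in ConvS2S

# Example: stacking 7 such layers, as in the paper's encoder.
encoder = nn.Sequential(*[ConvGLULayer() for _ in range(7)])
out = encoder(torch.randn(2, 10, 1024))          # (batch=2, seq_len=10, dim=1024)
```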

Experiment Results

Table 1

Tables 2 & 3

Notes on Table 3:

  • Models evaluated without using pre-trained word embeddings.
  • SLConv: Single-layer convolutional Seq2Seq model
  • MLConv: Multi-layer convolutional Seq2Seq model

Figures

Notes on attention visualization:

  • The attention weights shown for the MLConv model are the attention weights averaged over all decoder layers (see the sketch below).
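
A tiny sketch of that averaging, assuming the per-layer source-target attention matrices are available as a list of tensors (the variable names and shapes are illustrative):

```python
import torch

# attn_per_layer: one (target_len, source_len) attention matrix per decoder layer.
attn_per_layer = [torch.softmax(torch.randn(4, 6), dim=-1) for _ in range(7)]
avg_attn = torch.stack(attn_per_layer).mean(dim=0)  # (target_len, source_len)
```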

Insights

  • Table 3 shows that Bi-LSTM has a higher precision than the MLConv, but a lower recall.
  • Interestingly, the attention visualization may have some correlation with this precision/recall trade-off.
  • It demonstrates the ability of MLConv to capture the surrounding context better, thereby favoring more corrections over copying of the source words (consistent with its higher recall).

Future Work

  • Integration of LSTM-based and Conv-based models.
  • Integration of web-scale LMs during beam-search and the fusion of neural LMs into the network.

References
