Non-Autoregressive Neural Machine Translation
https://arxiv.org/abs/1711.02281
Abstract
Features
- Non-autoregressive (the output tokens have no dependency on one another)
- Parallel outputs
How
- Knowledge distillation
- Input token fertilities
- Policy Gradient
1. Introduction
Recent architectures (CNNs and self-attention networks, i.e. the Transformer) already parallelize training; this paper's model additionally avoids autoregressive decoding
2. Background
2.1. Autoregressive Neural Machine Translation
- The Transformer's masked self-attention enables parallel teacher-forced training and has advantages over convolutions (see the masking sketch below)
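
A minimal NumPy sketch (my own illustration, not code from the paper) of how a causal mask lets all target positions be trained in one parallel pass while each position still only attends to earlier ones:

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal (lower-triangular) mask.

    q, k, v: arrays of shape (T, d). Position t may only attend to
    positions <= t, which makes one parallel teacher-forced pass
    equivalent to sequential left-to-right decoding.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -1e9                              # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

# All T positions are computed in a single parallel call during training:
T, d = 5, 8
x = np.random.randn(T, d)
out = causal_attention(x, x, x)
```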
2.2. Non-Autoregressive Decoding
Problems of beam-search
- suffers from diminishing returns with respect to beam size
- limits search parallelism
The output length T is treated as a random variable: the model first predicts T, then emits all T tokens independently (see the factorization below)
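
The two factorizations from Section 2 of the paper, reconstructed in the paper's notation (x_{1:T'} is the source, y_{1:T} the target):

```latex
% Autoregressive factorization: chain rule over target tokens
p_{\mathrm{AR}}(Y \mid X) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1}, x_{1:T'})

% Naive non-autoregressive factorization: predict the length T first,
% then emit all tokens conditionally independently (and thus in parallel)
p_{\mathrm{NA}}(Y \mid X) = p_L(T \mid x_{1:T'}) \cdot \prod_{t=1}^{T} p(y_t \mid x_{1:T'})
```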
2.3. The multimodality problem
The multimodality problem: the distribution over correct target translations is highly multimodal (many valid outputs exist), and conditionally independent per-token predictions can mix modes. The paper's example: "Thank you" can translate to "Danke schön." or "Vielen Dank.", and a non-autoregressive model may blend them into outputs like "Danke Dank."
3. The Non-Autoregressive Transformer
3.3. Modeling fertility to tackle the multimodality problem
Fertility supervision comes from IBM Model 2 word alignments (computed with an external aligner).
Definition of fertilities and their benefits
- Definition: the number of times each input word is copied into the decoder input
- Provides a natural factorization that dramatically reduces the space of modes
- Makes the decoder's job easier: each position knows which source word it should translate (see the expansion sketch below)
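
A toy Python sketch (my own, with made-up tokens and fertility values) of how fertilities expand the source into the decoder input; note the target length falls out as T = sum of fertilities, so no separate length model is needed:

```python
def expand_by_fertility(src_tokens, fertilities):
    """Copy each source token fertilities[i] times to build the
    decoder input. The target length is T = sum(fertilities)."""
    assert len(src_tokens) == len(fertilities)
    decoder_input = []
    for tok, f in zip(src_tokens, fertilities):
        decoder_input.extend([tok] * f)  # f = 0 drops the word entirely
    return decoder_input

# Hypothetical example: fertility 2 lets one source word produce two
# target words; fertility 0 would drop a word.
src = ["totally", "acceptable"]
fert = [2, 1]
print(expand_by_fertility(src, fert))  # ['totally', 'totally', 'acceptable']
```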
3.4. Translation predictor and the decoding process
- Argmax decoding: take the highest-probability fertility for each input word
- Average decoding: round the expectation of each fertility's softmax distribution
- Noisy parallel decoding (NPD): sample many fertility sequences, decode each, and rescore the candidates with the autoregressive teacher (sketched below)
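
A hedged sketch of noisy parallel decoding as I read it: sample several fertility sequences, decode each candidate fully in parallel, and keep the one the autoregressive teacher scores highest. `fertility_model`, `nat_decode`, and `teacher_score` are hypothetical stand-ins, not the paper's actual interfaces.

```python
def noisy_parallel_decode(src, fertility_model, nat_decode, teacher_score, n_samples=8):
    """Noisy parallel decoding (NPD), sketched:
    1. draw n_samples fertility sequences for the source sentence,
    2. run the non-autoregressive decoder once per sample (all samples
       and all positions can be computed in parallel),
    3. keep the candidate the autoregressive teacher scores highest.
    """
    candidates = []
    for _ in range(n_samples):
        fert = fertility_model.sample(src)   # one fertility per source token
        y = nat_decode(src, fert)            # argmax over tokens, given fertilities
        candidates.append(y)
    # Rescoring is also parallelizable: scoring a *given* target needs no
    # sequential search, just one teacher forward pass per candidate.
    return max(candidates, key=lambda y: teacher_score(src, y))
```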
4. Training
~~I didn't like this section~~
4.2. Fine-Tuning
- The fine-tuning loss combines a word-level knowledge-distillation term (KL divergence against the teacher's output distribution) with reinforcement-learning (policy-gradient) and backpropagation-based terms for the non-differentiable fertility choice
- Word-level knowledge distillation uses an autoregressive Transformer as the teacher
- Requires an external fertility inference model
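
A minimal PyTorch sketch of the word-level knowledge-distillation term only (the paper's full fine-tuning objective also includes the RL and fertility terms, omitted here); shapes and names are my own assumptions:

```python
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits):
    """Word-level knowledge distillation: match the student's per-position
    output distribution to the teacher's via KL divergence.

    student_logits, teacher_logits: (batch, T, vocab) unnormalized scores.
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)    # target distribution
    student_logp = F.log_softmax(student_logits, dim=-1)
    # KL(teacher || student), summed over positions and vocab,
    # averaged over the batch
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```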
Todo
- (3.4) Read up on average decoding and noisy parallel decoding