understanding-ai
Language Modeling with Gated Convolutional Networks
https://arxiv.org/abs/1612.08083
Abstract
- proposes a gating mechanism for convolutional language models
- evaluated on WikiText-103 and the Google Billion Word benchmark
- the proposed model is highly competitive with strong recurrent models on large-scale language tasks
1. Introduction
- convolutional networks have a parallelization benefit over recurrent networks
- but cuDNN is not yet optimized for 1-D convolutions
- gated linear units (GLU) mitigate the vanishing gradient problem
- GLU outperforms the LSTM-style gating (GTU) used in PixelCNN (Oord et al., 2016)
2. Approach
- convolutions have no temporal dependencies, unlike recurrent models, so computation can be parallelized over time
- recurrent models have unbounded context, but the paper's experiments show that infinite context is not necessary
- GLU: each layer computes $h_l(X) = (X * W + b) \otimes \sigma(X * V + c)$, where $\otimes$ is element-wise multiplication (a sketch follows this list)
- Figure 1 of the paper shows the abstract model
- the model uses an adaptive softmax, which assigns higher capacity to very frequent words and lower capacity to rare words (see the second sketch after this list)
- this results in faster computation and lower memory usage
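A minimal sketch of one gated convolutional layer with GLU, assuming PyTorch; the class name, channel sizes, and causal-padding scheme are my illustration, not code from the paper. `F.glu` splits the doubled channel dimension into the linear path and the gate, which matches $h_l(X) = (X * W + b) \otimes \sigma(X * V + c)$ with $W$ and $V$ stacked into one convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv1d(nn.Module):
    """One gated conv layer: h(X) = (X*W + b) (x) sigmoid(X*V + c)."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        # One convolution produces 2*channels outputs; F.glu later splits
        # them into the linear part (X*W + b) and the gate (X*V + c).
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)
        # Left-pad so the convolution is causal: position t never sees t+1.
        self.pad = kernel_size - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))
        # F.glu halves the channel dim and computes a * sigmoid(b).
        return F.glu(self.conv(x), dim=1)

x = torch.randn(4, 128, 20)                 # (batch, channels, time)
layer = GatedConv1d(channels=128, kernel_size=4)
print(layer(x).shape)                       # torch.Size([4, 128, 20])
```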
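For the adaptive softmax, PyTorch ships `nn.AdaptiveLogSoftmaxWithLoss`; a hedged usage sketch follows, where the cutoffs and sizes are illustrative rather than the paper's settings. It assumes word ids are sorted by frequency (frequent words get low ids), which is what lets the small clusters cover the rare tail.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 50_000, 128
# Frequent words (ids < 2000) stay in the full-capacity head cluster;
# rarer buckets get progressively smaller projections (div_value=4.0),
# which is what makes training faster and lighter on memory.
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden,
    n_classes=vocab_size,
    cutoffs=[2_000, 10_000],   # illustrative frequency cutoffs
    div_value=4.0,
)

hidden_states = torch.randn(32, hidden)          # (batch, hidden)
targets = torch.randint(0, vocab_size, (32,))    # next-word ids
out = adaptive(hidden_states, targets)
print(out.loss)   # mean negative log-likelihood over the batch
```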
3. Gating Mechanisms
- the purpose of the gating mechanism is to control what information propagates through the hierarchy of layers
- compared to GTU (the LSTM-style gating), the gradient of GTU gradually vanishes because of the downscaling factors $\tanh'(X)$ and $\sigma'(X)$, while GLU has no downscaling factor on its linear path (see the comparison after this list)
- this can be thought of as a multiplicative skip connection (which helps gradients flow through the layers)
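Writing the two gradients out, as the paper does in Section 3, makes the downscaling argument concrete:

```latex
% GTU: every term is scaled down by tanh'(X) or sigma'(X), both < 1
\nabla\bigl[\tanh(X) \otimes \sigma(X)\bigr]
  = \tanh'(X)\,\nabla X \otimes \sigma(X)
  + \sigma'(X)\,\nabla X \otimes \tanh(X)

% GLU: the first term passes nabla X through scaled only by the gate,
% with no extra derivative factor -- the multiplicative skip connection
\nabla\bigl[X \otimes \sigma(X)\bigr]
  = \nabla X \otimes \sigma(X)
  + X \otimes \sigma'(X)\,\nabla X
```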
4. Experimental Setup
4.2. Training
- gradient clipping is used during training and works well (sketched below)
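A runnable sketch of a clipped training step, assuming PyTorch; the tiny stand-in model and all hyper-parameter values here are placeholders, not the paper's exact settings (those are in its Section 4.2).

```python
import torch
import torch.nn as nn

# Tiny stand-in model; the paper's actual model is the gated conv net.
model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0,
                            momentum=0.99, nesterov=True)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 16)
targets = torch.randint(0, 4, (8,))

optimizer.zero_grad()
loss_fn(model(inputs), targets).backward()
# Rescale the global gradient norm to at most max_norm before stepping;
# this keeps occasional large gradients from destabilizing training.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
optimizer.step()
```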
4.3. Hyper-parameters
- initialize layers with Kaiming initialization
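In PyTorch terms this could look like the sketch below; whether the paper's setup corresponds to the normal or uniform Kaiming variant is an assumption on my part.

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # Kaiming (He) initialization scales weights by the layer's fan-in
    # so activation variance stays stable through deep stacks of layers.
    if isinstance(module, (nn.Conv1d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Conv1d(128, 256, 4), nn.Linear(256, 128))
model.apply(init_weights)  # runs init_weights on every submodule
```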
5. Results
5.3. Non-linear Modeling
- among the non-linearities compared, bilinear layers and GLU perform best (sketched below)
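A small sketch of the gating variants compared in that section, assuming PyTorch; `a` and `b` stand for the two pre-activations $X * W + b$ and $X * V + c$, and the function names are mine, not the paper's.

```python
import torch

def glu(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # GLU: linear path gated by a sigmoid, (X*W + b) (x) sigma(X*V + c)
    return a * torch.sigmoid(b)

def gtu(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # GTU (LSTM-style): both paths squashed, hence the vanishing gradients
    return torch.tanh(a) * torch.sigmoid(b)

def bilinear(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # bilinear: two linear paths multiplied, no squashing at all
    return a * b

a, b = torch.randn(2, 8), torch.randn(2, 8)
for name, fn in [("glu", glu), ("gtu", gtu), ("bilinear", bilinear)]:
    print(name, fn(a, b).shape)
```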
TODO
- read http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf (noise-contrastive estimation; Gutmann & Hyvärinen, 2010)