understanding-ai
Bi-Directional Block Self-Attention for fast and memory-efficient sequence modeling
Paper at ICLR 2018: https://openreview.net/forum?id=H1cWzoxA- (aka Bi-BloSAN)
Abstract
- CNN focuses on local dependencies
- Bi-BloSAN achieves a better trade-off between speed and memory than RNN/CNN/SAN models
1. Introduction
- RNNs are hard to parallelize because of their sequential computation
- In CNNs, the number of operations needed to relate two positions grows with their distance
    - Convolutional Seq2Seq: linearly
    - ByteNet: logarithmically
- Bi-BloSAN inherits ideas from
    - Transformer
    - DiSAN (Directional Self-Attention Network)
        - forward/backward masks to encode directional (temporal order) information
        - feature-level (multi-dimensional) attention
- SAN (Self-Attention Network) models have a memory problem (memory grows quadratically with sequence length); Bi-BloSAN addresses it
- Core idea of Bi-BloSAN: split the sequence into blocks of equal length and process tokens block by block (see the sketch after this list)
- Bi-BloSAN benefits: memory-efficient and fast (the intra-block -> inter-block attention design makes this possible)
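The block-splitting idea can be sketched in a few lines of NumPy; the block length `r` and zero-padding of the last block are my own illustrative choices here, not necessarily the paper's exact scheme.

```python
import numpy as np

def split_into_blocks(x, r):
    """Split a sequence of token vectors into equal-length blocks.

    x: (n, d) array of n token embeddings; r: block length (a hyperparameter).
    The sequence is zero-padded so n becomes a multiple of r, then reshaped
    to (m, r, d) with m = ceil(n / r) blocks.
    """
    n, d = x.shape
    pad = (-n) % r                               # tokens needed to reach a multiple of r
    x = np.concatenate([x, np.zeros((pad, d))], axis=0)
    return x.reshape(-1, r, d)                   # (m, r, d): m blocks of r tokens

# Example: 10 tokens of dim 4 with block length 3 -> 4 blocks (last one padded)
print(split_into_blocks(np.random.randn(10, 4), r=3).shape)  # (4, 3, 4)
```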
2. Background
2.2 Vanilla Attention and Multi-dimensional Attention
- Additive attention empirically outperforms multiplicative (dot-product) attention in prediction quality, though it needs more memory and computation time (the additive, feature-level form is sketched below)
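A minimal NumPy sketch of additive, multi-dimensional (feature-level) attention, where the alignment score for each token is a vector rather than a scalar, so every feature gets its own distribution over tokens. The parameter names and the ReLU activation are assumptions for illustration, not the paper's exact choices.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_dim_additive_attention(x, q, params):
    """Feature-level (multi-dimensional) additive attention, sketched in NumPy.

    x: (n, d) token embeddings, q: (d,) query. The additive score follows
    f(x_i, q) = W^T * act(W1 x_i + W2 q + b1) + b, giving a d-dim score per
    token; softmax is taken over tokens separately for each feature.
    """
    W1, W2, W, b1, b = params
    h = np.maximum(x @ W1.T + q @ W2.T + b1, 0)   # (n, d) hidden layer (ReLU stand-in)
    scores = h @ W.T + b                          # (n, d) feature-wise alignment scores
    P = softmax(scores, axis=0)                   # attention per feature over the n tokens
    return (P * x).sum(axis=0)                    # (d,) attended summary

d = 4
params = tuple(np.random.randn(*s) * 0.1 for s in [(d, d), (d, d), (d, d), (d,), (d,)])
s = multi_dim_additive_attention(np.random.randn(6, d), np.random.randn(d), params)
print(s.shape)  # (4,)
```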
2.3 Two types of Self-attention
- token2token
    - the query x_j and the attended tokens x_i come from the same sequence x; models dependency between token pairs
- source2token
    - no explicit query; measures how important each token is to the entire sentence
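The two self-attention types can be contrasted by reusing `multi_dim_additive_attention()` from the sketch above; using a zero query to "remove" the query term in source2token is my simplification of dropping q from the score function.

```python
import numpy as np
# reuses multi_dim_additive_attention() from the previous sketch

def token2token(x, params):
    """token2token self-attention: each token x_j of the sequence x serves in
    turn as the query and attends over all tokens of the same sequence."""
    return np.stack([multi_dim_additive_attention(x, x_j, params) for x_j in x])  # (n, d)

def source2token(x, params):
    """source2token self-attention: no explicit query; it scores how important
    each token is to the whole sentence and returns one sentence vector."""
    zero_q = np.zeros(x.shape[1])    # stand-in for dropping the query term entirely
    return multi_dim_additive_attention(x, zero_q, params)                        # (d,)
```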
2.4 Masked Self-attention
- (Shen et al., 2017) adds a mask M to the alignment scores to allow attention in only one direction
- Forward mask M^fw
    - M_ij = 0 if i < j
    - M_ij = -inf otherwise
- The backward mask M^bw uses the opposite condition (i > j)
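A small sketch of the forward/backward masks defined in the bullets above; adding a mask to the alignment scores before the softmax blocks one direction.

```python
import numpy as np

def directional_masks(n):
    """Forward and backward masks for masked self-attention.

    Forward mask: M_ij = 0 if i < j, -inf otherwise, so a query token j can
    only attend to earlier tokens i < j. The backward mask flips the
    condition to i > j. The mask is added to the (n, n) alignment scores
    before the softmax, which zeroes out the disallowed positions.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    fw = np.where(i < j, 0.0, -np.inf)
    bw = np.where(i > j, 0.0, -np.inf)
    return fw, bw

fw, bw = directional_masks(4)
print(fw)  # zeros above the diagonal, -inf on and below it
```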
3. Proposed Model
m-BloSA (masked block self-attention): intra-block masked self-attention captures local context within each block, inter-block self-attention over block summaries captures long-range dependency, and a fusion step combines local and global context
Bi-BloSAN: two m-BloSA branches with forward and backward masks respectively; their outputs are combined and pooled by source2token self-attention into a sequence encoding
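A rough, assumption-heavy sketch of the intra-block -> inter-block data flow, reusing the helpers from the earlier sketches. The directional masks, the context-fusion gate, and separate parameters per stage are omitted, so this only illustrates where the memory saving comes from: attention is applied within blocks and over block summaries, never over the full n x n token grid.

```python
import numpy as np
# reuses split_into_blocks(), token2token(), source2token() from the sketches above

def m_blosa_sketch(x, r, params):
    """Simplified data flow of masked block self-attention (m-BloSA).

    1. Split the (n, d) sequence into m blocks of length r.
    2. Intra-block self-attention captures local context inside each block.
    3. Compress each block to one vector (source2token here) and run
       inter-block self-attention over the m block vectors to capture
       long-range dependency. The paper's fusion of local and global
       context into per-token outputs is omitted in this sketch.
    """
    blocks = split_into_blocks(x, r)                                  # (m, r, d)
    local = np.stack([token2token(b, params) for b in blocks])        # intra-block context
    block_vecs = np.stack([source2token(b, params) for b in local])   # (m, d) block summaries
    global_ctx = token2token(block_vecs, params)                      # inter-block context
    return local, global_ctx

# Bi-BloSAN would run two such branches (forward / backward masks), combine
# them, and pool with source2token attention to produce a sequence encoding.
```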
4. Experiments
Terminology
- x: input sequence
- q: query
- i: position (index) in the sequence
- P: attention probability matrix, i.e. the importance of each feature of each token given the evidence (x, q)