
Bi-Directional Block Self-Attention for fast and memory-efficient sequence modeling


Paper at ICLR 2018, https://openreview.net/forum?id=H1cWzoxA- , a.k.a. Bi-BloSAN (Bi-directional Block Self-Attention Network)

Abstract

  • CNNs focus on local dependencies
  • Bi-BloSAN achieves a better efficiency-memory trade-off than RNN/CNN/SAN models

1. Introduction

  • RNNs are hard to parallelize because of their sequential computation
  • in CNNs, the number of operations needed to relate two positions grows with their distance
    • Convolutional Seq2Seq: linearly
    • ByteNet: logarithmically
  • Bi-BloSAN builds on
    • Transformer
    • DiSAN (Directional Self-Attention Network)
      • forward/backward masks
      • feature-level (multi-dimensional) attention
  • SAN (Self-Attention Network) memory grows quadratically with sequence length; Bi-BloSAN solves this
  • Core idea of Bi-BloSAN: split the sequence into blocks of equal length, apply self-attention within each block, then across blocks
  • Bi-BloSAN benefits: memory-efficient and fast (the intra-block -> inter-block structure makes this possible); see the sketch after this list
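
A minimal NumPy sketch of the block idea, under my own simplifications (mean-pooled block summaries and a plain sum where the paper uses source2token attention and a fusion gate; the function name `block_self_attention` and block length `r` are illustrative, not from the paper):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Plain scaled dot-product self-attention over a (length, dim) matrix."""
    scores = x @ x.T / np.sqrt(x.shape[-1])      # (len, len)
    return softmax(scores) @ x                   # (len, dim)

def block_self_attention(x, r):
    """Split the sequence into blocks of length r, attend inside each block,
    then let block summaries attend to each other. Memory is roughly
    O(n*r + (n/r)^2) instead of O(n^2) for full self-attention."""
    n, d = x.shape
    pad = (-n) % r                               # pad so the length divides by r
    x = np.pad(x, ((0, pad), (0, 0)))
    blocks = x.reshape(-1, r, d)                 # (m, r, d)

    # 1) intra-block self-attention
    intra = np.stack([self_attention(b) for b in blocks])
    # 2) one summary vector per block (the paper uses source2token attention;
    #    mean pooling here for brevity)
    summaries = intra.mean(axis=1)               # (m, d)
    # 3) inter-block self-attention over the summaries
    inter = self_attention(summaries)            # (m, d)
    # 4) give every token its block's global context (paper: fusion gate)
    out = intra + inter[:, None, :]
    return out.reshape(-1, d)[:n]

x = np.random.randn(10, 8)                       # 10 tokens, feature dim 8
print(block_self_attention(x, r=4).shape)        # (10, 8)
```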

2. Background

2.2 Vanilla Attention and Multi-dimensional Attention

  • Additive attention vs. multiplicative (dot-product) attention
    • additive attention usually achieves better empirical performance, while dot-product attention is more time- and memory-efficient (both scoring functions are sketched after this list)
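
A small sketch contrasting the two scoring functions; the weights `W1`, `W2`, `w`, the `tanh` nonlinearity, and the scaling are illustrative choices, not the paper's exact parameterization:

```python
import numpy as np

d, d_h = 8, 16                                   # token dim and hidden dim (illustrative)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_h, d))                   # hypothetical projection for x_i
W2 = rng.normal(size=(d_h, d))                   # hypothetical projection for q
w = rng.normal(size=d_h)                         # hypothetical score vector

def additive_score(x_i, q):
    # f(x_i, q) = w^T tanh(W1 x_i + W2 q): more parameters and computation,
    # but usually better empirical performance
    return w @ np.tanh(W1 @ x_i + W2 @ q)

def dot_product_score(x_i, q):
    # f(x_i, q) = <x_i, q> / sqrt(d): cheaper in time and memory
    return (x_i @ q) / np.sqrt(d)

x_i, q = rng.normal(size=d), rng.normal(size=d)
print(additive_score(x_i, q), dot_product_score(x_i, q))
```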

2.3 Two types of Self-attention

  • token2token
    • each token x_i attends to every other token x_j of the same sequence x
  • source2token
    • each token is scored against the entire sentence, producing a single sentence encoding (both variants are sketched after this list)
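
A sketch of the two variants; the plain dot products and the score vector `w` are my simplifications, not the paper's additive, multi-dimensional scoring:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token2token(x):
    """Each token x_i attends to the other tokens x_j of the same sequence x;
    the output is one context vector per token."""
    scores = x @ x.T / np.sqrt(x.shape[-1])      # (n, n) pairwise scores
    return softmax(scores) @ x                   # (n, d)

def source2token(x, w):
    """Each token is scored against the sentence as a whole; the output is a
    single vector summarizing the sequence."""
    scores = x @ w                               # (n,) one score per token
    return softmax(scores) @ x                   # (d,) sentence encoding

x = np.random.randn(5, 8)
w = np.random.randn(8)                           # hypothetical scoring vector
print(token2token(x).shape, source2token(x, w).shape)   # (5, 8) (8,)
```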

2.4 Masked Self-attention

  • (Shen et al., 2017) use a mask M to restrict self-attention to a single direction
  • the forward mask is (both masks are sketched after this list)
    • M_ij = 0 if i < j
    • M_ij = -inf otherwise
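
A sketch of the forward/backward masks and how they are added to the attention logits; the scaled dot-product score is an assumption for brevity (the paper scores with additive, multi-dimensional attention):

```python
import numpy as np

def directional_mask(n, direction="fw"):
    """Forward mask: M_ij = 0 if i < j, -inf otherwise; backward: i > j."""
    i, j = np.indices((n, n))
    allowed = (i < j) if direction == "fw" else (i > j)
    return np.where(allowed, 0.0, -np.inf)

def masked_softmax(logits):
    # row-wise softmax that tolerates rows where every entry is -inf
    finite_max = np.where(np.isfinite(logits), logits, -np.inf).max(-1, keepdims=True)
    e = np.exp(logits - np.where(np.isfinite(finite_max), finite_max, 0.0))
    s = e.sum(-1, keepdims=True)
    return np.divide(e, s, out=np.zeros_like(e), where=s > 0)

def masked_self_attention(x, direction="fw"):
    n, d = x.shape
    logits = x @ x.T / np.sqrt(d) + directional_mask(n, direction)
    return masked_softmax(logits) @ x

x = np.random.randn(4, 8)
print(masked_self_attention(x, "fw").shape, masked_self_attention(x, "bw").shape)
```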

3. Proposed Model

m-BloSA (masked block self-attention)

[figure: m-BloSA architecture]

Bi-BloSAN

[figure: Bi-BloSAN architecture]
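
The two figures are not reproduced here. As a very rough composition sketch (mean-pooled block summaries, a `-1e9` stand-in for -inf, and a mean-based pooling at the end are my simplifications; the paper uses source2token attention and a fusion gate), Bi-BloSAN runs a forward-masked and a backward-masked block self-attention over the sequence, concatenates the two outputs feature-wise, and pools them into a single sequence encoding:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_block_self_attention(x, r, direction="fw"):
    """Rough m-BloSA stand-in: directional attention inside each length-r block,
    then mean-pooled block summaries attend to each other."""
    n, d = x.shape
    pad = (-n) % r
    blocks = np.pad(x, ((0, pad), (0, 0))).reshape(-1, r, d)   # (m, r, d)
    i, j = np.indices((r, r))
    allowed = (i < j) if direction == "fw" else (i > j)
    mask = np.where(allowed, 0.0, -1e9)                        # -1e9 stands in for -inf
    logits = blocks @ blocks.transpose(0, 2, 1) / np.sqrt(d) + mask
    intra = softmax(logits) @ blocks                           # (m, r, d)
    summaries = intra.mean(axis=1)                             # (m, d), paper: source2token
    inter = softmax(summaries @ summaries.T / np.sqrt(d)) @ summaries
    out = intra + inter[:, None, :]                            # paper: fusion gate
    return out.reshape(-1, d)[:n]

def bi_blosan_encode(x, r=4):
    fw = masked_block_self_attention(x, r, "fw")               # forward m-BloSA
    bw = masked_block_self_attention(x, r, "bw")               # backward m-BloSA
    h = np.concatenate([fw, bw], axis=-1)                      # (n, 2d) token features
    weights = softmax(h.mean(axis=-1))                         # crude source2token stand-in
    return weights @ h                                         # (2d,) sequence encoding

x = np.random.randn(10, 8)
print(bi_blosan_encode(x).shape)                               # (16,)
```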

4. Experiments

Terminology

  • x: input sequence
  • q: query
  • i: position in the sequence
  • P: feature importance with evidence
