[23] Long-Short Transformer: Efficient Transformers for Language and Vision

Open dhkim0225 opened this issue 3 years ago • 0 comments

paper code

A paper that proposes handling long-range attention and short-term attention separately, then combining them.

INTRO

Many efficient attention variants have appeared, but none of them performed well on both language and vision.
Sparse attention methods that hard-code a pattern limit what attention can express.
Low-rank projection has only been validated on the NLP side.

Contributions

Proposes the Long-Short Transformer (Transformer-LS).
State of the art on the Long Range Arena (LRA) benchmark (until S4 came out ..)
Works for both autoregressive and bidirectional models with linear complexity.

Transformer-LS

Notation

query, key, value: $Q, K, V \in \mathbb{R}^{n \times d}$
projection matrix for output: $W^O \in \mathbb{R}^{d \times d}$
i-th head (dot-product attn): $H_i \in \mathbb{R}^{n \times d_k}$
per-head dimension: $d_k = d/h$, where $h$ is the number of heads

Short-term Attention via Segment-wise Sliding Window

Uses a sliding window like LongFormer or BigBird, but the window is applied per "segment". In PyTorch this was faster than simply giving each pixel a window of length w up, down, left, and right (?)

The input sequence is split into disjoint segments of length w.
Every token attends to all tokens inside its home segment, plus w/2 extra tokens on the left and on the right of the home segment (2w keys in total).
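The two rules above can be sketched as a boolean attention mask. This is only an illustration, not the paper's implementation: the function name `segment_window_mask` is hypothetical, w is assumed to be even, n is assumed to be divisible by w, and no padding is added at the sequence boundaries.

```python
import numpy as np

def segment_window_mask(n: int, w: int) -> np.ndarray:
    """Return an (n, n) boolean mask; mask[t, j] is True if query t may attend to key j."""
    # Assumptions of this sketch: even w, and n divisible by w.
    assert w % 2 == 0 and n % w == 0
    pos = np.arange(n)
    seg_start = (pos // w) * w        # start index of each query's home segment
    lo = seg_start - w // 2           # w/2 extra keys to the left
    hi = seg_start + w + w // 2       # w/2 extra keys to the right (exclusive bound)
    keys = pos[None, :]
    return (keys >= lo[:, None]) & (keys < hi[:, None])

mask = segment_window_mask(n=16, w=4)
# Interior segments see 2w = 8 keys; the first and last segments see fewer
# because this sketch does not pad the sequence.
keys_per_query = mask.sum(axis=1)
```

Because every query in the same segment shares the same key range, the attention can be computed segment by segment as dense blocks, which is presumably why it maps well onto batched matrix multiplications.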

The image below is the sliding window attention figure from LongFormer.

Stacking these layers can of course reach long range as well, but performance drops, so a separate method is used for long-range attention.

In the autoregressive setting the attention is attached differently. Viewed as a 2D figure:

Long-range Attention via Dynamic Projection

projection parameter (learnable, as in Linformer): $W_i^P \in \mathbb{R}^{d \times r}$
dynamic low-rank projection of the i-th head: $P_i = f(K) \in \mathbb{R}^{n \times r}$ ($r \ll n$)
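A minimal sketch of the dynamic projection for one head, using the shapes above. It assumes $f$ is a softmax of $K W_i^P$ normalized over the $n$ tokens, which is my reading of the paper rather than something stated in these notes. The names `dynamic_projection` and `W_P` are hypothetical, and the per-head key/value projections are left out, so the compressed outputs here are $r \times d$ rather than $r \times d_k$.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_projection(K: np.ndarray, V: np.ndarray, W_P: np.ndarray):
    """Compress n keys/values into r summary rows with an input-dependent projection."""
    # Assumption: f(K) = softmax(K @ W_P), normalized over the n tokens (axis 0).
    P = softmax(K @ W_P, axis=0)   # (n, r)
    K_bar = P.T @ K                # (r, d)
    V_bar = P.T @ V                # (r, d)
    return P, K_bar, V_bar

rng = np.random.default_rng(0)
n, d, r = 32, 8, 4
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
W_P = rng.normal(size=(d, r))
P, K_bar, V_bar = dynamic_projection(K, V, W_P)
```

Unlike Linformer, where the projection matrix itself has a fixed length n baked in, here only `W_P` is learned and `P` is recomputed from the keys, so the projection adapts to the input and to its length.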

Aggregating Long-range and Short-term Attention

Now long and short have to be combined. Rather than giving each head a different attention mechanism and merging the heads, the paper says the design lets the local and global attention values help make sense of each other.

The attention of the i-th head at position t is computed from:
global low-rank projected keys and values: $\bar{K}_i, \bar{V}_i \in \mathbb{R}^{r \times d_k}$
local keys and values: $\tilde{K}_i, \tilde{V}_i \in \mathbb{R}^{2w \times d}$
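As I read the paper, the two sets are concatenated along the key axis and a single softmax runs over all of them, roughly $H_{i,t} = \mathrm{softmax}\left(Q_{i,t}\,[\tilde{K}_{i,t}; \bar{K}_i]^\top / \sqrt{d_k}\right)[\tilde{V}_{i,t}; \bar{V}_i]$. The sketch below illustrates that for one query vector. The function name `long_short_attention_t` is hypothetical, and the local keys and values are assumed to be already projected to $d_k$ dimensions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def long_short_attention_t(q_t, K_local, V_local, K_bar, V_bar):
    """Attention output for a single query over local (2w) plus global (r) keys."""
    d_k = q_t.shape[-1]
    K_cat = np.concatenate([K_local, K_bar], axis=0)   # (2w + r, d_k)
    V_cat = np.concatenate([V_local, V_bar], axis=0)   # (2w + r, d_k)
    scores = K_cat @ q_t / np.sqrt(d_k)                # (2w + r,)
    weights = softmax(scores)                          # one softmax over both sets
    return weights @ V_cat, weights

rng = np.random.default_rng(0)
w, r, d_k = 4, 3, 8
q_t = rng.normal(size=(d_k,))
K_local, V_local = rng.normal(size=(2 * w, d_k)), rng.normal(size=(2 * w, d_k))
K_bar, V_bar = rng.normal(size=(r, d_k)), rng.normal(size=(r, d_k))
h_t, weights = long_short_attention_t(q_t, K_local, V_local, K_bar, V_bar)
```

Each query touches only 2w + r keys, so the cost over the whole sequence grows linearly in n when w and r are fixed.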