[94] Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Open dhkim0225 opened this issue 2 years ago • 0 comments

Trains a seq2seq model at large scale. AllenAI released a multi-task benchmark called GRIT, and the authors claim this is the first work to solve all of its tasks with a single model. It literally handles an enormous number of tasks with one model.

Related works

This section is worth reading. Approaches like 12-in-1 use a shared backbone, which means each task still needs its own specialized head.

On the NLP side, T5, GPT-3, PaLM, OPT, Gopher, and others work well without any task-specific heads.

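What about vision? Feeding text and image inputs together and processing them jointly is a well-known approach (whether through cross-attention or by mixing them in the encoder as SimVLM does). SimVLM, BLIP, and similar models exist, but they cannot produce visual outputs (so no segmentation, for example).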

GPV-1 was built to generate bboxes (object detection), and GPV-2 uses GPV-1's text decoder to support bbox input and bbox output (so it can handle region captioning). Location tokens can also be used, as in VL-T5, OFA, and pix2seq.
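A minimal sketch of the location-token idea, to make it concrete: box coordinates are normalized by the image size and quantized into a fixed number of bins, and each bin becomes a vocabulary token. The token format `<loc_N>`, the bin count of 1000, and the (x0, y0, x1, y1) ordering are assumptions for illustration, not the exact scheme of any of the models above.

```python
def bbox_to_location_tokens(bbox, image_size, num_bins=1000):
    """Quantize a pixel-space box (x0, y0, x1, y1) into discrete location tokens.

    num_bins and the "<loc_N>" token format are hypothetical choices.
    """
    width, height = image_size
    x0, y0, x1, y1 = bbox
    tokens = []
    for value, extent in ((x0, width), (y0, height), (x1, width), (y1, height)):
        # Normalize to [0, 1], then map to an integer bin in [0, num_bins - 1].
        frac = min(max(value / extent, 0.0), 1.0)
        bin_index = min(int(frac * num_bins), num_bins - 1)
        tokens.append(f"<loc_{bin_index}>")
    return tokens


def location_tokens_to_bbox(tokens, image_size, num_bins=1000):
    """Invert the quantization, using each bin's center as the coordinate."""
    width, height = image_size
    bins = [int(tok[len("<loc_"):-1]) for tok in tokens]
    extents = (width, height, width, height)
    return tuple((b + 0.5) / num_bins * e for b, e in zip(bins, extents))


tokens = bbox_to_location_tokens((0, 0, 320, 240), image_size=(640, 480))
box = location_tokens_to_bbox(tokens, image_size=(640, 480))
```

Because the box is now just four extra tokens, a plain text decoder can emit it (detection) or an encoder can consume it (region captioning) without a dedicated box head. The cost is a quantization error of at most one bin width per coordinate.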

Gato (A Generalist Agent) extends the input to things like button presses in Atari games and robot joint movements, and Flamingo goes as far as supporting interleaved text, image, and video inputs.

In a separate line of work there are Perceiver-IO and Uni-Perceiver, but these cannot solve generative tasks either.

One-for-All (OFA) is concurrent work, and its approach is very similar to Unified-IO. The differences are:

OFA uses a CNN, whereas this work uses T5.
This work runs experiments at a larger scale.

UViM is also concurrent work. Instead of a pretrained d-VAE it trains a separate second model, and Unified-IO covers a much wider range of tasks.

Model

Four model scales (T5)

XL - 24 layers, 2.8B params
Large - 24 layers, 776M params
Base - 12 layers, 241M params
Small - 6 layers, 71M params

Training has two steps, pretraining followed by multi-task finetuning, and no additional task-specific finetuning is performed. Pretraining uses several objectives.

For text-only data (Common Crawl): T5-style training
For images: BEiT-style training (75% masking)
For caption data: prepend "An image of" as a prompt and train SimVLM-style
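The three pretraining objectives above can be sketched roughly as follows. This is an illustrative sketch, not the authors' data pipeline: the function names, the `<extra_id_N>` sentinel format (borrowed from common T5 tokenizers), the patch count, and the assumption that the prompt goes on the input side while the caption is the target are all choices made here for illustration.

```python
import random


def t5_span_corrupt(tokens, spans):
    """T5-style denoising: replace each (start, end) span with a sentinel.

    The target lists each sentinel followed by the tokens it replaced,
    and ends with one final sentinel.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


def sample_masked_patches(num_patches, mask_ratio=0.75, seed=0):
    """Pick which image patches to hide; the model must reconstruct them."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def caption_example(caption, prompt="An image of"):
    """Prefix-prompted captioning: the prompt is the text input (assumed),
    and the caption is what the decoder learns to generate."""
    return {"text_input": prompt, "text_target": caption}


words = "Thank you for inviting me to your party last week".split()
inputs, targets = t5_span_corrupt(words, spans=[(2, 4), (8, 9)])
masked = sample_masked_patches(num_patches=256)  # 192 of 256 patches hidden
example = caption_example("a dog catching a frisbee")
```

All three objectives reduce to the same shape: a corrupted or partial input on the encoder side and the missing content as the decoder target, which is what lets one seq2seq model train on text, images, and captions together.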

Multi-task training simply throws in 80 public datasets all together. They fall into four groups:

classical CV tasks
1. image classification
2. object detection
3. instance segmentation
4. depth/surface normal estimation
image synthesis tasks
1. image in-painting
2. image synthesis from caption/segmentation
V&L tasks
1. VQA
2. image captioning
3. visual commonsense reasoning
NLP tasks
1. GLUE
2. SQuAD
3. and others