1day_1paper [11] Zero-Shot Text-to-Image Generation (DALL-E)

Open dhkim0225 opened this issue 3 years ago • 0 comments

The very well-known DALL-E. Its dVAE was also used as the patch encoder in #37. It generates images from text auto-regressively.

Links: paper, official code, non-official code (with training), official blog, Yannic Kilcher

With enough data, the model performs well even zero-shot.

dVAE training results. (Weaker in some areas than I expected.)

Model output.

DALL-E

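If raw pixels are used as input with a likelihood objective, the model spends most of its capacity on high-frequency detail. What matters is the low-frequency structure, so this setup wastes computation. To address this, DALL-E is organized as follows:

Compress the 256x256x3 image into a 32x32 grid of tokens, each one of 8192 codebook entries (dVAE)
Encode the caption into up to 256 BPE text tokens
Concatenate the text tokens and image tokens and feed them to the decoder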
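
A quick bit of arithmetic shows why the tokenization helps. This is only a bookkeeping sketch using the numbers from the steps above; the variable names are made up for illustration.

```python
# Shape bookkeeping for the pipeline described above (names are illustrative).
IMAGE_SIDE = 256        # input image is 256x256
CHANNELS = 3            # RGB
GRID_SIDE = 32          # dVAE outputs a 32x32 grid of tokens
CODEBOOK_SIZE = 8192    # each token is one of 8192 codebook entries
MAX_TEXT_TOKENS = 256   # BPE text tokens

pixel_values = IMAGE_SIDE * IMAGE_SIDE * CHANNELS    # 196608 values per image
image_tokens = GRID_SIDE * GRID_SIDE                 # 1024 tokens per image
compression_factor = pixel_values // image_tokens    # 192x shorter context
downsample_per_side = IMAGE_SIDE // GRID_SIDE        # each token covers an 8x8 patch
sequence_length = MAX_TEXT_TOKENS + image_tokens     # 1280 tokens seen by the transformer

print(f"pixel values per image : {pixel_values}")
print(f"image tokens per image : {image_tokens}")
print(f"context reduction      : {compression_factor}x")
print(f"transformer sequence   : {sequence_length} tokens")
```

So the transformer models 1280 tokens per (caption, image) pair instead of roughly 197k pixel values.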

The joint distribution of the two is then modeled with the ELBO. The model's lower bound is as follows. (~image source~; ~it should just be x, the x squared looks like a mistake in the blog's figure~)

Here x denotes the images, y the captions, and z the image tokens.
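
For reference, the bound in the paper has the following form. This is transcribed from the paper's Eq. 1, so check it against the paper before relying on it. The symbols not defined above are taken from the paper: q_φ is the dVAE encoder's distribution over image tokens, p_θ is the dVAE decoder, p_ψ is the joint prior over text and image tokens, and β is the KL weight.

```latex
\ln p_{\theta,\psi}(x, y) \;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid x)}
\Big[ \ln p_\theta(x \mid y, z)
      \;-\; \beta \, D_{\mathrm{KL}}\big( q_\phi(y, z \mid x),\; p_\psi(y, z) \big) \Big]
```

Stage 1 below optimizes φ and θ with p_ψ held at a uniform prior; Stage 2 freezes φ and θ and optimizes ψ.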

Stage 1: Learning the Visual Codebook

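Maximize the ELBO with respect to φ and θ. The initial prior pψ is a uniform categorical distribution over the 8192 codebook entries. qφ is a categorical distribution parameterized by 8192 logits at each grid position. Because qφ is discrete, sampling from it is not differentiable, so the Gumbel-softmax relaxation is used instead.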
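
A minimal sketch of the Gumbel-softmax relaxation for a single grid position, in plain Python. This is not the official implementation; `gumbel_softmax` and the toy 4-entry logits are hypothetical names and values for illustration (the real codebook has 8192 entries).

```python
import math
import random


def gumbel_softmax(logits, temperature, rng=random):
    """Return a relaxed one-hot sample from the categorical defined by `logits`."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))          # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)                  # fixed seed, only for reproducibility
logits = [2.0, 0.5, -1.0, 0.0]          # toy logits, 4 entries instead of 8192
soft = gumbel_softmax(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax(logits, temperature=1.0 / 16, rng=rng)
print("tau = 1    :", [round(p, 3) for p in soft])
print("tau = 1/16 :", [round(p, 3) for p in sharp])
```

As the temperature goes down, the output moves toward a one-hot vector, which is why a low final temperature brings the relaxed ELBO close to the true one.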

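Three things matter for stable training:

Annealing schedules for the relaxation temperature and the step size. Annealing the temperature down to 1/16 makes the relaxed validation ELBO nearly match the true validation ELBO.
1x1 convolutions at the end of the encoder and the start of the decoder. Reducing the receptive field of the convolutions around the relaxation generalizes better to the true ELBO.
Multiplying the resblock activations of the encoder and decoder by a small constant. This keeps training stable at initialization.

In addition, the authors found empirically that setting the KL weight to beta = 6.6 gives a smaller reconstruction error at the end of training.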
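
A rough sketch of what such an annealing schedule could look like. The notes above only give the end values (temperature 1/16, beta 6.6); the cosine shape, the starting values, and the step counts below are assumptions for illustration, and `cosine_anneal` is a hypothetical helper.

```python
import math


def cosine_anneal(step, start, end, num_steps):
    """Move from `start` to `end` along a half cosine over `num_steps`, then hold `end`."""
    progress = min(max(step / num_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def temperature_at(step):
    # Assumed: relaxation temperature goes from 1 down to 1/16 over 150k updates.
    return cosine_anneal(step, start=1.0, end=1.0 / 16, num_steps=150_000)


def kl_weight_at(step):
    # Assumed: KL weight ramps from 0 up to 6.6 over the first 5k updates.
    return cosine_anneal(step, start=0.0, end=6.6, num_steps=5_000)


for step in (0, 2_500, 5_000, 75_000, 150_000):
    print(f"step {step:>7}: tau = {temperature_at(step):.4f}, beta = {kl_weight_at(step):.2f}")
```

Both values are held constant once their schedule ends, so later training runs at the final temperature and KL weight.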

Stage 2: Learning the Prior

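φ and θ are frozen and the prior is learned by maximizing the ELBO with respect to ψ. The model is very similar to GPT-3; the only difference is that the input is the concatenation [text tokens, image tokens]. The image tokens are obtained by argmax sampling from the dVAE logits (no Gumbel noise is added). BPE dropout is applied at 10%. A sparse transformer with 12 billion parameters is used.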
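
A small sketch of how the Stage 2 input could be assembled. The function names, the padding id, and the id-offset convention are assumptions for illustration and not the official code; the text vocabulary size of 16384 is the value reported in the paper, not something stated in these notes.

```python
def argmax_tokens(logits_grid):
    """Pick the most likely codebook index at each grid position (no Gumbel noise)."""
    return [max(range(len(logits)), key=logits.__getitem__) for logits in logits_grid]


def build_sequence(text_tokens, image_tokens, text_vocab_size, max_text_len=256, pad_id=0):
    """Concatenate padded text tokens with image tokens into one token stream."""
    padded = list(text_tokens[:max_text_len])
    padded += [pad_id] * (max_text_len - len(padded))    # assumed padding scheme
    # Assumed convention: shift image ids past the text vocabulary so ids never collide.
    return padded + [text_vocab_size + token for token in image_tokens]


# Toy example: 4 grid positions with a 3-entry codebook instead of 1024 x 8192.
logits_grid = [
    [0.1, 2.0, -1.0],
    [3.0, 0.0, 0.5],
    [-0.2, 0.1, 0.9],
    [0.0, 1.5, 1.4],
]
image_tokens = argmax_tokens(logits_grid)
sequence = build_sequence([5, 9, 2], image_tokens, text_vocab_size=16384, max_text_len=4)
print("image tokens:", image_tokens)
print("sequence    :", sequence)
```

With the real sizes (256 text tokens, 32x32 image tokens) the resulting stream has 1280 entries, and the transformer is trained to predict it left to right.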