Why not use a text encoder like GPT?
According to Google's Imagen paper, increasing text encoder capacity significantly improves generation quality, which is why they use T5-XXL as the text encoder. However, T5-XXL is too large to run on a personal computer. GPT-Neo 1.3B/2.7B, on the other hand, was trained on an 800GB corpus (The Pile) and is not too large. I think it should improve the model's natural-language understanding compared with CLIP.
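
For illustration, here is a minimal sketch of how GPT-Neo hidden states could be extracted as per-token text embeddings in place of CLIP, using the public Hugging Face `EleutherAI/gpt-neo-1.3B` checkpoint. This is not the repository's actual conditioning code; the choice of `max_length=77` (mirroring CLIP's context length) and the use of the last hidden state as the conditioning signal are assumptions:

```python
# Sketch: GPT-Neo as a drop-in text encoder (assumptions noted above).
import torch
from transformers import AutoTokenizer, GPTNeoModel

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo has no pad token by default
model = GPTNeoModel.from_pretrained("EleutherAI/gpt-neo-1.3B").eval()

@torch.no_grad()
def encode_text(prompts, max_length=77):
    """Return per-token hidden states shaped (batch, max_length, hidden_dim)."""
    batch = tokenizer(
        prompts,
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    outputs = model(**batch)
    # For the 1.3B model this is (batch, 77, 2048); a cross-attention
    # conditioning layer would need to match that hidden dimension.
    return outputs.last_hidden_state

embeddings = encode_text(["a photograph of an astronaut riding a horse"])
print(embeddings.shape)
```

One practical caveat with this approach: the UNet's cross-attention layers are sized for CLIP's embedding dimension, so swapping encoders would require retraining (or at least adapting) those projection layers.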