stable-diffusion
T5 instead of CLIP
Dear stable-diffusion team,
Thank you for sharing this great work. I really like it.
Have you considered using a pretrained T5 encoder instead of the pretrained CLIP text encoder? According to the Imagen paper, human evaluators prefer T5-XXL over CLIP on challenging prompts:
''' We also find that while T5-XXL and CLIP text encoders perform similarly on simple benchmarks such as MS-COCO, human evaluators prefer T5-XXL encoders over CLIP text encoders in both image-text alignment and image fidelity on DrawBench, a set of challenging and compositional prompts. '''
Thank you for your help.
Best Wishes,
Zongze
Hi, I'm planning to replace the CLIP text encoder with a T5 model. However, T5 has an encoder-decoder structure. Which layer of T5 should I use as the feature output for the text tokens? Thanks.
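For what it's worth, one common option (and, as far as I understand, what Imagen does) is to drop the decoder entirely and use the final hidden states of the T5 encoder as the per-token text features. Below is a minimal sketch using Hugging Face `transformers`, which provides an encoder-only class, `T5EncoderModel`. The tiny config and the random token ids are assumptions for illustration only, so the snippet runs without downloading weights; in practice you would load a pretrained checkpoint with `T5EncoderModel.from_pretrained(...)` and tokenize real prompts.

```python
import torch
from transformers import T5Config, T5EncoderModel

# Assumption: a tiny, randomly initialised config so the sketch runs offline.
# A real setup would load pretrained weights via T5EncoderModel.from_pretrained(...).
config = T5Config(
    vocab_size=128, d_model=64, d_kv=16, d_ff=128, num_layers=2, num_heads=4
)
encoder = T5EncoderModel(config).eval()  # encoder stack only, no decoder

# Hypothetical token ids standing in for a tokenized prompt (batch of 1, 8 tokens).
input_ids = torch.randint(0, config.vocab_size, (1, 8))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    out = encoder(input_ids=input_ids, attention_mask=attention_mask)

# Per-token features from the last encoder layer: (batch, seq_len, d_model).
text_features = out.last_hidden_state
print(text_features.shape)
```

Note that the width of these features is `d_model`, so the U-Net cross-attention context dimension would have to match it (or you would add a projection layer), and the diffusion model would need to be retrained or fine-tuned on the new conditioning.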
Hello, have you tried it? How did it work?