stable-diffusion T5 instead of CLIP

T5 instead of CLIP

Open betterze opened this issue 2 years ago • 2 comments

Dear stable-diffusion team,

Thank you for sharing this great work. I really like it.

Have you considered using a pretrained T5 encoder instead of the pretrained CLIP text encoder? According to the Imagen paper, T5-XXL outperforms CLIP on challenging prompts:

''' We also find that while T5-XXL and CLIP text encoders perform similarly on simple benchmarks such as MS-COCO, human evaluators prefer T5-XXL encoders over CLIP text encoders in both image-text alignment and image fidelity on DrawBench, a set of challenging and compositional prompts. '''

Thank you for your help.

Best Wishes,

Zongze

Aug 19 '22 07:08 betterze

Hi, I'm planning to replace the CLIP text encoder with a T5 model. However, T5 has an encoder-decoder structure. Which layer of T5 should I use as the feature output for the text tokens? Thanks.

Jan 29 '24 02:01 zyx1213271098

> Hi, I'm planning to replace the CLIP text encoder with a T5 model. However, T5 has an encoder-decoder structure. Which layer of T5 should I use as the feature output for the text tokens? Thanks.

Hello, have you tried it? How did it work?

Mar 12 '24 00:03 YZBPXX
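
On the question of which T5 layer to use: one common choice, consistent with the Imagen quote above referring to "T5-XXL encoders", is to drop the decoder entirely and take the final hidden states of the encoder stack as per-token text features. The sketch below is not something the stable-diffusion maintainers have confirmed. It assumes the Hugging Face `transformers` library, uses a tiny randomly initialized `T5Config` so it runs without downloading weights, and uses made-up token ids in place of real tokenizer output. The linear projection to width 768 (the text-context width of Stable Diffusion v1) is an illustrative assumption; the denoising network was trained on CLIP embeddings, so it would most likely need fine-tuning or retraining to make use of T5 features.

```python
import torch
from torch import nn
from transformers import T5Config, T5EncoderModel

# Tiny random-weight config so the sketch runs offline (assumption for illustration).
# In practice one would load pretrained weights instead, e.g.
# T5EncoderModel.from_pretrained("t5-small") together with its tokenizer.
config = T5Config(vocab_size=128, d_model=64, d_kv=16, d_ff=128,
                  num_layers=2, num_heads=4)
encoder = T5EncoderModel(config).eval()  # encoder stack only, no decoder

# Hypothetical token ids standing in for tokenizer output: batch of 2, length 8.
input_ids = torch.randint(0, config.vocab_size, (2, 8))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    out = encoder(input_ids=input_ids, attention_mask=attention_mask)

# Final encoder hidden states: one feature vector per text token.
text_features = out.last_hidden_state  # shape (batch, seq_len, d_model)

# Assumed adapter: map T5 width to the cross-attention context width (768 in SD v1).
context_dim = 768
proj = nn.Linear(config.d_model, context_dim)
context = proj(text_features)  # shape (batch, seq_len, context_dim)

print(text_features.shape, context.shape)
```

With a real checkpoint, `d_model` comes from the checkpoint's own config, and `attention_mask` should come from the tokenizer so that padding tokens are masked.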
