CLIP-featurevis
Could we try to visualize the features of T5 (Text-To-Text Transfer Transformer)?
Imagen's key finding is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model.
They also find that while the T5-XXL and CLIP text encoders perform similarly on simple benchmarks such as MS-COCO, human evaluators prefer the T5-XXL encoder over the CLIP text encoder in both image-text alignment and image fidelity on DrawBench, a set of challenging, compositional prompts.
So, could we try to visualize the features of T5 in a similar way?
T5: https://github.com/google-research/text-to-text-transfer-transformer
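As a possible starting point (not the method used by this repo, which visualizes CLIP's image-side features), one could inspect T5's per-token encoder activations. This is a minimal sketch assuming the Hugging Face `transformers` and `sentencepiece` packages are installed; the checkpoint name `t5-small` is just an illustrative choice:

```python
# Sketch: extract per-token encoder features from T5 via Hugging Face
# transformers. Assumes `transformers`, `torch`, and `sentencepiece`
# are installed; "t5-small" is an illustrative checkpoint choice.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")
model.eval()

text = "a photo of an astronaut riding a horse"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # output_hidden_states=True returns activations from every encoder
    # layer, not just the final one.
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, d_model) -- the first entry is the token embeddings.
features = outputs.last_hidden_state
print(features.shape)          # (1, seq_len, d_model); d_model=512 for t5-small
print(len(outputs.hidden_states))
```

From here, one could e.g. plot per-layer activation norms for each token, or search a corpus for the text inputs that maximally activate a given hidden unit, as a text-side analogue of feature visualization.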