Some questions...

Open CuddleSabe opened this issue 2 years ago • 1 comments

Hi, I want to do super-resolution by replacing the CLIP text feature with the CLIP image feature. I assumed the image feature space and the text feature space are the same, so I thought it would work. But when I tried it, the model only outputs white images. What is going wrong?

Sep 21 '23 07:09 CuddleSabe

Hello, I think the image feature space and the text feature space are not the same. Although CLIP pulls the two spaces as close together as possible, a gap still remains between them. Moreover, GALIP is trained on text features taken before normalization, so directly replacing the text features with image features is not appropriate. You can modify GALIP's code and retrain a version conditioned on image features.
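
To see the mismatch before retraining, it may help to compare the two kinds of features directly. The snippet below is only a rough sketch and is not taken from GALIP's code: the helper names (`l2_norm`, `cosine_similarity`, `feature_gap_report`) are made up for illustration, and the two vectors are placeholder numbers standing in for an unnormalized text feature and a normalized image feature. In practice you would substitute the features your own pipeline extracts from CLIP.

```python
import math


def l2_norm(v):
    """Euclidean length of a feature vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


def feature_gap_report(text_feat, image_feat):
    """Summarize scale and direction differences between two features."""
    return {
        "text_norm": l2_norm(text_feat),
        "image_norm": l2_norm(image_feat),
        "cosine": cosine_similarity(text_feat, image_feat),
    }


# Placeholder values (assumption): an unnormalized "text" feature and a
# unit-length "image" feature. Replace with real CLIP features.
text_feat = [3.0, 4.0, 0.0]
image_feat = [0.6, 0.0, 0.8]

report = feature_gap_report(text_feat, image_feat)
print(report)
```

If the norms differ a lot, or the cosine similarity of a matching image/text pair is well below 1, the generator is being fed inputs outside the range it saw during training, which would be consistent with degenerate outputs such as blank images.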

Sep 26 '23 08:09 tobran