
Positional Stickiness

Open afiaka87 opened this issue 2 years ago • 25 comments

For lack of a better term, I've noticed during training that the VitGAN tends to get stuck on one, two, or three "positional blobs" (I don't see four happen very often, if at all).

Does this match your experience? Effectively, what I'm seeing is that the VitGAN needs to slide from one generation to the next in its latent space. In doing so, it seems to find that it's easier to just create two "spots" in the image that are highly likely to contain specific concepts from each caption.

Any idea if this is bad or good? In my experience with the "chimera" examples, it seems to hurt things.

[images: progress_0000422000, progress_0000421900, progress_0000421400]

I hope you can see what I mean - there's one position in particular that seems designated for the "head" of the animal. But it also biases the outputs from other captions; for instance -

Caption: "tri - x 4 0 0 tx a cylinder made of coffee beans . a cylinder with the texture of coffee beans ." [image: progress_0000418200]

afiaka87 avatar Jul 27 '21 16:07 afiaka87

Yes, I see exactly what you mean and have noticed this in all the models I trained (both VitGAN and mlp_mixer). This was the reason I started working on the diversity loss, by the way, but it does not work very well so far: quality drops, and if the coefficient is small enough the model only starts to output the same objects with different palettes when the noise vector changes, while if the coefficient is bigger it just outputs random textures unrelated to the prompt. I also thought, similarly, that it was a kind of shortcut the model finds to optimize the loss, and the first thing I could think of to fix it was to explicitly add an additional constraint (diversity) to the loss.

mehdidc avatar Jul 27 '21 17:07 mehdidc

It could also be related to the architectures themselves (VitGAN and mlp_mixer), I'm not sure. Another way to make the constraint even more explicit is to add a diversity loss on the "shapes" of the objects (rather than the "style") to avoid the model sticking to one shape.

mehdidc avatar Jul 27 '21 17:07 mehdidc

@mehdidc Interesting, so this happens with mlp_mixer too? Hm. There are, of course, examples that sort of defy the notion that it's a universal problem - if the captions are sufficiently different you can get the model to output something quite different. For instance, the "photo of san francisco" captions tend to produce wildly different outputs - although you can sort of see that it's still getting caught up on the same position:

[image: progress_0000423500]

I think in this example - rather than "animal head goes here, other animal's body goes there" - it has changed to "foreground goes here, background goes there".

afiaka87 avatar Jul 27 '21 17:07 afiaka87

"For instance - the "photo of san francisco" captions tend to produce wildly different outputs " Ah okay, so what are the text prompts here where you observed different outputs, you mix "photo of san francisco" with the different attributes that you mentioned before such as "8k resolution"?

mehdidc avatar Jul 27 '21 17:07 mehdidc

@mehdidc Indeed - but I saw very similar results from the "prepend all captions with minimalism" checkpoint, and I believe it would still happen with no modifications to the captions.

afiaka87 avatar Jul 27 '21 17:07 afiaka87

Generating a video from the training samples is maybe the best way to illustrate the issue:

ffmpeg -framerate 15 -pattern_type glob -i '*.png' -c:v libx264 -pix_fmt yuv420p training_as_video.mp4

afiaka87 avatar Jul 27 '21 17:07 afiaka87

Yes exactly!

mehdidc avatar Jul 27 '21 18:07 mehdidc

Another way to see it is through interpolation; here is a video showing interpolation (of text-encoded features) from "the sun is a vanishing point absorbing shadows" to "the moon is a vanishing point absorbing shadows": https://drive.google.com/file/d/16yreg0jajmC4qwJGmiGJypp_L_VQbhq5/view?usp=sharing

mehdidc avatar Jul 27 '21 20:07 mehdidc

Not an answer - but perhaps a direction for enquiry: the repo by @nerdyrodent, https://github.com/nerdyrodent/VQGAN-CLIP

In the README, he spells out a way to alter the weights of the words being passed in:

python generate.py -p "A painting of an apple in a fruit bowl | psychedelic | surreal:0.5 | weird:0.25"

For me, this coefficient tacked on to the end provided a way to direct the output with more control:

tri - x 400tx a cylinder made of coffee beans :0.5 . a cylinder with the texture of coffee beans

Need to dig into the code - but this may help transcend this blob problem.
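
Schematically, that kind of per-prompt weighting usually enters the loss along these lines (a sketch only, not the actual code in nerdyrodent/VQGAN-CLIP; the function name is a placeholder):

```python
import torch

def weighted_clip_loss(image_feats, prompt_feats, weights):
    """image_feats: (D,) normalized CLIP embedding of the generated image.
    prompt_feats: list of (D,) normalized CLIP text embeddings, one per sub-prompt.
    weights: per-prompt weights, e.g. [1.0, 1.0, 0.5, 0.25]."""
    loss = torch.zeros(())
    for feats, w in zip(prompt_feats, weights):
        # Each sub-prompt pulls the image toward it in proportion to its weight.
        loss = loss + w * (1.0 - image_feats @ feats)
    return loss
```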

johndpope avatar Jul 27 '21 20:07 johndpope

Another possible direction: @kevinzakka has a colab notebook here for getting saliency maps out of CLIP from a specific text-image prompt. https://github.com/kevinzakka/clip_playground/blob/main/CLIP_GradCAM_Visualization.ipynb

[image: itsthecat]

afiaka87 avatar Jul 29 '21 16:07 afiaka87

I actually like this so-called "stickiness" effect. It creates some kind of continuity over large amounts of generated images.

Is there a way to influence the "sticky spatial shape"? Is it shaped during training, or can it be modified somehow?

mesw avatar Nov 05 '21 12:11 mesw

@mesw As far as I can tell, it's inherited from the biases present in CLIP, and perhaps from the specific captions used to train this repo.

@mehdidc Oh yes, I forgot to mention: @crowsonkb suggested that the loss used in CLOOB, "InfoLOOB", would be appropriate for these methods and may help output more diverse generations for a given caption.

The linked blog mentions the problem of "explaining away", which seems correlated with this issue, perhaps?

edit: implementation here: https://github.com/ml-jku/cloob [screenshot]

afiaka87 avatar Nov 12 '21 12:11 afiaka87

I put in the whole of /usr/share/dict and did a nearest neighbour search over Z for the term "nebula", which gave the grid below - I think it shows a hint as to the positional stickiness. You get a similar result for any word you can think of: lots of images that look very similar.
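
Roughly, the search amounts to something like this (a sketch assuming OpenAI's `clip` package; the word-list path and batch size are just illustrative):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Word list path is illustrative; any list of candidate terms works.
with open("/usr/share/dict/words") as f:
    words = [w.strip() for w in f if w.strip()]

# Encode all words into CLIP's text feature space, in batches.
feats = []
with torch.no_grad():
    for i in range(0, len(words), 256):
        tokens = clip.tokenize(words[i:i + 256], truncate=True).to(device)
        batch = model.encode_text(tokens)
        feats.append(batch / batch.norm(dim=-1, keepdim=True))
    feats = torch.cat(feats)

    query = model.encode_text(clip.tokenize(["nebula"]).to(device))
    query = query / query.norm(dim=-1, keepdim=True)

# Cosine similarity of normalized features; take the closest terms.
sims = (feats @ query.T).squeeze(1)
print([words[i] for i in sims.topk(20).indices.tolist()])
```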

What I would like to see is a 'seed' input alongside the text input, which provokes the net into producing a completely different output for the same textual input. Is there a way to train that? Could you, for example, have a loss which makes it so that if the net is fed prompt=(foobar) crossed with seed=(1, 2, 3, 4, 5), the loss is higher the closer the GAN coordinates for the different seeds are to each other?

You can provoke the net into producing different images by feeding random vectors to H, but it's not the same! :D

nebula, meteoric, interstellar, insignificant, splatters
astronomers, elemental, geode, trinities, transcendence
foretold, rift, glitters, atom, glistens
enthralls, divinity, angelically, opalescent, mystique

[image: grid of generations for the words above]

pwaller avatar Nov 12 '21 16:11 pwaller

It's interesting that all of the above images have a black spot in the bottom right. What's that about? :)

pwaller avatar Nov 12 '21 16:11 pwaller

I think the problem might be that you aren't modeling any randomness in the output at all.

Your input for a given prompt is deterministic (based on what CLIP says the vector is), and you're then feeding that into the model and optimizing it to produce a single image. Even though that image target is changing every time, the model doesn't know that it's supposed to be able to generate multiple random images for one given prompt.

I think the best way to tackle the problem would be to add a latent vector input to the model. Just e.g. a 32-dim random normal sample. Then the model would hopefully learn that different normal samples with the same fixed prompt mean different images. That way you could explicitly change the overall structure of the image by sampling different latents.
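
A rough sketch of that idea (the names, dimensions, and projection layer here are placeholders, not the actual architecture in this repo):

```python
import torch
import torch.nn as nn

class LatentConditionedGenerator(nn.Module):
    """Wrap an existing text-features -> VQGAN-latent generator so that a random
    latent vector is concatenated with the CLIP text features; different latents
    for the same prompt can then map to different images."""
    def __init__(self, base_generator: nn.Module, clip_dim: int = 512, latent_dim: int = 32):
        super().__init__()
        self.latent_dim = latent_dim
        # Project the concatenated vector back to the size the base generator expects.
        self.proj = nn.Linear(clip_dim + latent_dim, clip_dim)
        self.base = base_generator

    def forward(self, text_feats: torch.Tensor, z: torch.Tensor = None):
        if z is None:
            z = torch.randn(text_feats.shape[0], self.latent_dim, device=text_feats.device)
        return self.base(self.proj(torch.cat([text_feats, z], dim=-1)))
```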

JCBrouwer avatar Nov 13 '21 08:11 JCBrouwer

@afiaka87 Thanks, will definitely give CLOOB a try; pre-trained models seem to be available (https://ml.jku.at/research/CLOOB/downloads/checkpoints/)

mehdidc avatar Nov 13 '21 09:11 mehdidc

@pwaller Just to be sure I understand: you computed CLIP text features for all words in /usr/share/dict, then did a nearest-neighbour search for the word 'nebula' in the CLIP text feature space - is that correct? Is that what is shown in the grid of images?

@pwaller @JCBrouwer Yes, exactly - generating a diverse set of images from text using a random seed would be a must. I did a simple attempt by concatenating a random normal vector with the CLIP features, generating a set of images from the same text, then computing a diversity loss using VGG features. This kind of diversity loss has already been tried in feed-forward texture synthesis (https://arxiv.org/pdf/1703.01664.pdf). However, it hasn't worked so well so far; it could be that a different feature space (other than VGG) is needed for diversity, or a fancier way to incorporate randomness into the architecture than just concatenating CLIP text features with a random normal vector. The diversity loss + randomness is already possible in the current code - you can check the notebook for an explanation of how to do it - but as I said, I find the results so far are not good enough: the overall quality is reduced, and the diversity loss coefficient has to be small enough, otherwise the images are not recognizable anymore and just end up resembling textures.
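
For concreteness, the diversity term looks roughly like this (a sketch only; the VGG layer, input normalization, and the way the term is weighted against the CLIP loss are illustrative, not exactly what is in the repo):

```python
import torch
import torchvision.models as models

# Frozen VGG16 feature extractor (layer cutoff is an illustrative choice).
vgg = models.vgg16(pretrained=True).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def diversity_loss(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, H, W), N generations of the same prompt with different seeds.
    Returns the mean pairwise cosine similarity of their VGG features, so
    minimizing it pushes the generations apart (ImageNet normalization omitted)."""
    feats = vgg(images).flatten(start_dim=1)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sim = feats @ feats.T
    n = sim.shape[0]
    off_diag = sim[~torch.eye(n, dtype=torch.bool, device=sim.device)]
    return off_diag.mean()

# total_loss = clip_loss + diversity_coef * diversity_loss(generated_images)
```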

Here is one attempt with the word 'nebula' on a model trained with diversity loss (each image is a different seed):

[image: 'nebula' samples, one per seed]

another with 'castle in the sky':

[image: 'castle in the sky' samples, one per seed]

mehdidc avatar Nov 13 '21 10:11 mehdidc

> @pwaller Just to be sure I understand: you computed CLIP text features for all words in /usr/share/dict, then did a nearest-neighbour search for the word 'nebula' in the CLIP text feature space - is that correct? Is that what is shown in the grid of images?

Yes, that's correct. Sorry I missed this message before.

Is there a model published trained with the diversity loss? I don't have the time/capability to easily train it myself for the foreseeable future. I would be very curious to take a look at outputs with the possibility of varying the seed.

Edit: Hold on, have I misunderstood? Is it possible to access a pretrained model that uses the diversity loss via omegaconf somehow?

pwaller avatar Jan 02 '22 23:01 pwaller

@pwaller I have trained some models using the diversity loss, like the one above, but haven't made them publicly available yet. Here is a link for the model above. You can try it using e.g.,

python main.py test model_with_diversity.th <your_text> --nb-repeats=8

to sample 8 independent images given the same text.

Recently, I have experimented with a different way to incorporate diversity: train a probabilistic model (using normalizing flows from https://compvis.github.io/net2net/) that maps CLIP text features to CLIP image features and, separately, train a model that maps CLIP image features to the VQGAN latent space, trained exactly as is currently done in this repo except that the input is image features rather than text features. Combining the two gives a way to generate different images from the same text. Example with "Castle in the sky":

[image: castle_in_the_sky]

It seems it does not always fit the text correctly; this could probably be improved by training a better model, or at least by using CLIP-based re-ranking.
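
Schematically, combining the two stages looks something like this (all model objects and method names here are placeholders, not the repo's actual API):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def sample_images(prompt, flow_prior, imfeats_to_latent, vqgan_decoder, n=8):
    """flow_prior: normalizing flow sampling CLIP image features given text features.
    imfeats_to_latent: feed-forward model from CLIP image features to VQGAN latents.
    vqgan_decoder: maps latents to pixels. All three are placeholders."""
    with torch.no_grad():
        text_feats = clip_model.encode_text(clip.tokenize([prompt]).to(device))
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        # Each sample from the flow is a different plausible image embedding,
        # which is where the diversity for a fixed prompt comes from.
        image_feats = flow_prior.sample(text_feats.repeat(n, 1))
        latents = imfeats_to_latent(image_feats)
        return vqgan_decoder(latents)
```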

mehdidc avatar Jan 04 '22 12:01 mehdidc

That looks amazing. Thanks for sharing, really excited to go and play with it. :rocket:

pwaller avatar Jan 04 '22 17:01 pwaller

Ah, I see I misunderstood - model_with_diversity.th did not produce the above image. Any plans to publish the above model? It looks very fun. Your work here is a lot of fun - many thanks for sharing this with the world :)

pwaller avatar Jan 04 '22 21:01 pwaller

@pwaller Happy to know you find it useful and have fun with it :) I have too. Sure, I just pushed a new branch, https://github.com/mehdidc/feed_forward_vqgan_clip/tree/flow, with support for normalizing flows from text features to image features. First, you need to install net2net from https://github.com/CompVis/net2net. Links to the two models are provided below:

  • Link to the model that maps CLIP text features to CLIP image features
  • Link to the model that maps CLIP image features to the VQGAN latent space

Once you download them, you can try to generate using e.g. the following:

python main.py test cc12m_imfeats2im.th "castle in the sky" --flow-model-path=cc12m_textfeats2imfeats.th --nb-repeats=20 --images-per-row=5

Let me know if it does not work for you.

mehdidc avatar Jan 05 '22 00:01 mehdidc

> Let me know if it does not work for you.

I get main.py test: Unknown option '--flow-model-path'. I see that the flow branch is from Aug 4th.

pwaller avatar Jan 05 '22 17:01 pwaller

Oh, actually I made a mistake, the branch is net2net https://github.com/mehdidc/feed_forward_vqgan_clip/tree/net2net

mehdidc avatar Jan 05 '22 21:01 mehdidc

Merged now into master; explanations are in the README, where I refer to these as "priors".

mehdidc avatar Jul 13 '22 23:07 mehdidc