
CLIP-guided-diffusion updates

Open · afiaka87 opened this issue · 5 comments

Katherine has released a better notebook for CLIP-guided diffusion. Output on a P100 is quite slow, but the results can be very good. I've put the new notebook in my current repo as the "HQ" version.

Is there any chance of using these concepts for diffusion in a similar way? The main issue I'm seeing is that the output from guided-diffusion is a full 256x256 RGB image rather than a 16x16 or 32x32 grid of patches. It's also a much larger checkpoint than the VQGAN, and diffusion is inherently tough to reason about in my experience.

https://github.com/afiaka87/clip-guided-diffusion

afiaka87 · Jul 27 '21 17:07

Here is a link to the original notebook - which may include updates while the method is refined: https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj

afiaka87 · Jul 27 '21 17:07

Thanks for the links, I will give it a try. It would definitely be great to try the same idea there.

mehdidc · Jul 27 '21 18:07

"diffusion is sort of inherently tough to reason about in my experience." Yes, one difficulty I see here is that if we apply the same kind of idea on diffusion, we would need to backprop through a huge chain (1000 steps in the paper, but I also saw a section about 'optimizing' the steps schedule at test/sampling time, and they could use a different schedule at sampling time from training ) which means it would be hard to optimize

mehdidc · Jul 27 '21 18:07

Another thing is that diffusion models are already iterative, so the "feed forward" aspect wouldn't apply here; we would still need to go through all the timesteps to generate the image. The feed-forward model could, however, make sampling much faster: we could imagine using it to "jump" to a timestep directly from the input prompt, then continue the sampling from there to generate the final image, in which case it would still be helpful. So basically the feed-forward model would take an input prompt and generate a blurry image with the general aspect of the final image, and then we would use the diffusion model for the rest of the steps to add the details.
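A minimal PyTorch sketch of this "jump to a timestep, then refine" idea, assuming a DDPM-style noise schedule; `feed_forward`, `diffusion.alphas_cumprod`, and `diffusion.reverse_step` are illustrative placeholders, not this repo's or guided-diffusion's actual API:

```python
import torch

@torch.no_grad()
def jump_then_refine(feed_forward, diffusion, text_emb, t_start=250):
    # 1. The feed-forward model maps the prompt embedding to a coarse/blurry x0_hat.
    x0_hat = feed_forward(text_emb)

    # 2. Push x0_hat forward to timestep t_start with the closed-form forward process:
    #    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    alpha_bar = diffusion.alphas_cumprod[t_start]
    eps = torch.randn_like(x0_hat)
    x_t = alpha_bar ** 0.5 * x0_hat + (1 - alpha_bar) ** 0.5 * eps

    # 3. Run only the remaining reverse steps (t_start .. 0) instead of all 1000.
    for t in reversed(range(t_start)):
        x_t = diffusion.reverse_step(x_t, t)  # placeholder for one p_sample-style step
    return x_t
```

The choice of `t_start` trades speed against how much the diffusion model is allowed to change the feed-forward output.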

mehdidc · Jul 27 '21 18:07

> Another thing is that diffusion models are already iterative, so the "feed forward" aspect wouldn't apply here; we would still need to go through all the timesteps to generate the image. The feed-forward model could, however, make sampling much faster: we could imagine using it to "jump" to a timestep directly from the input prompt, then continue the sampling from there to generate the final image, in which case it would still be helpful. So basically the feed-forward model would take an input prompt and generate a blurry image with the general aspect of the final image, and then we would use the diffusion model for the rest of the steps to add the details.

Interesting - this is a notion I asked Katherine about in a Discord discussion. She seemed to think that finetuning the diffusion checkpoints would perhaps be easier. @crowsonkb, I'd love your input/corrections here, btw.

I still haven't taken a deep dive into the guided-diffusion codebase itself, but CLIP-guided diffusion is indeed riddled with the same issues that are present with the VQGAN, and there are plenty of efforts to resolve them. One idea was to finetune the guided-diffusion checkpoint to introduce a bit more diversity than just ImageNet. I think that's the biggest issue: it's heavily biased towards the photos in ImageNet, and towards photorealism more generally. Anecdotally, the more abstract a caption you use with clip-guided-diffusion, the less likely you are to get a matching result.

Anyway, what do you think of the feasibility of using a transformer/MLP-Mixer to learn good starting-timestep embeddings for a given caption? If it's not too difficult, it might be worth pursuing.
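For illustration, training such a predictor could look roughly like the sketch below; `predictor`, `clip_model`, and `loader` are assumptions, not existing code. The idea is just to regress a coarse image from the frozen CLIP text embedding and let the diffusion model add the detail, as in the "jump" idea above.

```python
import torch
import torch.nn.functional as F

def train_epoch(predictor, clip_model, loader, optimizer, device="cuda"):
    predictor.train()
    for images, tokenized_captions in loader:
        images = images.to(device)
        with torch.no_grad():
            # Frozen CLIP text encoder provides the caption embedding.
            text_emb = clip_model.encode_text(tokenized_captions.to(device)).float()

        # The transformer / MLP-Mixer maps the embedding to a coarse 256x256 RGB image;
        # plain MSE is enough here since the diffusion model adds the detail later.
        coarse = predictor(text_emb)
        loss = F.mse_loss(coarse, images)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```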

If you are approaching this from the replication angle, it's certainly worthwhile to experiment on this (CLIP-guided diffusion, specifically), because they seem to make some claims in their GitHub repo about how doing so is impractical. I think crowsonkb's work has already shown that not to be true, however.

afiaka87 · Aug 02 '21 18:08