feed_forward_vqgan_clip

Not an issue - richer datasets

Open johndpope opened this issue 2 years ago • 7 comments

are you familiar with this https://twitter.com/e08477/status/1418440857578098691?s=21 ?

I want to do cityscape shots. Are you familiar with any relevant datasets? Can this repo help output higher quality images? Or does it help with the prompting?

johndpope avatar Jul 25 '21 05:07 johndpope

Hi, I was not aware of these, they are very beautiful! The repo is not meant to output higher-quality images (quality should be the same as the VQGAN-CLIP examples) or to help with prompting; it is meant to do the same thing without needing an optimization loop for each prompt, and it can also generalize to new prompts not seen during training. All you need is to collect or build a dataset of prompts and train the model on it; once that is done, you can generate images from new prompts in a single step (so no optimization loop). I will shortly upload pre-trained model(s) based on Conceptual Captions 12M prompts (https://github.com/google-research-datasets/conceptual-12m), if you would like to give it a try without re-training from scratch. Also, since you obtain a model at the end, you can additionally interpolate between the generated images of different prompts. I hope the goal of the repo is clearer.
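To make the difference concrete, here is a toy sketch of the feed-forward idea (the sizes and the two-layer mapper are purely illustrative, not the repo's actual architecture):

```python
# Toy sketch: a network maps a CLIP text embedding directly to a grid of VQGAN
# latents, so generation is a single forward pass instead of a per-prompt
# optimization loop. Sizes and layers here are made up for illustration only.
import torch
import torch.nn as nn

clip_dim, latent_dim, grid = 512, 256, 16      # illustrative sizes

mapper = nn.Sequential(                        # stand-in for the trained feed-forward model
    nn.Linear(clip_dim, 1024),
    nn.ReLU(),
    nn.Linear(1024, grid * grid * latent_dim),
)

text_embed = torch.randn(1, clip_dim)          # would come from CLIP's text encoder
z = mapper(text_embed).view(1, latent_dim, grid, grid)
# z would then be decoded by the (frozen) VQGAN decoder into an image -- no gradient steps needed.
print(z.shape)  # torch.Size([1, 256, 16, 16])
```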

mehdidc avatar Jul 26 '21 01:07 mehdidc

"so no optimization loop" - does that mean there's no 500x iterations to get a good looking image?

fyi - @nerdyRodent

johndpope avatar Jul 26 '21 01:07 johndpope

" does that mean there's no 500x iterations to get a good looking image?" Yes

mehdidc avatar Jul 26 '21 01:07 mehdidc

Following the tweet you mentioned above, here is an example with "deviantart, volcano": https://imgur.com/a/cYMsNo5 with a model currently being trained on conceptual captions 12m.

mehdidc avatar Jul 26 '21 01:07 mehdidc

@johndpope I added a bunch of pre-trained models if you want to give it a try

mehdidc avatar Jul 27 '21 15:07 mehdidc

I had a play with the 1.7 GB cc12m_32x1024 model. I couldn't get the high quality I was getting with VQGAN-CLIP - will keep trying, bumping the dimensions. Maybe the docs could use some pointers (256x256 / 512x512, etc.). One thing is clear: this can perform very quickly - perhaps there could be an effort to have this provide hot serving, where you could give it a new prompt while it runs as a service, almost in realtime, without turning off the engine so to speak. We talk about FPS - frames per second - could we see a VQPS? Roughly what I have in mind is sketched below.
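Something like this loop - load_model() and generate() are hypothetical placeholders here, not this repo's actual API:

```python
# Rough sketch of a "hot serving" loop: load the model once, then keep answering
# new prompts without restarting. load_model() and generate() are hypothetical
# placeholders, not functions from this repo.
import time

def load_model(path):
    ...  # load the pretrained feed-forward model once (placeholder)

def generate(model, prompt):
    ...  # one forward pass: prompt -> image (placeholder)

model = load_model("pretrained_models/cc12m_32x1024/model.th")
while True:
    prompt = input("prompt> ")
    if not prompt:
        break
    t0 = time.time()
    image = generate(model, prompt)
    print(f"generated in {time.time() - t0:.2f}s")  # the "VQPS" idea: generations per second
```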

Here's some images I turned out over the weekend - https://github.com/nerdyrodent/VQGAN-CLIP/issues/13

Observations: when I threw in a parameter, it was clearly identifiable in the output, e.g. "Los Angeles | 35mm": https://twitter.com/johndpope/status/1419352229031518209/photo/1

Los Angeles Album Cover https://twitter.com/johndpope/status/1419354082192412679/photo/1

This one didn't quite cut it, though:
python -u main.py test pretrained_models/cc12m_32x1024/model.th "los angeles album cover"

Another improvement for newbies: you could consider integrating the model downloads into the README, as done in https://github.com/nerdyrodent/VQGAN-CLIP/blob/5edb6a133944ee735025b8a92f6432d6c5fbf5eb/download_models.sh

johndpope avatar Jul 29 '21 05:07 johndpope

@johndpope have you considered re-embedding the outputs from the trained vitgan as CLIP image embeddings, and then using those as prompts for a "normal" VQGAN-CLIP optimization with a much higher learning rate than usual and fewer steps? That would allow you to use non-square dimensions.
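Something along these lines - a sketch using the OpenAI `clip` package, where "vitgan_output.png" is just a stand-in for an image produced by the feed-forward model:

```python
# Sketch of the re-embedding idea: take an image produced by the feed-forward model,
# encode it with CLIP's image encoder, and use that embedding (instead of a text
# embedding) as the target for a short, high-LR VQGAN-CLIP optimization run.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "vitgan_output.png" stands in for an image generated by the feed-forward model.
image = preprocess(Image.open("vitgan_output.png")).unsqueeze(0).to(device)
with torch.no_grad():
    target_embed = model.encode_image(image)
    target_embed = target_embed / target_embed.norm(dim=-1, keepdim=True)

# target_embed would then replace the text embedding in the usual VQGAN-CLIP loss,
# run for far fewer steps at a higher learning rate, and at whatever
# (non-square) resolution you like.
```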

Also - one of the other primary benefits of this approach is that if you'd like to finetune from one of the checkpoints, or even train your own from scratch, it can be relatively simple: all you need are some captions, which can be generated or typed out. You'll want to cover a largish corpus, but using something like the provided MIT states captions as a base should be a good start. Building the caption list itself is trivial - see the sketch below.
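For example, a caption list is just one prompt per line in a plain text file (check the repo's README for the exact training config/format it expects):

```python
# Build a simple captions file: one prompt per line.
# The subjects/styles here are just examples; the file format the trainer
# expects should be confirmed against the repo's README.
subjects = ["los angeles", "tokyo", "a small village"]
styles = ["35mm photo", "album cover", "aerial view", "deviantart"]

with open("my_captions.txt", "w") as f:
    for subject in subjects:
        for style in styles:
            f.write(f"{subject}, {style}\n")
```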

Thanks for the extra info. I'm a little busy today, but I think the README might need one or two more things, and possibly a Colab notebook specific to training (if we don't have that already) - that would make it easy to customize MIT states.

edit: realtime updates to your captions/display of rate of generations etc. may be outside of the scope of the project.

afiaka87 avatar Jul 29 '21 17:07 afiaka87