diffusers [WIP] VQ-diffusion

Open williamberman opened this issue 1 year ago • 8 comments

NOTE: updating PR description and moving verification code to notebook

Porting the VQ-diffusion VQVAE for the ITHQ dataset to diffusers.

Add convert_vq_diffusion_to_diffusers.py script: This script initially only converts the VQVAE to diffusers. It will be updated to convert the whole model.

Add placeholder VQDiffusionPipeline: The VQDiffusionPipeline is added as a placeholder to wrap the vqvae so it can be used in the convert_vq_diffusion_to_diffusers.py script to save the ported model.

Add ConvAttentionBlock: The VQVAE used for ITHQ in VQ-diffusion uses a slightly different attention block than the one already in diffusers. The ConvAttentionBlock uses torch.nn.Conv2d's for its linear layers as opposed to torch.nn.Linear's. There are a few other minor discrepancies between the two attention blocks.

Add option to specify the dimension of embeddings to VQModel: By default, VQModel sets the dimension of the embeddings to the number of latent channels. The VQ-diffusion VQVAE for ITHQ has a smaller embedding dimension (128) than its number of latent channels (256).
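
To illustrate why the embedding dimension needs to be configurable separately, here is a minimal NumPy sketch of a quantization step in which 256-channel latents are projected down to 128 dimensions before the nearest-codebook lookup. The function name `quantize`, the random projection matrix (standing in for a 1x1 convolution), and the codebook size of 16 are assumptions made for illustration; this is not the VQModel implementation.

```python
import numpy as np

def quantize(latents, projection, codebook):
    """Map latents of shape (N, latent_channels) to their nearest codebook rows.

    projection: (latent_channels, embed_dim) matrix, standing in for a 1x1 conv.
    codebook:   (num_embeddings, embed_dim) matrix of embedding vectors.
    Returns the chosen indices (N,) and the quantized vectors (N, embed_dim).
    """
    projected = latents @ projection  # (N, embed_dim)
    # Squared Euclidean distance between every projected latent and every code.
    distances = ((projected[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = distances.argmin(axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)
latent_channels, embed_dim, num_embeddings = 256, 128, 16  # 16 is illustrative only

latents = rng.normal(size=(4, latent_channels))
projection = rng.normal(size=(latent_channels, embed_dim))
codebook = rng.normal(size=(num_embeddings, embed_dim))

indices, quantized = quantize(latents, projection, codebook)
print(indices.shape, quantized.shape)  # (4,) (4, 128)
```

If the codebook were forced to have 256-dimensional rows, the lookup above would not be able to load the published 128-dimensional embeddings, which is the mismatch the new option addresses.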

Testing the ported VQVAE

To test the model, we can encode and reconstruct images in both repositories.

Run the original VQVAE

Run the following in the root directory of https://github.com/microsoft/VQ-Diffusion. I did this in a separate conda environment. To run the install_req.sh script, I had to change its line endings to Unix.

# Download ITHQ VQVAE weights
$ wget https://facevcstandard.blob.core.windows.net/v-zhictang/Improved-VQ-Diffusion_model_release/ithq_vqvae.pth\?sv\=2020-10-02\&st\=2022-05-30T15%3A17%3A18Z\&se\=2030-05-31T15%3A17%3A00Z\&sr\=b\&sp\=r\&sig\=1jVavHFPpUjDs%2FTO1V3PTezaNbPp2Nx8MxiWI7y6fEY%3D -O OUTPUT/pretrained_model/taming_dvae/ithq_vqvae.pth

# Download Image
$ wget https://news.artnet.com/app/news-upload/2019/01/Cat-Photog-Feat-256x256.jpg -O cat.jpg

Run the following python snippet to reconstruct ./cat.jpg to ./cat-reconstructed.tiff

from image_synthesis.modeling.codecs.image_codec.ema_vqvae import PatchVQVAE
import PIL.Image
import numpy as np
import torch

input_file_name = "./cat.jpg"
output_file_name = "./cat-reconstructed.tiff"

image = PIL.Image.open(input_file_name).convert("RGB")
image = np.array(image)[None].transpose(0, 3, 1, 2)
image = torch.from_numpy(image)

vqvae = PatchVQVAE(trainable=False, token_shape=[32, 32])

encoded = vqvae.get_tokens(image)['token']

reconstructed = vqvae.decode(encoded)[0].round().cpu().numpy().astype("uint8")

PIL.Image.fromarray(reconstructed.transpose(1, 2, 0)).save(output_file_name, format="tiff", compression=None)

print(f"reconstructed image written to: {output_file_name}")
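
The snippet above moves between the height-width-channel layout that PIL produces and the batch-channel-height-width layout the model expects. As a small sketch of just that step, using a tiny made-up array instead of a real image:

```python
import numpy as np

# A stand-in for np.array(pil_image): height x width x channels, uint8.
hwc = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

# Add a batch axis, then move channels in front of height and width: NHWC -> NCHW.
nchw = hwc[None].transpose(0, 3, 1, 2)

# Drop the batch axis and move channels last again: CHW -> HWC.
restored = nchw[0].transpose(1, 2, 0)

print(hwc.shape, nchw.shape, restored.shape)  # (2, 3, 3) (1, 3, 2, 3) (2, 3, 3)
```

The round trip is lossless, so any difference between the input and the reconstruction comes from the model rather than from the layout changes.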

Run the ported VQVAE and compare

Run the following in the root directory of diffusers.

# Download the checkpoint
$ wget "https://facevcstandard.blob.core.windows.net/v-zhictang/Improved-VQ-Diffusion_model_release/ithq_vqvae.pth?sv=2020-10-02&st=2022-05-30T15%3A17%3A18Z&se=2030-05-31T15%3A17%3A00Z&sr=b&sp=r&sig=1jVavHFPpUjDs%2FTO1V3PTezaNbPp2Nx8MxiWI7y6fEY%3D" -O ithq_vqvae.pth

# Download the config
# NOTE that in VQ-diffusion the documented file is `configs/ithq.yaml` but the target class 
# `image_synthesis.modeling.codecs.image_codec.ema_vqvae.PatchVQVAE`
# loads `OUTPUT/pretrained_model/taming_dvae/config.yaml`
$ wget https://raw.githubusercontent.com/microsoft/VQ-Diffusion/main/OUTPUT/pretrained_model/taming_dvae/config.yaml -O ithq_vqvae.yaml

# Download Image
$ wget https://news.artnet.com/app/news-upload/2019/01/Cat-Photog-Feat-256x256.jpg -O cat.jpg

# run the convert script
$ python ./scripts/convert_vq_diffusion_to_diffusers.py \
    --checkpoint_path ./ithq_vqvae.pth \
    --original_config_file ./ithq_vqvae.yaml \
    --dump_path ./vqdiffusion_vqvae_pretrained \
    --only-vqvae

Run the following python snippet to reconstruct with the ported model and to compare against the reconstructed image from the original model. Replace compare_reconstructed_against_file_name with the path to the reconstructed image from the original model.

Running on my CPU, 22 pixel values (counted per channel) differ, each by at most 1.

from diffusers.pipelines import VQDiffusionPipeline
import PIL.Image
import numpy as np

input_file_name = "./cat.jpg"
output_file_name = "./cat-reconstructed.tiff" 
compare_reconstructed_against_file_name = "<path to reconstructed image from microsoft/VQ-Diffusion>"

image = PIL.Image.open(input_file_name).convert("RGB")

# must match the --dump_path passed to the convert script
vq_diffusion_pipeline = VQDiffusionPipeline.from_pretrained('./vqdiffusion_vqvae_pretrained')

encoded = vq_diffusion_pipeline.encode(image)
reconstructed = vq_diffusion_pipeline.decode(encoded)

reconstructed[0].save(output_file_name, format='tiff', compression=None)

print(f"reconstructed image written to: {output_file_name}")

reconstructed = np.array(reconstructed[0]).transpose(2, 0, 1)
compare_reconstructed_against = np.array(PIL.Image.open(compare_reconstructed_against_file_name).convert("RGB")).transpose(2, 0, 1)

# cast to larger signed ints so the subtraction cannot wrap around
diff = reconstructed.astype('int32') - compare_reconstructed_against.astype('int32')

if (diff == 0).all():
    print(f"reconstructed {input_file_name} is equal to {compare_reconstructed_against_file_name}")
else:
    # only computed when there is a difference, so np.max never sees an empty array
    num_different = np.count_nonzero(diff)
    max_diff = np.max(np.abs(diff))
    print(f"reconstructed {input_file_name} differs from {compare_reconstructed_against_file_name}")
    print(f"number pixels different: {num_different}")
    print(f"pixels differ by at max: {max_diff}")
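
The cast to a wider signed type in the comparison above matters because image arrays are uint8, and uint8 subtraction wraps around modulo 256. A short sketch with made-up pixel values shows the effect:

```python
import numpy as np

a = np.array([10, 200], dtype=np.uint8)
b = np.array([11, 199], dtype=np.uint8)

wrapped = a - b  # uint8 arithmetic wraps modulo 256
signed = a.astype(np.int32) - b.astype(np.int32)

print(wrapped)               # [255   1]
print(signed)                # [-1  1]
print(np.abs(signed).max())  # 1
```

Without the cast, a difference of -1 would be reported as 255, which would make a nearly identical reconstruction look badly wrong.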

Original cat: [image: cat]

Reconstructed cat: [image: cat-reconstructed]

Sep 27 '22 19:09 williamberman

The documentation is not available anymore as the PR was closed or merged.

Sep 27 '22 19:09 HuggingFaceDocBuilderDev

Wooow - this looks amazing! @patil-suraj mind taking a look here?

Sep 29 '22 17:09 patrickvonplaten

Thanks! Would love to know if the minimum unit to merge is the completed pipeline or if smaller chunks like this are acceptable to merge.

Sep 29 '22 18:09 williamberman

Hey @williamberman and @345ishaan,

Sorry for the delay - I have more time going forward and think we can merge this model by next week!

It's a super nice PR. Extremely easy to understand and to follow - thanks a bunch!

I've tweaked the PR a bit to make it more lightweight :-) We don't really need a new attention layer: Conv2D layers with a 1x1 kernel, when used for attention, behave just like linear layers, for which an attention class already exists.

That's why I removed the Conv2DAttention class and changed the conversion script slightly so that your script still works as expected. I'm getting visually identical reconstruction images, so I think we're good with the linear attention layer that was already implemented (could you maybe double check?).
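
The equivalence described here can be checked numerically. The following is a minimal NumPy sketch, assuming the convolutions have a 1x1 kernel; the helper names `conv1x1` and `linear` and the tensor sizes are hypothetical and are not taken from the conversion script.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to x of shape (C_in, H, W).

    weight has the Conv2d layout (C_out, C_in, 1, 1); bias has shape (C_out,).
    """
    c_out = weight.shape[0]
    _, height, width = x.shape
    out = np.empty((c_out, height, width))
    for i in range(height):
        for j in range(width):
            out[:, i, j] = weight[:, :, 0, 0] @ x[:, i, j] + bias
    return out

def linear(tokens, weight, bias):
    """Apply a linear layer to tokens of shape (H * W, C_in); weight is (C_out, C_in)."""
    return tokens @ weight.T + bias

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))               # (C_in, H, W)
conv_weight = rng.normal(size=(6, 8, 1, 1))  # Conv2d-style weight
bias = rng.normal(size=6)

conv_out = conv1x1(x, conv_weight, bias)     # (C_out, H, W)

# Flatten the spatial grid into a token sequence and drop the 1x1 kernel axes.
tokens = x.reshape(8, -1).T                  # (H * W, C_in)
linear_weight = conv_weight[:, :, 0, 0]      # (C_out, C_in)
linear_out = linear(tokens, linear_weight, bias)  # (H * W, C_out)

print(np.allclose(conv_out.reshape(6, -1).T, linear_out))  # True
```

Under that assumption, converting a checkpoint only requires dropping the two trailing kernel axes from each convolution weight, with the biases carried over unchanged.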

Now I think in a next step we can implement the U-Net and scheduler for the forward pass, no? Do you need help/guidance here?

Oct 07 '22 22:10 patrickvonplaten

Thanks! Pinged in discord as well but the model has a transformer (just the encoder iirc) for the reverse diffusion process instead of a unet. I have the transformer ported on another branch. I think the open question is would you prefer that on this PR or to merge this PR first and then merge the transformer on a separate PR?

Oct 07 '22 22:10 williamberman

Hey @williamberman,

Great to hear that you already have it on a branch. Could you maybe add it directly to this PR? Maybe in a next step we could verify that a forward pass through the transformer (replacement of the unet) gives identical results to the official implementation. If that works, we can integrate the scheduler and then in a last step make a whole denoising process work :-)

Overall, everything should ideally be in this PR. Since VQ-diffusion will be one of our first community pipeline contributions, there are lots of new things in this PR and I'm more than happy to help you with it (don't hesitate to ping me :-))

Oct 10 '22 10:10 patrickvonplaten

SG!

Follow up:

Merge transformer into this branch.
Add the script I've been using to test the transformer to the PR description.
Merge CLIP in pipeline/convert script for text embeddings into this branch.
Add script for testing CLIP to the PR description.
Add initial skeleton for scheduler/pipeline to this branch (I also have this on the branch with the transformer).

Oct 10 '22 16:10 williamberman

This sounds like a great plan!

Oct 11 '22 18:10 patrickvonplaten

PR description is updated to reflect progress on merging full model. Notebook which compares outputs from autoencoder, transformer, and text embedder is here and linked in PR description https://github.com/williamberman/vq-diffusion-notebook. Once the scheduler is complete, will also add it to the notebook :)

Oct 12 '22 16:10 williamberman

@williamberman let me know if you'd like me to review already now or better once the scheduler is integrated as well :-)

Oct 14 '22 17:10 patrickvonplaten

Great progress!

Oct 14 '22 17:10 patrickvonplaten

Let’s wait until the scheduler is integrated. I cleaned some non scheduler components up while working on it that I’d like to add to this branch first :)

Oct 14 '22 17:10 williamberman

The working scheduler is on this branch now. Going to do some more cleaning and docs before requesting formal review :)

Oct 20 '22 10:10 williamberman

@patrickvonplaten Ok, I think this is ready for reviews! I updated the notebook to compare latents during the denoising process and to compare the final images. The current model is definitely not pixel for pixel the exact same as the original model, but the output images are very similar (see example in notebook). I'm not sure what the diffusers standard for replicating models is here, but I'm happy to do some more digging on what could be causing the differences. For example, it looks like the vendored CLIP in the original repo produces slightly different embeddings than the CLIP from transformers that we use here.
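
One way to put a number on such differences is to report the largest absolute deviation and the cosine similarity between the two outputs. The sketch below uses made-up values in place of real embeddings, and the helper name `compare` is hypothetical; it is not the code from the notebook.

```python
import numpy as np

def compare(reference, candidate):
    """Return the max absolute difference and the cosine similarity of two arrays."""
    reference = np.asarray(reference, dtype=np.float64).ravel()
    candidate = np.asarray(candidate, dtype=np.float64).ravel()
    max_abs_diff = np.abs(reference - candidate).max()
    cosine = reference @ candidate / (
        np.linalg.norm(reference) * np.linalg.norm(candidate)
    )
    return max_abs_diff, cosine

# Made-up values standing in for embeddings from the two implementations.
original = np.array([0.10, -0.50, 0.30])
ported = np.array([0.11, -0.50, 0.29])

max_abs_diff, cosine = compare(original, ported)
print(f"max abs diff: {max_abs_diff:.4f}, cosine similarity: {cosine:.6f}")
```

The same helper could be applied to latents at each denoising step to see where the two implementations start to drift apart.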

Going to take a look at what diffusers requires in terms of tests and add what’s necessary but don’t think that should be a blocker from starting some reviews :)

Original: [image]

Diffusers port: [image]

Oct 24 '22 20:10 williamberman

Very cool! Will try to review this week :-)

Oct 26 '22 12:10 patrickvonplaten

This PR is already in great shape!

The main final change would be to reuse the existing modules from attention.py as much as possible (by adapting them and adding more configurable parameters). This is quite difficult, especially given that this code is still quite new. If you like, I could go into the PR here and help you a bit :-)

Apart from this I think we can merge this PR this week (sorry to be so late here!)

Could you also add the scheduler and pipeline to the docs under:

https://github.com/huggingface/diffusers/tree/main/docs/source/api/pipelines

https://github.com/huggingface/diffusers/blob/main/docs/source/api/schedulers.mdx ?

Super exciting and very nice work. I think we could have this ready for the release on Thursday - happy to help get this merged on Wednesday :-) (cc @anton-l @pcuenca @patil-suraj )

Thank you! Yep, I'm happy to do the work to merge into the existing attention components. Should be doable before Thursday :)

Oct 31 '22 18:10 williamberman

@patrickvonplaten Ok, I merged the changes into attention.py so the model uses SpatialTransformer. I tried to follow the existing coding style as best as possible. Let me know if you want anything changed there :)

I also added tests for the changed models, the scheduler, and the whole pipeline. Also lmk if I missed another part of the codebase I'm supposed to add tests to.

It looks like the failing tests are from a different PR that was merged recently

Nov 02 '22 07:11 williamberman

Hey @williamberman,

You've really done an amazing job here! Very impressed by how you were able to add such a new complex model into the existing API!

The conversion script and your notebook are very easy to follow :-)

I've uploaded the ithq model now to the microsoft org here: https://huggingface.co/microsoft/vq-diffusion-ithq and added a slow test that makes sure the model works as expected. Besides that I've done some minor naming changes.
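
For reference, loading the uploaded checkpoint could look roughly like the sketch below. It assumes a diffusers version that still exports `VQDiffusionPipeline` and that the call accepts `num_inference_steps`; the wrapper function `generate` and its defaults are hypothetical.

```python
def generate(prompt, model_id="microsoft/vq-diffusion-ithq", num_inference_steps=100):
    """Generate one image for `prompt` with the VQ-Diffusion pipeline.

    The import is kept inside the function so that defining it does not
    require diffusers to be installed or the weights to be downloaded.
    """
    from diffusers import VQDiffusionPipeline

    pipeline = VQDiffusionPipeline.from_pretrained(model_id)
    output = pipeline(prompt, num_inference_steps=num_inference_steps)
    return output.images[0]  # a PIL image
```

Calling `generate("teddy bear playing in the pool").save("teddy_bear.png")` would download the weights on first use and write the result to disk.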

The failing tests are unrelated and we can merge this PR more or less already as is. If ok for you, I would do some final renaming changes tomorrow morning (Paris time) to make it fit a bit better with existing configuration names (will have to sync with @patil-suraj @anton-l and @pcuenca ) and then we can merge this to be in the next release IMO.

@patil-suraj @pcuenca @anton-l could you maybe already review this PR? IMO, besides some re-naming it's ready! I've also made sure that all existing slow tests are passing!

@williamberman if you're interested, we could do the following follow-up projects to promote this model a bit more:

Write a short blog post about this model and put it on https://huggingface.co/blog (if you want, you could open a blog PR here: https://github.com/huggingface/blog) - I think the community might be really interested in finding out the difference between latent diffusion models and this vq-diffusion model :-)
Make the model card a bit nicer: https://huggingface.co/microsoft/vq-diffusion-ithq (if you want, you could open a PR on the repo)
Send me an email to [email protected] and I could connect you to the authors of the vq-diffusion model (we could sync with them a bit on promoting this integration on Twitter/LinkedIn if you want :-) )

Obviously no need to do any of the above points if you don't want or are too busy! Regarding this PR, I think we can merge it tomorrow morning!

Really great job here :rocket:

Nov 02 '22 19:11 patrickvonplaten

Awesome, all sounds good!

I think a blog post sounds great, sent you an email :)

Nov 02 '22 22:11 williamberman

@williamberman I am wondering did you try training it or just verified in infer mode?

Nov 03 '22 05:11 345ishaan

Just inference using the weights microsoft published. Training would have been a good amount more work 😅

Nov 03 '22 06:11 williamberman
