
[Community Pipeline] UnCLIP image / text interpolations

[Open] patrickvonplaten opened this issue 2 years ago • 24 comments

Model/Pipeline/Scheduler description

Copied from https://github.com/huggingface/diffusers/pull/1858:

UnCLIP / Karlo: https://huggingface.co/spaces/kakaobrain/karlo gives some very nice and precise results for image generation and can strongly outperform Stable Diffusion in some cases - see: https://www.reddit.com/r/StableDiffusion/comments/zshufz/karlo_the_first_large_scale_open_source_dalle_2/

Another extremely interesting aspect of DALL-E 2 is its ability to interpolate between text and/or image embeddings. See e.g. Section 3 of the DALL-E 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf . This PR now allows passing text embeddings and image embeddings directly, which should enable those tasks!

I think we could create a super cool community pipeline. The pipeline could automatically create interpolations between two text prompts, and similarly we could create one that interpolates between two images.

In terms of design, to stay as efficient as possible, the following would make sense:

    1. The user passes two text prompts and a num_interpolations input.
    2. The pipeline then embeds those two text prompts into the text embeddings x_0 and x_N, and num_interpolations intermediate embeddings x_1, x_2, ..., x_{N-1} are created using the slerp function (see the sketch below).
    3. We then have num_interpolations + 2 text embeddings that should be passed as a batch through the model to create a nice interpolation of images.
    4. It'd be important to make use of enable_cpu_offload() to save memory.
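
For concreteness, here is a minimal sketch of what step 2 could look like. The `slerp` helper and the embedding shapes are assumptions for illustration, not part of the diffusers API:

```python
import torch


def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor) -> torch.Tensor:
    """Spherical linear interpolation between two embedding vectors."""
    v0_unit = v0 / v0.norm()
    v1_unit = v1 / v1.norm()
    omega = torch.acos((v0_unit * v1_unit).sum().clamp(-1.0, 1.0))
    sin_omega = torch.sin(omega)
    if sin_omega.abs() < 1e-6:
        # vectors are nearly parallel: fall back to linear interpolation
        return (1.0 - t) * v0 + t * v1
    return (torch.sin((1.0 - t) * omega) * v0 + torch.sin(t * omega) * v1) / sin_omega


# x_0 and x_N are the embeddings of the two prompts (placeholder tensors here);
# build the full batch of num_interpolations + 2 embeddings for a single forward pass.
num_interpolations = 5
x_0, x_N = torch.randn(768), torch.randn(768)
ts = torch.linspace(0.0, 1.0, num_interpolations + 2)
batch = torch.stack([slerp(float(t), x_0, x_N) for t in ts])  # (num_interpolations + 2, 768)
```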

It's probably easier to start with the UnCLIPImageInterpolationPipeline, since image embeddings are just a single 1-D vector, whereas text embeddings consist of two latent vectors.

I'd be more than happy to help if someone is interested in giving this a try - I think it'll make for some super cool demos.

Open source status

  • [X] The model implementation is available
  • [X] The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

No response

patrickvonplaten avatar Dec 30 '22 11:12 patrickvonplaten

Hi @patrickvonplaten and @williamberman, are you working on this, or can I pick it up?

Abhinay1997 avatar Jan 01 '23 15:01 Abhinay1997

@Abhinay1997 feel free to pick it up! We're more than happy to help if needed ☺️

patrickvonplaten avatar Jan 01 '23 15:01 patrickvonplaten

This makes sense at the top level. But just so that my understanding is correct: we want a community pipeline that can interpolate between prompts/images like the StableDiffusionInterpolation, but using the UnCLIPPipeline (a.k.a. DALL-E 2).

The interpolation also makes sense: we generate the embeddings for the two prompts, say p1 and p2. They would correspond to x_0 and x_N of the interpolation sequence, and using slerp I would interpolate between them for N outputs in total per pair of prompts/images.

Abhinay1997 avatar Jan 05 '23 01:01 Abhinay1997

That's exactly right @Abhinay1997 :-)

patrickvonplaten avatar Jan 10 '23 15:01 patrickvonplaten

@patrickvonplaten sorry it took so long.

The UnCLIPTextInterpolation pipeline is actually more straightforward, imo. For the UnCLIPImageInterpolation pipeline, wouldn't we need the CLIP model that was used for training the UnCLIPPipeline?

I ask because I see the CLIP model in the original codebase but not in the Hugging Face Hub model.

I am working on the text interpolation right now. ETA: 24th Jan.

Abhinay1997 avatar Jan 20 '23 15:01 Abhinay1997

cc @williamberman here

patrickvonplaten avatar Jan 23 '23 06:01 patrickvonplaten

Thanks for your work @Abhinay1997!

We do still use CLIP for the unCLIP pipeline(s). Note that we just import it from transformers.

In the text to image pipeline, we use the text encoder for encoding the prompt https://github.com/huggingface/diffusers/blob/ac3fc649066df9d347df519b3d0877d41fb847b1/src/diffusers/pipelines/unclip/pipeline_unclip.py#L70-L71

In the image variation pipeline, we use the image encoder for encoding the input image and the text encoder for encoding an empty prompt. Note that we also optionally allow passing image embeddings directly to the pipeline to skip encoding an input image. https://github.com/huggingface/diffusers/blob/ac3fc649066df9d347df519b3d0877d41fb847b1/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py#L75-L78

The image interpolation pipeline should work similarly to the image variation pipeline in that it should be passed either the two sets of images, which would be encoded via CLIP, or the two sets of pre-encoded latents. Note that, as in the image variation pipeline, we would have to encode the empty text prompt for the image interpolation pipeline.
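
To make that flow concrete, here is a rough sketch of encoding two images with the image variation pipeline's CLIP image encoder and decoding slerped embeddings. The checkpoint name and the `image_embeddings` argument are taken from the existing image variation pipeline; `slerp` is the helper sketched earlier in this thread, and the file names are placeholders:

```python
import torch
from diffusers import UnCLIPImageVariationPipeline
from PIL import Image

pipe = UnCLIPImageVariationPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha-image-variations", torch_dtype=torch.float16
).to("cuda")

# Encode the two endpoint images with the pipeline's CLIP image encoder.
images = [Image.open("start.png"), Image.open("end.png")]
pixel_values = pipe.feature_extractor(images=images, return_tensors="pt").pixel_values
emb = pipe.image_encoder(pixel_values.to("cuda", torch.float16)).image_embeds  # (2, dim)

# Slerp between the two image embeddings and decode one frame per step.
frames = []
for t in torch.linspace(0.0, 1.0, 7):
    z = slerp(float(t), emb[0], emb[1]).unsqueeze(0)
    frames.append(pipe(image_embeddings=z).images[0])
```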

LMK if that makes sense!

williamberman avatar Jan 23 '23 21:01 williamberman

Hi @williamberman, thank you for the details. It makes sense. I was under the impression that I needed to use the exact CLIP checkpoint whose embeddings the unCLIP decoder was trained to invert, so I got confused.

Will share the work in progress notebooks soon.

Abhinay1997 avatar Jan 25 '23 09:01 Abhinay1997

Hi @williamberman, can you review the UnCLIPTextInterpolation notebook when you have time?

Questions:

  1. When prompts have different lengths, what attention mask should be used for the intermediate interpolation steps?
  2. Will I be able to instantiate this community pipeline using DiffusionPipeline(custom_pipeline='....')? I am inheriting from both DiffusionPipeline [community pipeline requirement] and UnCLIPPipeline [to be able to use UnCLIP modules].

Abhinay1997 avatar Jan 31 '23 16:01 Abhinay1997

Great work so far @Abhinay1997 !

> Hi @williamberman, can you review the UnCLIPTextInterpolation notebook when you have time?
>
> Questions:
>
> 1. When prompts have different lengths, what attention mask should be used for the intermediate interpolation steps?

That's a good point, and I don't know off the top of my head or from googling around. I would recommend for now just using the mask of the longer prompt. cc @patrickvonplaten
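
For what it's worth, with both prompts padded to the same length, "the mask of the longer prompt" amounts to an element-wise max of the two attention masks. A tiny sketch with made-up masks:

```python
import torch

# attention masks of the two prompts, already padded to a common length
mask_a = torch.tensor([1, 1, 1, 0, 0, 0])  # shorter prompt
mask_b = torch.tensor([1, 1, 1, 1, 1, 0])  # longer prompt

# intermediate interpolation steps attend wherever either prompt has real tokens
interp_mask = torch.maximum(mask_a, mask_b)  # tensor([1, 1, 1, 1, 1, 0])
```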

> 2. Will I be able to instantiate this community pipeline using DiffusionPipeline(custom_pipeline='....')? I am inheriting from both DiffusionPipeline [community pipeline requirement] and UnCLIPPipeline [to be able to use UnCLIP modules].

We actually want all pipelines to be completely independent, so please do not inherit from the UnCLIPPipeline :)

williamberman avatar Feb 01 '23 18:02 williamberman

@williamberman, just to clarify: can I still import UnCLIPPipeline inside the methods and use it for generation?

Abhinay1997 avatar Feb 01 '23 18:02 Abhinay1997

Nope!

We want all pipelines to be as self-contained as possible. If any methods are exactly the same, we have the # Copied from mechanism (which we should document a bit better), which lets you copy and paste the method and keeps the two methods in sync in CI.
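
For illustration, the mechanism is just a comment placed directly above the duplicated method; CI then checks that the body stays identical to the referenced original. The method path below is an example of the convention, with the body elided:

```python
# Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline._encode_prompt
def _encode_prompt(self, prompt, device, num_images_per_prompt, do_classifier_free_guidance):
    ...
```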

We've articulated some of our rationale in the philosophy doc

https://github.com/huggingface/diffusers/blob/main/docs/source/en/conceptual/philosophy.mdx#pipelines https://github.com/huggingface/diffusers/blob/main/docs/source/en/conceptual/philosophy.mdx#tweakable-contributor-friendly-over-abstraction

williamberman avatar Feb 01 '23 18:02 williamberman

Thanks for the clarification @williamberman. Will update and make the PR for it soon.

Abhinay1997 avatar Feb 01 '23 18:02 Abhinay1997

@williamberman @patrickvonplaten Please find the PR for UnCLIPTextInterpolation: https://github.com/huggingface/diffusers/pull/2257

Also, what about the interpolation attention_masks? Any other thoughts on them? Using max for now, as suggested.

P.S.: Planning to complete UnCLIPImageInterpolation this week.

Abhinay1997 avatar Feb 06 '23 11:02 Abhinay1997

With the text interpolation pipeline merged, the image interpolation pipeline is still up for grabs!

williamberman avatar Feb 13 '23 06:02 williamberman

@williamberman I was planning to start on image interpolation too. Would that be okay?

Abhinay1997 avatar Feb 13 '23 06:02 Abhinay1997

Yes please! You're on a roll :)

williamberman avatar Feb 13 '23 06:02 williamberman

UnCLIP Text Interpolation Space: https://huggingface.co/spaces/NagaSaiAbhinay/unclip_text_interpolation_demo

Abhinay1997 avatar Feb 14 '23 07:02 Abhinay1997

So, I was re-reading the DALL-E 2 paper and found that their text interpolation is a little more involved: they interpolate on image embeddings using a normalised difference of the text embeddings of the two prompts. This produces much better results than my implementation. I'll update the text interpolation pipeline once the image interpolation is done.

Abhinay1997 avatar Feb 14 '23 13:02 Abhinay1997

Image interpolation is looking good. I'm getting results in line with DALL-E 2.

Notebook: https://colab.research.google.com/drive/1eN-oy3N6amFT48hhxvv02Ad5798FDvHd?usp=sharing

Results: [starry_to_dog interpolation]

Inputs: [starry_night image] [dogs image]

Will open a PR tomorrow.

Abhinay1997 avatar Feb 16 '23 16:02 Abhinay1997

Wow that's super cool 🔥

patrickvonplaten avatar Feb 16 '23 18:02 patrickvonplaten

Very much looking forward to the PR! Let's maybe also try to make a cool Space about this @apolinario @AK391 @osanseviero

patrickvonplaten avatar Feb 16 '23 18:02 patrickvonplaten

Hi @Abhinay1997, awesome work on the community pipeline! I opened a request for a community Space for it in our Discord: https://discord.com/channels/879548962464493619/1075849794519572490/1075849794519572490. You can join here: https://discord.gg/pEYnj5ZW, take the #collaborate role to check out the event, and write under one of the paper posts in the #making-demos forum.

AK391 avatar Feb 16 '23 18:02 AK391

Opened the PR for UnCLIPImageInterpolation: https://github.com/huggingface/diffusers/pull/2400

@williamberman @patrickvonplaten

Abhinay1997 avatar Feb 17 '23 16:02 Abhinay1997

While #2400 is under review, I wanted to share the basic outline for the UnCLIP text diff flow (a rough sketch follows the list):

  1. Take the original image x_0, generate the inverted noise x_T using DDIM inversion, and compute the image embedding z_img_0.
  2. Given a target prompt p_target and a caption for the original image p_start, compute the text embeddings z_txt_start and z_txt_target.
  3. Compute the text diff embedding, z_txt_diff = norm_diff(z_txt_start, z_txt_target).
  4. Compute the intermediate embedding z_inter = slerp(interp_value, z_img_0, z_txt_diff), where interp_value is linearly spaced in the interval [0.25, 0.5] (from the DALL-E 2 paper).
  5. Use the intermediate embeddings to generate the text diff images.
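
A minimal sketch of steps 3 and 4, assuming norm_diff is the L2-normalised embedding difference and slerp is the same helper sketched earlier in the thread; all tensors are placeholders:

```python
import torch


def norm_diff(z_start: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
    """Normalised text diff direction between the two prompt embeddings."""
    diff = z_target - z_start
    return diff / diff.norm()


def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor) -> torch.Tensor:
    """Spherical linear interpolation (same helper as in the earlier sketch)."""
    omega = torch.acos(torch.clamp(torch.dot(v0 / v0.norm(), v1 / v1.norm()), -1.0, 1.0))
    return (torch.sin((1 - t) * omega) * v0 + torch.sin(t * omega) * v1) / torch.sin(omega)


# placeholder CLIP embeddings; z_img_0 would come from encoding the original image
z_img_0 = torch.randn(768)
z_txt_start, z_txt_target = torch.randn(768), torch.randn(768)

z_txt_diff = norm_diff(z_txt_start, z_txt_target)

# interpolation weights linearly spaced in [0.25, 0.5], per the DALL-E 2 paper;
# each z_inter, together with the DDIM-inverted noise x_T, is decoded into a frame
z_inters = [slerp(float(t), z_img_0, z_txt_diff) for t in torch.linspace(0.25, 0.5, 5)]
```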

Abhinay1997 avatar Mar 03 '23 14:03 Abhinay1997

UnCLIP Image Interpolation demo space is up and running at https://huggingface.co/spaces/NagaSaiAbhinay/UnCLIP_Image_Interpolation_Demo

Do check it out!

Abhinay1997 avatar Mar 07 '23 18:03 Abhinay1997

Very cool Space 🔥

osanseviero avatar Mar 07 '23 19:03 osanseviero

Super cool space @Abhinay1997 - shared it on Reddit as well :-)

patrickvonplaten avatar Mar 08 '23 19:03 patrickvonplaten

Thanks @patrickvonplaten, @osanseviero !

Abhinay1997 avatar Mar 09 '23 02:03 Abhinay1997

Can we close this issue now?

sayakpaul avatar Mar 20 '23 03:03 sayakpaul