[Community Pipeline] UnCLIP image / text interpolations
Model/Pipeline/Scheduler description
Copied from https://github.com/huggingface/diffusers/pull/1858:
UnCLIP / Karlo: https://huggingface.co/spaces/kakaobrain/karlo gives some very nice and precise results when doing image generation and can strongly outperform Stable Diffusion in some cases - see: https://www.reddit.com/r/StableDiffusion/comments/zshufz/karlo_the_first_large_scale_open_source_dalle_2/
Another extremely interesting aspect of DALL-E 2 is its ability to interpolate between text and/or image embeddings. See e.g. section 3 of the DALL-E 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf . This PR now allows directly passing text embeddings and image embeddings, which should enable those tasks!
I think we could create a super cool community pipeline. The pipeline could automatically create interpolations between two text prompts, and similarly we could create one to do interpolations between two images.
In terms of design, to stay as efficient as possible, the following would make sense:
- The user passes two text prompts and a `num_interpolations` input.
- The pipeline then embeds those two text prompts into the text embeddings x_0 and x_N, and the `num_interpolations` intermediate embeddings x_1, x_2, ..., x_N-1 are created using the `slerp` function (see the sketch below this list).
- Then we have `num_interpolations` + 2 text embeddings that should be passed in a batch through the model to create a nice interpolation of images.
- It'd be important to make use of `enable_cpu_offload()` to save memory.
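A minimal sketch of that interpolation step (the `slerp` helper, variable names, and shapes are illustrative, not a fixed pipeline API):

```python
import torch

def slerp(val, low, high):
    """Spherical linear interpolation between two embedding tensors."""
    low_n = low / torch.norm(low, dim=-1, keepdim=True)
    high_n = high / torch.norm(high, dim=-1, keepdim=True)
    omega = torch.acos((low_n * high_n).sum(-1).clamp(-1.0, 1.0))
    so = torch.sin(omega)
    return (torch.sin((1.0 - val) * omega) / so).unsqueeze(-1) * low + (
        torch.sin(val * omega) / so
    ).unsqueeze(-1) * high

# x_0 and x_N would come from the CLIP text encoder; shapes here are illustrative.
x_0 = torch.randn(1, 768)
x_N = torch.randn(1, 768)
num_interpolations = 5

# num_interpolations + 2 embeddings (both endpoints included), batched together.
vals = torch.linspace(0.0, 1.0, num_interpolations + 2)
batch = torch.cat([slerp(v, x_0, x_N) for v in vals], dim=0)  # (num_interpolations + 2, 768)
```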
It's probably easier to start with the `UnCLIPImageInterpolationPipeline`, since image embeddings are just a single 1-d vector, whereas for text embeddings two latent vectors are used.
Would be more than happy to help if someone is interested in giving this a try - think it'll make for some super cool demos.
Open source status
- [X] The model implementation is available
- [X] The model weights are available (Only relevant if addition is not a scheduler).
Provide useful links for the implementation
No response
Hi @patrickvonplaten and @williamberman, are you working on this, or can I pick it up?
@Abhinay1997 feel free to pick it up! We're more than happy to help if needed ☺️
This makes sense at the top level. But just so that my understanding is correct: we want a community pipeline that can interpolate between prompts/images like the StableDiffusionInterpolation, but using the UnCLIPPipeline (a.k.a. DALL-E 2)?
The interpolation also makes sense: we generate the embeddings for the two prompts, say p1 and p2. They would correspond to x_0 and x_N of the interpolation sequence, and using slerp I would interpolate between them for N outputs in total for a pair of prompts/images.
That's exactly right @Abhinay1997 :-)
@patrickvonplaten sorry it took so long.
The UnCLIPTextInterpolation pipeline is actually more straightforward imo. For the UnCLIPImageInterpolation pipeline, would we not need the CLIP model that was used for training the UnCLIPPipeline?
Because I see the CLIP model in the original codebase but not in the Hugging Face Hub model.
I am working on the text interpolation right now. ETA: 24th Jan.
cc @williamberman here
Thanks for your work @Abhinay1997!
We do still use CLIP for the unCLIP pipeline(s). Note that we just import it from transformers.
In the text to image pipeline, we use the text encoder for encoding the prompt https://github.com/huggingface/diffusers/blob/ac3fc649066df9d347df519b3d0877d41fb847b1/src/diffusers/pipelines/unclip/pipeline_unclip.py#L70-L71
In the image variation pipeline, we use the image encoder for encoding the input image and the text encoder for encoding an empty prompt. Note that we also optionally allow directly passing image embedding to the pipeline to skip encoding an input image. https://github.com/huggingface/diffusers/blob/ac3fc649066df9d347df519b3d0877d41fb847b1/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py#L75-L78
The image interpolation pipeline should work similarly to the image variation pipeline in that it should be passed either the two input images (which would be encoded via CLIP) or the two sets of pre-encoded embeddings. Note that, similarly to the image variation pipeline, we would have to encode the empty text prompt for the image interpolation pipeline.
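If it helps, a rough sketch of that branching (the helper name and signature are made up for illustration; it mirrors the image variation pipeline's encoding step):

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

def encode_images_or_embeddings(
    images=None,
    image_embeddings=None,
    feature_extractor: CLIPImageProcessor = None,
    image_encoder: CLIPVisionModelWithProjection = None,
    device="cuda",
):
    # If pre-encoded image embeddings are passed, skip CLIP entirely,
    # as the image variation pipeline already allows.
    if image_embeddings is not None:
        return image_embeddings.to(device)
    # Otherwise encode the two input images with the CLIP image encoder.
    inputs = feature_extractor(images=images, return_tensors="pt").to(device)
    return image_encoder(**inputs).image_embeds  # shape: (2, embed_dim)
```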
LMK if that makes sense!
Hi @williamberman thank you for the details. It makes sense. I was under the impression that I needed to use the actual CLIP checkpoint that the UnCLIP model learns to invert its decoder over. So I got confused.
Will share the work in progress notebooks soon.
Hi @williamberman can you review the UnCLIPTextInterpolation notebook when you have time?
Questions:
- When prompts have different lengths, what attention mask should be used for the intermediate interpolation steps?
- Will I be able to instantiate this community pipeline using `DiffusionPipeline(custom_pipeline='....')` as I am inheriting from both DiffusionPipeline [CommunityPipeline requirement] and UnCLIPPipeline [to be able to use UnCLIP modules]?
Great work so far @Abhinay1997 !
> Hi @williamberman can you review the UnCLIPTextInterpolation notebook when you have time?
> Questions:
> - When prompts have different lengths, what attention mask should be used for the intermediate interpolation steps?
That's a good point, and I don't know off the top of my head or from googling around. I would recommend for now just using the mask of the longer prompt. cc @patrickvonplaten
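One way to read that suggestion (an illustrative helper, not an existing API): since the tokenizer's attention masks are 1s followed by padding 0s, the elementwise maximum of the two masks is exactly the mask of the longer prompt.

```python
import torch

def interpolation_attention_mask(mask_1: torch.Tensor, mask_2: torch.Tensor) -> torch.Tensor:
    # Use the more permissive of the two masks for the interpolated embeddings,
    # i.e. the mask of the longer prompt.
    return torch.maximum(mask_1, mask_2)
```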
> - Will I be able to instantiate this community pipeline using `DiffusionPipeline(custom_pipeline='....')` as I am inheriting from both DiffusionPipeline [CommunityPipeline requirement] and UnCLIPPipeline [to be able to use UnCLIP modules]?
We actually want all pipelines to be completely independent so please do not inherit from the UnCLIPPipeline :)
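For context, a community pipeline is normally loaded through the `custom_pipeline` argument of `DiffusionPipeline.from_pretrained` rather than through inheritance; the pipeline name below is only a placeholder:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha",                  # UnCLIP / Karlo weights
    custom_pipeline="unclip_text_interpolation",  # hypothetical community pipeline name
    torch_dtype=torch.float16,
)
pipe.to("cuda")
```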
@williamberman, just to clarify, can I still import UnCLIPPipeline inside the methods and use it for generation ?
Nope!
We want all pipelines to be as self-contained as possible. If any methods are exactly the same, we have the `# Copied from` mechanism (which we should document a bit better), which will let you copy and paste the method and keep the two methods in sync in CI.
We've articulated some of our rationale in the philosophy doc
https://github.com/huggingface/diffusers/blob/main/docs/source/en/conceptual/philosophy.mdx#pipelines https://github.com/huggingface/diffusers/blob/main/docs/source/en/conceptual/philosophy.mdx#tweakable-contributor-friendly-over-abstraction
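For example (a hedged sketch; the exact method and target path are illustrative), the marker sits directly above the duplicated method in the community pipeline file so CI can check that the copy stays in sync with the referenced original:

```python
from diffusers import DiffusionPipeline

class UnCLIPTextInterpolationPipeline(DiffusionPipeline):
    # Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline._encode_prompt
    def _encode_prompt(self, prompt, device, num_images_per_prompt, do_classifier_free_guidance):
        ...
```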
Thanks for the clarification @williamberman. Will update and make the PR for it soon.
@williamberman @patrickvonplaten Please find the PR for UnCLIPTextInterpolation: https://github.com/huggingface/diffusers/pull/2257
Also, what about the interpolation attention_masks? Any other thoughts on it? Using max for now, as suggested.
P.S: Planning to complete UnCLIPImageInterpolation this week
With the text interpolation pipeline merged, the image interpolation pipeline is still up for grabs!
@williamberman I was planning to start on image interpolation too. Would that be okay ?
Yes please! You're on a roll :)
UnCLIP Text Interpolation Space: https://huggingface.co/spaces/NagaSaiAbhinay/unclip_text_interpolation_demo
So, I was re-reading the DALL-E 2 paper and found that their text interpolation is a little more complicated, in that they interpolate on image embeddings using a normalised difference of the text embeddings of the two prompts. This produces much better results than my implementation. I'll update the text interpolation pipeline once the image interpolation is done.
Image Interpolation is looking good. I'm getting results in line with Dall-e 2.
Notebook: https://colab.research.google.com/drive/1eN-oy3N6amFT48hhxvv02Ad5798FDvHd?usp=sharing
Results and inputs: [images attached in the original comment]
Will open a PR tomorrow.
Wow that's super cool :fire:
Very much looking forward to the PR! Let's maybe also try to make a cool Space about this @apolinario @AK391 @osanseviero
Hi @Abhinay1997, awesome work on the community pipeline! I opened a request for a community Space for it in our Discord: https://discord.com/channels/879548962464493619/1075849794519572490/1075849794519572490. You can join here: https://discord.gg/pEYnj5ZW, check out the event by taking the #collaborate role, and write under one of the paper posts in the #making-demos forum.
Opened the PR for UnCLIPImageInterpolation: https://github.com/huggingface/diffusers/pull/2400
@williamberman @patrickvonplaten
While #2400 is under review, I wanted to share the basic outline for the UnCLIP text diff flow:
- Take the original image `x0` and generate the inverted noise `xT` using DDIM Inversion and the image embeddings `z_img_0`.
- Given a target prompt `p_target` and a caption for the original image `p_start`, compute the text embeddings `z_txt_start` and `z_txt_target`.
- Compute the text diff embeddings, `z_txt_diff = norm_diff(z_txt_start, z_txt_target)` (sketched below).
- Compute the intermediate embedding `z_inter = slerp(interp_value, z_img_0, z_txt_diff)`, where `interp_value` is linearly spaced in the interval [0.25, 0.5] (from the DALL-E 2 paper).
- Use the intermediate embeddings to generate the text diff images.
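A rough, self-contained sketch of the text diff and interpolation steps (names such as `norm_diff` mirror the outline above; shapes and the `slerp` helper are illustrative):

```python
import torch

def slerp(val, low, high):
    # Spherical linear interpolation between two embedding tensors.
    low_n = low / torch.norm(low, dim=-1, keepdim=True)
    high_n = high / torch.norm(high, dim=-1, keepdim=True)
    omega = torch.acos((low_n * high_n).sum(-1).clamp(-1.0, 1.0))
    so = torch.sin(omega)
    return (torch.sin((1.0 - val) * omega) / so).unsqueeze(-1) * low + (
        torch.sin(val * omega) / so
    ).unsqueeze(-1) * high

def norm_diff(z_txt_start, z_txt_target):
    # Normalised difference of the two text embeddings (the "text diff" from the DALL-E 2 paper).
    diff = z_txt_target - z_txt_start
    return diff / torch.norm(diff, dim=-1, keepdim=True)

# z_img_0 would be the CLIP image embedding of the original image; shapes are illustrative.
z_img_0 = torch.randn(1, 768)
z_txt_start = torch.randn(1, 768)
z_txt_target = torch.randn(1, 768)

z_txt_diff = norm_diff(z_txt_start, z_txt_target)

# interp_value linearly spaced in [0.25, 0.5], as in the paper.
interp_values = torch.linspace(0.25, 0.5, steps=4)
z_inter = torch.cat([slerp(v, z_img_0, z_txt_diff) for v in interp_values], dim=0)
```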
UnCLIP Image Interpolation demo space is up and running at https://huggingface.co/spaces/NagaSaiAbhinay/UnCLIP_Image_Interpolation_Demo
Do check it out!
Very cool Space 🔥
Super cool space @Abhinay1997 - shared it on Reddit as well :-)
Thanks @patrickvonplaten, @osanseviero !
Can we close this issue now?