video-diffusion-pytorch Conditioning on image + text embedding

Open ChintanTrivedi opened this issue 2 years ago • 4 comments

Looking for pointers to get started on modifying the conditioning code below to include conditioning on an image along with text.

videos = torch.randn(2, 3, 5, 32, 32) # video (batch, channels, frames, height, width)
text = torch.randn(2, 64)             # assume output of BERT-large has dimension of 64
loss = diffusion(videos, cond = text)

So far I am trying to condition on CLIP embeddings:

videos = torch.randn(2, 3, 5, 32, 32)  # video (batch, channels, frames, height, width)
image_emb = torch.randn(2, 512)        # image (batch, CLIP ViT-B/32 embedding)
text_emb = torch.randn(2, 64)          # assume output of BERT-large has dimension of 64

# combine the image and text embeddings into one conditioning vector of shape (2, 576)
cond_emb = torch.cat((image_emb, text_emb), dim=1)

loss = diffusion(videos, cond = cond_emb)
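
One thing to watch with the concatenation above: the combined embedding is 576 wide (512 + 64), so the model's conditioning dimension has to match it. A possible alternative, sketched below, is to project the concatenated vector back down to the width the model already expects. `CondFusion` is a hypothetical helper, not part of video-diffusion-pytorch, and the 64-wide target is an assumption based on the text-only example.

```python
import torch
from torch import nn

class CondFusion(nn.Module):
    """Hypothetical helper (not part of video-diffusion-pytorch): fuses an image
    embedding and a text embedding into a single conditioning vector."""

    def __init__(self, image_dim=512, text_dim=64, cond_dim=64):
        super().__init__()
        # One learned projection from the concatenated width to the target width.
        self.proj = nn.Linear(image_dim + text_dim, cond_dim)

    def forward(self, image_emb, text_emb):
        fused = torch.cat((image_emb, text_emb), dim=1)  # (batch, image_dim + text_dim)
        return self.proj(fused)                          # (batch, cond_dim)

fusion = CondFusion()
image_emb = torch.randn(2, 512)  # stand-in for a CLIP image embedding
text_emb = torch.randn(2, 64)    # stand-in for a text embedding
cond_emb = fusion(image_emb, text_emb)
print(cond_emb.shape)  # torch.Size([2, 64])
```

The projection would need to be trained jointly with the diffusion model, so its parameters should be passed to the same optimizer.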

However, is there a better way to condition on images in pixel space rather than on latent representations? That might also make it possible to use the model autoregressively, with the last frame of one diffusion sample serving as the input condition for the next sample.

PS: Thanks, Phil, for the quick implementation of an interesting paper that doesn't have official code out yet!

May 12 '22 07:05 ChintanTrivedi

I think you can try concatenating the image directly to the video frames in the channel dim. That was what SR3 (a paper using image diffusion for image super-resolution) did.
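
A minimal sketch of that channel-wise concatenation, adapted to video by repeating the conditioning image across every frame. The function name is hypothetical, and the denoising network is assumed to accept 6 input channels while still predicting 3; that change is not shown here.

```python
import torch

def concat_image_on_channels(video, image):
    """Hypothetical helper: stack a conditioning image onto every frame's channels.

    video: (batch, channels, frames, height, width)
    image: (batch, channels, height, width)
    """
    frames = video.shape[2]
    # Add a frame axis, then repeat the image once per frame (no copy until cat).
    image = image.unsqueeze(2).expand(-1, -1, frames, -1, -1)
    return torch.cat((video, image), dim=1)

noisy_video = torch.randn(2, 3, 5, 32, 32)
cond_image = torch.randn(2, 3, 32, 32)
x = concat_image_on_channels(noisy_video, cond_image)
print(x.shape)  # torch.Size([2, 6, 5, 32, 32])
```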

May 18 '22 02:05 zkx06111

Thanks @zkx06111, I checked it out, and that makes a lot of sense. Shouldn't it be along the frames dim instead of channel since this is video conditioned on image, not image conditioned on image?

If the noise is (32, 3, 10, 128, 128) and the image condition is (32, 3, 128, 128), then the concatenated input would be (32, 3, 11, 128, 128), with the image prepended as an extra frame ahead of the first noise frame.
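
The frame-wise variant could be sketched as follows, using smaller tensors than the shapes above to keep it light. The function name is hypothetical. This is only the tensor manipulation: the model would presumably need to be configured for 11 frames, and the loss restricted to the 10 generated frames, neither of which is shown.

```python
import torch

def prepend_image_as_frame(noise, image):
    """Hypothetical helper: insert the conditioning image as frame 0.

    noise: (batch, channels, frames, height, width)
    image: (batch, channels, height, width)
    """
    return torch.cat((image.unsqueeze(2), noise), dim=2)

noise = torch.randn(4, 3, 10, 16, 16)
image = torch.randn(4, 3, 16, 16)
x = prepend_image_as_frame(noise, image)
print(x.shape)  # torch.Size([4, 3, 11, 16, 16])

# For autoregressive use, the last frame of one sample becomes the next condition.
next_image = x[:, :, -1]  # (batch, channels, height, width)
```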

May 18 '22 07:05 ChintanTrivedi

@ChintanTrivedi Did you have any success with that?

Jul 25 '22 07:07 oxjohanndiep

How do I condition on a custom input (image/GIF + text)? The model should be loaded from the milestones/checkpoints already saved in the "./results/" folder.

Thank you.

Oct 18 '23 07:10 chpk
