StoryDiffusion

What's the difference between Consistent Self-Attention and Cross-Frame Attention in Text2Video-Zero?

Open Yushuyang1994 opened this issue 1 year ago • 2 comments

The two seem quite similar. Could you please describe the difference between them? Thank you!

Yushuyang1994, May 17 '24 06:05

I searched the entire repo and code for "Text2Video-Zero". The video weights still have not been released, and I don't see any code related to text-to-video yet. The dev said in another comment that it's just for comic generation for now. Not sure where you are seeing this?

311-code, May 17 '24 06:05

Thank you for your attention. Both Consistent Self-Attention and Cross-Frame Attention make use of the keys and values from self-attention, which was also introduced in Imagen. However, the two operate on different subjects and serve different purposes. Cross-Frame Attention is applied to video generation models and uses the first frame as a reference image. Consistent Self-Attention is built on image generation models: it samples tokens from the various character images so that character features can interact with each other, which keeps the character consistent. We will update our paper to make this distinction clearer to readers.

Z-YuPeng, May 17 '24 06:05
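
For readers who want the distinction in concrete terms, here is a minimal NumPy sketch of the two ideas as described in the reply above. It is an illustration under stated assumptions, not the code of either project: the function names, the `sample_rate` parameter, the single-head layout, and the plain weight matrices `wq`, `wk`, `wv` are all hypothetical, and the exact sampling scheme used in StoryDiffusion may differ. The only thing that changes between the two functions is where the keys and values come from.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q: (B, Nq, D); k and v: (B, Nk, D). Single head, no masking.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cross_frame_attention(x, wq, wk, wv):
    # x: (F, N, D), the tokens of F video frames.
    # Every frame keeps its own queries, but keys and values
    # are always computed from the first frame (the reference).
    ref = np.broadcast_to(x[:1], x.shape)
    return attention(x @ wq, ref @ wk, ref @ wv)

def consistent_self_attention(x, wq, wk, wv, sample_rate=0.5, rng=None):
    # x: (B, N, D), the tokens of B images of the same character (B >= 2).
    # Each image attends to its own tokens plus tokens randomly sampled
    # from the other images in the batch (hypothetical sampling scheme).
    rng = np.random.default_rng() if rng is None else rng
    b, n, _ = x.shape
    n_sampled = int(n * sample_rate)
    pooled = []
    for i in range(b):
        others = np.concatenate([x[j] for j in range(b) if j != i], axis=0)
        idx = rng.choice(others.shape[0], size=n_sampled, replace=False)
        pooled.append(np.concatenate([x[i], others[idx]], axis=0))
    p = np.stack(pooled)  # (B, N + n_sampled, D)
    return attention(x @ wq, p @ wk, p @ wv)

# Toy usage: 4 frames/images, 16 tokens each, feature size 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16, 8))
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out_cf = cross_frame_attention(x, wq, wk, wv)
out_csa = consistent_self_attention(x, wq, wk, wv, rng=rng)
print(out_cf.shape, out_csa.shape)  # (4, 16, 8) (4, 16, 8)
```

In this sketch, cross-frame attention has a single fixed reference (frame 0), so information flows one way from the first frame to the others. In the consistent variant every image both contributes tokens to and receives tokens from the rest of the batch, which matches the "interaction among character features" described in the reply.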