The advantages compared with ConsiStory.
Hi, StoryDiffusion is a nice work on customized generation; thanks for the open-source code. Could you describe the differences and advantages of the proposed CSA compared with ConsiStory [1]?
[1] Training-Free Consistent Text-to-Image Generation https://arxiv.org/abs/2402.03286
Hi, our StoryDiffusion was submitted to a conference at the end of February 2024 and simply has not been posted on arXiv yet, so this can be considered concurrent work.
One of the differences I'm seeing (if I understand it correctly) is that ConsiStory uses a subject-driven masking technique for cross-image sampling, while StoryDiffusion takes the whole context (or semi-randomized parts) of the other images. So ConsiStory would be better at maintaining subject consistency but less viable for video, because it does not attend to background consistency.
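For concreteness, here is a small hypothetical contrast (PyTorch) of the two ways the shared key/value tokens could be selected; `select_shared_tokens`, `subject_mask`, and `sample_ratio` are illustrative names and not code from either repository:

```python
# Illustrative contrast, not either paper's actual code: how the shared
# tokens for cross-image self-attention could be picked under (a) a subject
# mask vs. (b) whole-/randomly-sampled context.
import torch

def select_shared_tokens(other_feats, subject_mask=None, sample_ratio=0.5):
    """other_feats: (B, N, C) token features of the other images in the batch.
    subject_mask:  optional (B, N) boolean mask of subject tokens.
    Returns an (M, C) bank of tokens to concatenate onto each image's own K/V."""
    B, N, C = other_feats.shape
    flat = other_feats.reshape(B * N, C)
    if subject_mask is not None:
        # Mask-based sharing: only tokens under the subject mask are shared,
        # so background tokens never influence the other images.
        return flat[subject_mask.reshape(-1)]
    # Whole-context sharing (as described above), thinned by random sampling
    # to keep the shared token bank small.
    keep = torch.rand(B * N) < sample_ratio
    return flat[keep]
```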
Hi,
Thanks for your interest and for bringing up this related work. The differences are listed below for your reference:
- The functionality is different. As William already pointed out, Subject-Driven SA (SDSA) in ConsiStory needs a subject mask, while StoryDiffusion is not constrained to use one. The benefit of StoryDiffusion shows in storytelling and video generation, where not only the subject identity but also the overall context, such as the background and attire, is kept consistent.
- The operator design is different. ConsiStory's SA operator needs to use all images in the batch and can therefore run into memory issues: the total number of images cannot be large. In contrast, StoryDiffusion implements a window-based SA operator, which lets us generate arbitrarily long image sequences/videos without that memory constraint (see the sketch after this list). Besides, StoryDiffusion operates at the token level, while ConsiStory operates at the query level.
- As Yupeng pointed out, StoryDiffusion was completed in February 2024. We agree that ConsiStory is related work, and we will add it to the discussion in our updated arXiv version.
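A minimal sketch of how I read the window-based, token-level operator (PyTorch; `to_q`, `to_k`, `to_v`, `window`, and `sample_ratio` are placeholder names, not the released implementation): frames are processed in fixed-size windows, so the shared token bank, and hence memory, does not grow with the total number of frames.

```python
# Sketch of a window-based consistent self-attention over a long frame sequence.
import torch
import torch.nn.functional as F

def windowed_consistent_self_attn(x, to_q, to_k, to_v, window=4, sample_ratio=0.5):
    """x: (T, N, C) per-frame token features for a sequence of T frames."""
    T, N, C = x.shape
    out = torch.empty_like(x)
    for start in range(0, T, window):
        chunk = x[start:start + window]          # (W, N, C) with W <= window
        W = chunk.shape[0]
        for i in range(W):
            # Token-level sharing: randomly sample tokens from the OTHER
            # frames inside this window and append them to frame i's own tokens.
            others = torch.cat([chunk[j] for j in range(W) if j != i], dim=0) \
                     if W > 1 else chunk[i][:0]
            keep = torch.rand(others.shape[0]) < sample_ratio
            bank = torch.cat([chunk[i], others[keep]], dim=0)
            q = to_q(chunk[i]).unsqueeze(0)      # (1, N, C)
            k = to_k(bank).unsqueeze(0)          # (1, N + M, C)
            v = to_v(bank).unsqueeze(0)
            out[start + i] = F.scaled_dot_product_attention(q, k, v).squeeze(0)
    return out
```

With `window=4`, each frame only ever attends to sampled tokens from at most three neighboring frames, no matter how long the sequence is, which is why the memory cost stays bounded.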
I hope this clarifies your concerns.
Best regards, DQ
Thanks for the replies. Of course, StoryDiffusion and ConsiStory are concurrent works, and both are nice. I opened this issue not to question that, but to discuss the method itself.
CSA is indeed friendlier to the subsequent video-generation stage, while SDSA focuses on the consistency of a specified subject and generates images with diverse backgrounds. Your replies made this clear to me.
I have one more question about stage 2 of StoryDiffusion. Why should we use CLIP semantic features for interpolation? Wouldn't texture features be better? For instance, the first and last frames shown in Fig. 3 seem to have very close CLIP semantic features (e.g., a man walking on the road).
This is a good question. Let me explain from the following aspects:
- First, it is not exactly the CLIP encoder output: we do not take the final pooled CLIP embedding, but instead extract features from an intermediate layer, which preserves some structural information.
- Predicting the middle frames from temporal attention alone is limited, because temporal attention only operates along the time dimension at the same spatial position.
- From a dimensionality-reduction perspective, the image semantic embedding implicitly encodes information such as the person's pose. In effect, this avoids predicting directly in the high-dimensional natural-image space and instead predicts in the low-dimensional image-semantic space (sketched below).
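A minimal sketch of this idea (my own illustration with Hugging Face `transformers`, not the paper's pipeline; the model name, `layer=-2`, and plain linear interpolation are assumptions): take a hidden state before CLIP's final pooling for the first and last frames, then interpolate in that semantic space to get one conditioning embedding per middle frame.

```python
# Condition middle frames on interpolated mid-layer CLIP image features of the
# first and last frames, instead of predicting directly in pixel space.
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()

@torch.no_grad()
def mid_layer_embedding(image, layer=-2):
    """Use a hidden state before the final pooling; it keeps per-patch
    (structural) information rather than a single pooled vector."""
    inputs = processor(images=image, return_tensors="pt")
    hidden = encoder(**inputs, output_hidden_states=True).hidden_states[layer]
    return hidden  # (1, num_patches + 1, dim)

def interpolate_conditions(first_img, last_img, num_middle):
    e0, e1 = mid_layer_embedding(first_img), mid_layer_embedding(last_img)
    ts = torch.linspace(0, 1, num_middle + 2)[1:-1]
    # Each interpolated embedding conditions the generator for one middle frame,
    # so prediction happens in this low-dimensional semantic space.
    return [(1 - t) * e0 + t * e1 for t in ts]
```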
Got it, thanks.