StoryGen Selection of reference images

Thank you for your excellent work! In a previous issue, you mentioned that to generate a continuous story, one can first generate the first frame in single-frame mode (stage='no'), then use that frame as the ref_image for the next frame, and continue autoregressively until the story is complete.

My question is which reference paths to use at each step. If I set stage='no' for the first frame and use it as the reference for the second frame, should the third frame take both the first and second frames as references? If so, would the seventh frame take all six preceding frames? Or should the reference always be the first generated frame only? Or only the immediately preceding frame? (I tried the last option and the results were not good.)

The attached image is my reproduction of the example in your paper, with the reference image always fixed to the first frame. I'm not sure whether this is correct. I would appreciate a reply, and sorry to bother you!

Mar 25 '25 05:03 Sunny9998

Thank you for your question. I will address it based on the following points:

1. Input Transformation: Please refer to the guidelines in previous issues (https://github.com/haoningwu3639/StoryGen/issues/10#issuecomment-2002906594 and https://github.com/haoningwu3639/StoryGen/issues/14#issuecomment-2021797561). You need to convert the narrative texts in the examples into descriptive prompts that are better suited for StoryGen as input.
2. Autoregressive Generation: Your understanding of the autoregressive process is correct. However, due to GPU computational limitations, we only use up to 3 previous frames as contextual conditions for subsequent generation steps.
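
For illustration, the autoregressive loop with at most 3 previous frames as context could be sketched as follows. This is a minimal sketch, not StoryGen's actual code: `generate_frame` is a hypothetical callable standing in for the real inference call, assumed to take a prompt plus a list of reference image paths and return the path of the generated frame. An empty reference list is assumed to correspond to single-frame mode.

```python
from typing import Callable, List, Sequence

MAX_CONTEXT = 3  # at most 3 previous frames are used as context


def generate_story(
    prompts: Sequence[str],
    generate_frame: Callable[[str, List[str]], str],
    max_context: int = MAX_CONTEXT,
) -> List[str]:
    """Generate one frame per prompt, conditioning on the most recent frames.

    `generate_frame(prompt, ref_paths)` is a hypothetical stand-in for the
    real inference call; it is assumed to return the generated image path.
    """
    frames: List[str] = []
    for prompt in prompts:
        # Sliding window over already generated frames. For the first
        # prompt the window is empty (assumed to mean single-frame mode).
        refs = frames[-max_context:] if max_context > 0 else []
        frames.append(generate_frame(prompt, refs))
    return frames
```

Under these assumptions, the seventh frame would be conditioned on frames 4, 5, and 6 rather than on all six preceding frames.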

Mar 25 '25 07:03 haoningwu3639

Thank you for your attention. I'd like to add the following points:

1. Please refer to Section G.1 in the paper's appendix, which shows that the number of conditioning frames does not have a significant impact on the results. It is therefore acceptable either to use the previous three frames as conditioning frames or to consistently use the first frame. Using the previous three frames introduces more contextual information, while using only the first frame mitigates quality degradation as generation proceeds. You can choose based on your actual situation.
2. A suitable first frame is crucial for generation quality, since it sets the upper bound on quality for the entire generation process. You can generate the first frame multiple times and keep the highest-quality result.
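
The alternative strategy from the two points above (pick the best of several first-frame candidates, then always condition on that first frame) could be sketched like this. It is only a sketch under stated assumptions: `generate_frame` and `score_frame` are hypothetical callables, not part of StoryGen. In practice the "score" may simply be a manual choice by eye.

```python
from typing import Callable, List, Sequence


def generate_story_first_frame_ref(
    prompts: Sequence[str],
    generate_frame: Callable[[str, List[str]], str],
    score_frame: Callable[[str], float],
    num_candidates: int = 4,
) -> List[str]:
    """Best-of-N first frame, then condition every later frame on it.

    Both callables are hypothetical: `generate_frame(prompt, ref_paths)`
    returns an image path, `score_frame(path)` returns a quality score.
    """
    if not prompts:
        return []
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")

    # Generate the first frame several times without references and
    # keep the candidate with the highest score.
    candidates = [generate_frame(prompts[0], []) for _ in range(num_candidates)]
    first = max(candidates, key=score_frame)

    frames = [first]
    for prompt in prompts[1:]:
        # The reference is always the selected first frame.
        frames.append(generate_frame(prompt, [first]))
    return frames
```

Fixing the reference to the first frame means errors in later frames are never fed back in, at the cost of less context than the previous-three-frames setup.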

Mar 25 '25 09:03 Verg-Avesta