Why is the logic for long video generation not entirely consistent with what is described in the paper?
Hello, and thank you for the great algorithm and code!
I noticed that your paper mentions two control methods: "UniAnimate supports human video animation using only a reference image and a target pose sequence, as well as the input of a first frame."
For long videos, you mentioned: "For subsequent segments, we use the reference image along with the first frame of the previous segment to initiate the next generation."
Does this mean that, during inference for subsequent windows, I should use the last frame of the previous segment's output as the reference image, or should I instead pass it in as `local_image`?
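To make the first interpretation concrete, here is a minimal sketch of the segment-chaining logic as I understand it. All the names here (`generate_segment`, `local_image`, `window_size`) are my own placeholders for illustration, not the repository's actual API:

```python
import numpy as np

def generate_segment(ref_image, pose_chunk, local_image=None):
    """Stub standing in for one call to the video diffusion model.

    A real call would run UniAnimate on this pose window; here we just
    return dummy frames of the right shape for illustration.
    """
    return np.zeros((len(pose_chunk), *ref_image.shape), dtype=ref_image.dtype)

def generate_long_video(ref_image, pose_sequence, window_size=16):
    segments = []
    cond_frame = None  # no first-frame condition for the very first window
    for start in range(0, len(pose_sequence), window_size):
        chunk = pose_sequence[start:start + window_size]
        video = generate_segment(ref_image, chunk, local_image=cond_frame)
        segments.append(video)
        cond_frame = video[-1]  # last frame of this segment seeds the next
    return np.concatenate(segments, axis=0)
```

With this scheme, the reference image stays fixed across all windows, each window is conditioned on a single frame from the previous one, and the windows do not overlap at all.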
Additionally, I noticed that the long-video path in your code still uses an overlapped sliding-window approach rather than the first-frame conditioning described above. What is the reasoning behind this choice?
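By contrast, my reading of the overlap approach is something like the sketch below, where consecutive windows share `overlap` frames and the shared region is blended when stitching. The helper name and the linear cross-fade are my assumptions for illustration; I am not certain this is exactly what the code does:

```python
import numpy as np

def merge_overlapped_windows(windows, overlap):
    """Stitch per-window outputs by cross-fading the overlapped frames.

    `windows` is a list of float arrays shaped (num_frames, H, W, C);
    each window is assumed to overlap the previous one by `overlap` frames.
    """
    merged = windows[0]
    # Linear cross-fade weights over the overlapped region.
    w = np.linspace(0.0, 1.0, overlap).reshape(-1, 1, 1, 1)
    for win in windows[1:]:
        tail = merged[-overlap:]   # end of what has been stitched so far
        head = win[:overlap]       # start of the new window
        blended = (1.0 - w) * tail + w * head
        merged = np.concatenate(
            [merged[:-overlap], blended, win[overlap:]], axis=0
        )
    return merged
```

If this is roughly right, the overlap approach trades extra computation (frames generated twice) for smoother transitions than single-frame conditioning, so I would like to understand when each method is preferable.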
Thank you
I have the same question about long videos. Do you have any tips for this?