video-diffusion-pytorch
Generating longer videos at test time
Thank you for quickly implementing this model @lucidrains! Maybe you already have or are planning to do this -- "To manage the computational requirements of training our models, we only train on a small subset of say 16 frames at a time. However, at test time we can generate longer videos by extending our samples." (sec 3.1). Currently the sample() function is fixed length. Wanted to check in with you before taking a stab.
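For concreteness, a rough sketch of the simpler "replacement" style of extension -- pin the frames we already have at every reverse step and let the model fill in the rest. This is not the paper's reconstruction-guidance / gradient method, and model / alphas_cumprod below are generic stand-ins rather than this repo's API:

import torch

@torch.no_grad()
def extend_by_replacement(model, alphas_cumprod, known, num_new_frames):
    # known: (b, c, f, h, w) frames already generated
    # model is assumed (for this sketch) to predict x0 given a noisy video and a timestep
    b, c, f, h, w = known.shape
    video = torch.randn(b, c, f + num_new_frames, h, w, device=known.device)
    for t in reversed(range(len(alphas_cumprod))):
        a = alphas_cumprod[t]
        # re-noise the known frames to the current noise level and pin them,
        # so the reverse process only has to fill in the new frames
        video[:, :, :f] = a.sqrt() * known + (1 - a).sqrt() * torch.randn_like(known)
        t_batch = torch.full((b,), t, device=video.device, dtype=torch.long)
        pred_x0 = model(video, t_batch)
        if t == 0:
            video = pred_x0
        else:
            # crude ancestral step: re-noise the x0 prediction to level t - 1
            a_prev = alphas_cumprod[t - 1]
            video = a_prev.sqrt() * pred_x0 + (1 - a_prev).sqrt() * torch.randn_like(video)
    return video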
You are welcome and no immediate plans yet for that portion. If you get to it first, do submit a PR :)
@mrkulk still planning on giving it a stab?
been planning to, but came across a big issue in the stack before getting to it. I am unable to get really good temporal coherence in the basic prediction task (the gradient method is supposed to fix this, but I am still getting choppy predictions)
@mrkulk ahh got it, ok, i'll take a look at the gradient method tomorrow
@lucidrains still training (moving mnist) and it might reach temporal coherence after more training but this is where it's at after 40k steps.
@mrkulk cool! have you tried training it with periodic arresting of attention across time as described in the paper?
i'll have to revisit the gradient method next week
@lucidrains let me start a run now and see
much more stable, but there is a temporal coherence issue. actually the problem might be deeper than inference sampling -- we would expect the predictions within just the training snippets to be consistent, but they are not
@mrkulk very cool! how often are you arresting the attention across time for the experiment above?
@lucidrains tried this https://github.com/lucidrains/video-diffusion-pytorch/blob/main/video_diffusion_pytorch/video_diffusion_pytorch.py#L284. Do we mean the same thing or are you referring to something else?
@mrkulk yup that's it! but what i got from the paper is that they didn't train with the arresting of attention across time exclusively? they traded off training normally vs restricting it to each frame. wasn't sure if this was done on a schedule, or alternating, or some other strategy. correct me if i'm way off base
@lucidrains it looks like they are mixing random frames at the end of a video seq -- "To implement this joint training, we concatenate random independent image frames to the end of each video sampled from the dataset, and we mask the attention in the temporal attention blocks to prevent mixing information across video frames and each individual image frame. We choose these random independent images from random videos within the same dataset;"
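A rough sketch of what that temporal attention masking could look like, assuming the video frames occupy the first positions of the temporal sequence and the appended independent images the last ones (hypothetical helper, not the repo's implementation):

import torch

def joint_training_attn_mask(num_video_frames, num_image_frames, device=None):
    # hypothetical helper: video frames attend to each other freely, while each
    # appended independent image frame may only attend to itself, so no
    # information mixes across the image frames
    n = num_video_frames + num_image_frames
    mask = torch.zeros(n, n, dtype=torch.bool, device=device)
    mask[:num_video_frames, :num_video_frames] = True   # video <-> video
    idx = torch.arange(num_video_frames, n, device=device)
    mask[idx, idx] = True                                # each image -> itself only
    return mask                                          # True = attention allowed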
@lucidrains one interesting thing I noticed is that without focus_on_the_present turned on, it has a hard time even memorizing a single video frame. It does a good job with it turned on, but it's choppy even after a lot of training. Trying to see if it gets better with focus_on_the_present=np.random.uniform(0, 1) > 0.5 and focus_on_the_present=True if self.global_step <= 2000 else False.
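A tiny sketch combining the two schedules above (purely illustrative; the names are hypothetical):

import random

def focus_on_present(step, warmup_steps=2000, prob=0.5):
    # always focus on the present for the first `warmup_steps`,
    # then only with probability `prob`
    return step <= warmup_steps or random.random() < prob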
ohh nice! is that generated?
@lucidrains yes, but it is an overfitting test, though I suspect it will work (smooth motions won't happen, or there might also be mixing of digits). I also ran a schedule (2k with focus on present and then turned it off). It didn't work as expected (although I rarely see some smooth motions).
It's fine around 2k when focus on present is on, but then it diverges --
@lucidrains I am beginning to suspect it's something in the Unet? Could it be the attention or positional embeddings?
@mrkulk positional embedding should be fine, i'm using classic T5 relative positional bias (could even switch to a stronger one, rotary embeddings, if need be)
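For reference, a rough sketch of the T5-style bucketing such a relative positional bias is built on (illustrative, not the exact code in this repo); the returned bucket indices would be passed through a small embedding table to produce a per-head bias added to the attention logits:

import math
import torch

def t5_relative_position_bucket(seq_len, num_buckets=32, max_distance=128):
    # signed relative offsets map to a small set of buckets -- exact for short
    # distances, logarithmically spaced for long ones
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]            # (seq_len, seq_len) signed offsets
    num_buckets //= 2
    bucket = (rel > 0).long() * num_buckets      # separate buckets for +/- offsets
    rel = rel.abs()
    max_exact = num_buckets // 2
    is_small = rel < max_exact
    large = max_exact + (
        torch.log(rel.float().clamp(min=1) / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).long()
    large = torch.clamp(large, max=num_buckets - 1)
    return bucket + torch.where(is_small, rel, large)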
let me offer a way to turn off the sparse linear attention i have at every layer, and we can debug to see if that is the culprit
i have also switched from resnet blocks to the newer convnext https://arxiv.org/abs/2201.03545, but can always bring back resnets if somehow it isn't suitable for generative work
> @lucidrains it looks like they are mixing random frames at the end of a video seq -- "To implement this joint training, we concatenate random independent image frames to the end of each video sampled from the dataset, and we mask the attention in the temporal attention blocks to prevent mixing information across video frames and each individual image frame. We choose these random independent images from random videos within the same dataset;"
i see, this must be to counter overfitting, as most videos have very similar frames. i'll think about how to build this into the trainer
@mrkulk https://github.com/lucidrains/video-diffusion-pytorch/commit/233c1d695e1a80267dac7ddd64d1d8acab17b1f6#diff-4ff1a95f5e6b9add82d0e523fd2d858ca38e67b393ea87c2ae88a8b14a0fbb1cR305 this should allow you to turn off the linear attention, in case that is causing the divergence
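For anyone following along, usage would presumably look something like the sketch below; the keyword name use_sparse_linear_attn is an assumption here and should be double-checked against the linked commit:

from video_diffusion_pytorch import Unet3D

# assumption: the new flag is named `use_sparse_linear_attn` -
# verify against the linked commit before relying on this
model = Unet3D(
    dim = 64,
    dim_mults = (1, 2, 4, 8),
    use_sparse_linear_attn = False   # turn off the sparse linear attention while debugging
)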
@mrkulk ok, i've brought back the old resnet blocks in version 0.3.1, and started a run on my own moving mnist dataset
perhaps jumping onwards to convnext wasn't the greatest idea :sweat_smile:
@lucidrains ok sounds good. will wait for your ping to do some more debugging/testing once you take a stab. btw before 0.3.1 I got errors on forward due to SpatialLinearAttention (goes away if you turn it off though). you have probably already run into it
@mrkulk yup, that attention error should be fixed! here is the experiment https://wandb.ai/lucidrains/video-diffusion-redo/reports/moving-mnist-video-diffusion--VmlldzoxOTQ3OTM0?accessToken=6m0nlx9992n6pind2j3113v03tbsps52v0rtkyw4jqotpgz99ziwlx2zsh6remna also, thanks for the sponsorship! :heart:
@lucidrains awesome! the samples, even at the beginning, look qualitatively different. loss seems to be steadily going down. wonder why it's blobby
haha, i actually compared it to my previous convnext run and it looks about the same
this moving mnist dataset actually comes from https://github.com/RoshanRane/segmentation-moving-MNIST (minus the salt and pepper background noise)
i'll retry a run tonight with the new focus-on-present probabilities hyperparameter
@lucidrains i was using a different trainer, but if i use yours and the code below then it seems more reasonable (only 1k iters). The moving mnist is from: https://www.cs.toronto.edu/~nitish/unsupervised_video/
import os
from pathlib import Path

import numpy as np
from moviepy.editor import ImageSequenceClip

def moving_mnist_gif_creator():
    # mnist_test_seq.npy has shape (num_frames, num_videos, 64, 64)
    root = os.path.expanduser('~/datasets/mnist_test_seq.npy')
    out_root = os.path.expanduser('~/datasets/moving_mnist')
    data = np.load(root)
    for ii in range(data.shape[1]):
        # first 10 frames of video ii, with a trailing channel dimension added
        clip = ImageSequenceClip(list(data[:10, ii][..., None]), fps=20)
        name = str(ii) + '.gif'
        clip.write_gif(str(Path(out_root) / name), fps=60)
@mrkulk It turns out I had the wrong settings
You are right, the old resnet blocks work much better than the convnext blocks, and I will likely remove the convnext blocks today so as not to confuse researchers
14k for moving mnist - it has already figured out some of the background objects, though it has yet to segment the digits, but it is already training way faster than a pure attention-based solution
i've also added rotary positional embeddings to the temporal attention. that should definitely help, no question
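A generic sketch of rotary embeddings as they are typically applied to the queries and keys of temporal attention (illustrative, not the exact implementation in this repo):

import torch

def apply_rotary(x, base=10000):
    # x is (..., seq_len, dim) with an even dim; channel pairs (i, i + dim/2) are
    # rotated by a position-dependent angle. applied to both queries and keys,
    # the attention logits then depend only on relative positions along time
    *_, n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(n, dtype=x.dtype, device=x.device)[:, None] * freqs  # (n, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)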
for reference, using convnext blocks at around 13k
very interesting! can give this a shot today. didn't expect convnext to not work as well. Another interesting and somewhat related point -- the transframer paper (https://arxiv.org/pdf/2203.09494.pdf) used NF-ResNet blocks within a 3d Unet for video generation. but I still need to mentally consolidate and think about these variations
@mrkulk cool! was not aware of transframer - will need to queue that up for reading
second day of training, it has yet again improved (but still has not segmented the digits yet as the attention based one has, but it is still early in training)
there are these weird flickering artifacts as well, from frame to frame, and i'll have to see if they persist once i move the training onto some remote compute that came my way
it seems to have stabilized on the background and filling in higher frequency details -- quite promising!
how many videos are you training this on and what batch size?
ran a test on 10k videos on the same dataset but with a smaller batch size (on a V100). This is without sparse linear attn. I think the trend is similar:
https://wandb.ai/csmai/multimodal-video-generation/reports/moving-mnist-video_diffusion_model--VmlldzoxOTYwNjY0?accessToken=q26z0simmgj0zqjc61kfsixezrzvo3clidjkuujwl5ttrpivpoj3c5idj886qh4c
Seems like the digits will come towards the very end, but it's interesting that pure attn gets it. How many global steps are you in compared to the pure attn based one?