AnimateDiff
Training issues and learning rates
Hi! Thanks for releasing the models and the training code. That's a massive contribution!
I've tried to train the model, either by fine-tuning the already released model or by training from scratch, but the result is always the same: the model starts collapsing and the frames produced during training are only noise.
Here is what I tried to prevent that:
- Using 30 videos or 3,500 videos.
- Using different batch sizes (I started at BS 1 because I don't have enough VRAM to go higher with 24 GB):
  - Gradient accumulation steps of 4 with BS 1: no real change.
  - BS 4 with gradient accumulation steps of 1 plus gradient checkpointing: strangely, the model didn't seem to learn ANYTHING when using gradient checkpointing (see the sketch after this list).
- The only thing that had any effect was drastically reducing the learning rate:
  - LR 1e-4: the model collapses after only 40 steps.
  - LR 1e-5: the model collapses after around 100 steps.
  - LR 1e-7: the model collapses after 10K steps, but it didn't learn anything.
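For concreteness, here is a minimal PyTorch sketch of what I mean by gradient accumulation, with toy stand-ins for the actual UNet and video batches (the real loop uses the repo's own objects):

```python
import torch
from torch import nn

# Toy stand-ins -- the real loop uses the repo's UNet and video batches.
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
data = [torch.randn(1, 8) for _ in range(8)]  # micro-batches of size 1

accum_steps = 4  # effective batch size = 1 * accum_steps

optimizer.zero_grad()
for i, x in enumerate(data):
    loss = model(x).pow(2).mean() / accum_steps  # scale so gradients average
    loss.backward()  # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()  # one update per accum_steps micro-batches
        optimizer.zero_grad()
```

On the checkpointing symptom: with PyTorch's reentrant `torch.utils.checkpoint`, if none of the checkpointed inputs have `requires_grad=True`, parameter gradients inside the checkpointed blocks can silently come out as `None`, which would look exactly like the model learning nothing. It may be worth checking how the training script invokes checkpointing.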
I haven't tried using the original dataset of videos; that would be my next test. Could it be caused by the videos I used? Something to do with the FPS, or anything else?
Has anyone else managed to train from scratch or fine-tune? If so, what LR did you use, and which other params did you change in the training.yaml file?
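For reference, the kind of edit I mean is something like this, assuming the repo loads the YAML with OmegaConf (the key name `learning_rate` and the config path are guesses; check your own training.yaml):

```python
from omegaconf import OmegaConf

# Key name and path are guesses -- adapt to your training.yaml.
cfg = OmegaConf.load("configs/training/training.yaml")
cfg.learning_rate = 1e-5
OmegaConf.save(cfg, "configs/training/training_lowlr.yaml")
```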
Thanks
Same issue here. Have you managed to solve this problem?
No, I haven't tried again
I updated xformers from 0.0.16 to 0.0.17 and then it worked; maybe you can try this.
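A quick sanity check of the installed version, and that memory-efficient attention actually runs (the shape here is arbitrary; the smoke test needs a CUDA device):

```python
import torch
import xformers
import xformers.ops

print("xformers:", xformers.__version__)
print("torch:", torch.__version__)

# Smoke-test memory-efficient attention on an arbitrary
# (batch, seq_len, heads, head_dim) shape.
if torch.cuda.is_available():
    q = torch.randn(1, 16, 8, 64, device="cuda", dtype=torch.float16)
    out = xformers.ops.memory_efficient_attention(q, q, q)
    print("memory_efficient_attention OK:", tuple(out.shape))
```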
Hi, what do your videos look like? Do they all show the same motion?
Any update? I'm hitting the same issue. I'm not sure whether the collapse is related to the training dataset. I used TikTok videos to train the motion module from scratch without modifying any hyperparams in the training config file, but got noisy video after about 30 training steps. My xformers version is 0.0.20.
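In case it helps with debugging: collapse usually shows up first as a loss spike or non-finite values, so a small check like this hypothetical helper (called after `loss.backward()` and before `optimizer.step()`) can catch it before the frames turn to pure noise:

```python
import math
import torch

def check_step(loss: torch.Tensor, model: torch.nn.Module,
               step: int, max_grad_norm: float = 1.0) -> None:
    # Hypothetical helper -- not part of the repo's training script.
    if not math.isfinite(loss.item()):
        raise RuntimeError(f"non-finite loss at step {step}")
    # clip_grad_norm_ returns the pre-clip norm, so LR-driven
    # blow-ups are visible before the model fully collapses.
    grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(),
                                                     max_grad_norm))
    if step % 10 == 0:
        print(f"step {step}: loss={loss.item():.4f}, grad_norm={grad_norm:.2f}")
```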
Is it related to this issue?