Notable gap between image stylization and video stylization based on AnimateDiff and ControlNet
Hi, thanks for your great work! I've been exploring video stylization with AnimateDiff, and I noticed you might have already tried this out. I've found that the generated video differs significantly from image-to-image stylization: the image stylization is clean and aligns well with the style model, but with adv3 there's a notable gap. Have you experienced this? Additionally, when using adv3, the output becomes less smooth beyond 16 frames, resulting in flickering. Could I try your trained 48-frame model?
Thanks a lot!
I'm using adv3 too; it is much more expressive than my own model. The 48-frame model gives more stability and removes flickering. I trained my models on 4000+ samples for 180,000 steps, and my model lost the ability to imagine anything new.
I found that using a LoRA brings quite good results. I'm training a LoRA on all attention layers (UNet and motion module) on a single video, 3 frames at a time at 1024x576 resolution, with frame-position information added to the embeddings, for 100 epochs (for example, if the video contains 30 frames, that works out to 1000 steps). Then I use it with a weight of around 0.3, which gives a good result on the stylization task and removes flickering as well. A rough sketch of the setup is below.
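Roughly, the idea looks like this with diffusers + peft. This is a minimal sketch, not my actual training code: the `to_q`/`to_k`/`to_v`/`to_out.0` target names follow diffusers' attention-projection naming, and the sinusoidal frame-position embedding is just one illustrative way to fold the frame index into the prompt embeddings.

```python
import torch
from peft import LoraConfig
from diffusers import UNet2DConditionModel

# LoRA over every attention projection. In an AnimateDiff setup the
# motion module's temporal attention uses the same projection names,
# so those layers are matched as well.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)   # freeze the base UNet
unet.add_adapter(lora_config)  # inject trainable LoRA layers


def add_frame_position(text_emb: torch.Tensor, frame_idx: int,
                       num_frames: int) -> torch.Tensor:
    """Mix a sinusoidal frame-position signal into the prompt embeddings.

    text_emb: (batch, seq_len, dim) CLIP embeddings for one frame.
    This encoding is an illustrative assumption, not the only option.
    """
    dim = text_emb.shape[-1]
    pos = frame_idx / max(num_frames - 1, 1)  # normalized position in clip
    freqs = torch.arange(0, dim, 2, device=text_emb.device).float()
    angles = pos / (10000.0 ** (freqs / dim))
    pe = torch.zeros(dim, device=text_emb.device)
    pe[0::2] = torch.sin(angles)
    pe[1::2] = torch.cos(angles)
    return text_emb + 0.1 * pe  # small scale so the prompt still dominates
```

At inference time, a pipeline-level LoRA scale such as `cross_attention_kwargs={"scale": 0.3}` in diffusers gives the ~0.3 weight I mentioned.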
Thanks a lot for your suggestion! I'll try training a LoRA on all attention layers.