huyduong7101
### Describe the bug

As far as I know, UNetMotionModel is adopted for AnimateDiff. Hence, I looked into the original implementation of AnimateDiff and noticed that they use cross-attention...
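For context on the mechanism in question, a single-head cross-attention layer can be sketched in plain NumPy: queries come from one stream (e.g. spatial/video latents) while keys and values come from another (e.g. text embeddings). All dimensions below are illustrative assumptions, not the sizes AnimateDiff or UNetMotionModel actually use.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, d_head=64, seed=0):
    """Minimal single-head cross-attention sketch:
    queries from one stream, keys/values from another."""
    rng = np.random.default_rng(seed)
    d_q = query_tokens.shape[-1]
    d_c = context_tokens.shape[-1]
    # Random stand-ins for learned projection matrices.
    W_q = rng.standard_normal((d_q, d_head)) / np.sqrt(d_q)
    W_k = rng.standard_normal((d_c, d_head)) / np.sqrt(d_c)
    W_v = rng.standard_normal((d_c, d_head)) / np.sqrt(d_c)
    Q = query_tokens @ W_q            # (N_q, d_head)
    K = context_tokens @ W_k          # (N_c, d_head)
    V = context_tokens @ W_v          # (N_c, d_head)
    attn = softmax(Q @ K.T / np.sqrt(d_head))  # (N_q, N_c)
    return attn @ V                   # (N_q, d_head)

latents = np.random.default_rng(1).standard_normal((16, 320))  # e.g. spatial tokens
text = np.random.default_rng(2).standard_normal((77, 768))     # e.g. text embeddings
out = cross_attention(latents, text)
print(out.shape)  # (16, 64)
```

The key point is that the attention map has shape (N_q, N_c): each query token attends over the *other* modality's tokens, which is what distinguishes cross-attention from self-attention.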
May I ask why the authors adopted the pretrained VAE weights from https://huggingface.co/stabilityai/sd-vae-ft-mse instead of those from https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/vae?
In the scope of human-related video generation, there are two main emerging problems: Talking Face Generation (TFG) and Human Animation Generation (HAG). The discrepancy between these problems is...
In this work, the authors adopted Whisper-tiny (d_model=384) to extract audio features while training the UNet from scratch. I guess the reason for training from scratch instead of loading pretrained SDv1.4...
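One practical detail this raises: Whisper-tiny's 384-dim features do not match the cross-attention context width of an SD-style UNet, so some projection is needed. A minimal NumPy sketch of that step, where only d_model=384 comes from the post and the target width of 768 (SD v1.x's text-embedding dim) plus the frame count are assumptions:

```python
import numpy as np

D_AUDIO = 384   # Whisper-tiny hidden size, per the post
D_CROSS = 768   # assumed cross-attention context dim (SD v1.x convention)

rng = np.random.default_rng(0)
# Hypothetical learned linear projection (random stand-in here).
proj = rng.standard_normal((D_AUDIO, D_CROSS)) / np.sqrt(D_AUDIO)

# Suppose the audio encoder yields one 384-d vector per audio frame.
audio_features = rng.standard_normal((50, D_AUDIO))   # 50 frames, illustrative
audio_context = audio_features @ proj                 # (50, 768)
print(audio_context.shape)  # (50, 768)
```

The projected sequence can then be fed as the cross-attention context in place of (or alongside) text embeddings.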
"Bbox shift" has a significant impact on the output. Hence, does anyone try to use "bbox shift" as augmentation in training?