MuseTalk
About the relationship between Whisper and the pretrained SDv1.4 UNet
In this work, the authors adopted Whisper-tiny (d_model=384) to extract audio features, while training the UNet from scratch. I guess the reason for training from scratch instead of loading the pretrained SDv1.4 weights is that the pretrained model has cross_attention_dim=768, whereas the feature dimension of Whisper-tiny is 384. Hence, I wonder why not use Whisper-small (d_model=768), whose dimension matches the pretrained SDv1.4 UNet, so that the strong SDv1.4 pretrained weights could be utilized?
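For reference, here is a minimal sketch (not from the MuseTalk codebase, assuming `transformers` and `diffusers` are installed) that checks the two dimensions being compared: the Whisper encoder widths and the cross-attention width expected by the pretrained SD v1.4 UNet.

```python
# Compare Whisper encoder widths with the SD v1.4 UNet cross-attention width.
from transformers import WhisperConfig
from diffusers import UNet2DConditionModel

# Whisper encoder hidden sizes (d_model)
tiny = WhisperConfig.from_pretrained("openai/whisper-tiny")
small = WhisperConfig.from_pretrained("openai/whisper-small")
print("whisper-tiny d_model:", tiny.d_model)    # 384
print("whisper-small d_model:", small.d_model)  # 768

# Conditioning width expected by the pretrained SD v1.4 UNet
unet_config = UNet2DConditionModel.load_config(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
print("SD v1.4 cross_attention_dim:", unet_config["cross_attention_dim"])  # 768
```

So Whisper-small would indeed match the 768-dim cross-attention of SD v1.4 without changing the projection width, which is why I am curious about this design choice.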