MuseTalk

About the relationship between Whisper and the pretrained SDv1.4 UNet

Open huyduong7101 opened this issue 6 months ago • 2 comments

In this work, the authors adopted Whisper-tiny (d_model=384) to extract audio features and trained the UNet from scratch. I guess the reason for training from scratch instead of loading the pretrained SDv1.4 weights is that the pretrained model has cross_attention_dim=768, while the feature dimension of Whisper-tiny is 384. Hence, I wonder: why not use Whisper-small (d_model=768), which has the same dimension as the pretrained SDv1.4 cross-attention input? Then we could take advantage of the strong pretrained SDv1.4 model.
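
To make the dimension argument concrete, here is a minimal sketch in plain Python (no model loading). The numbers 384 and 768 are the ones quoted above; the helper names and the `inner_dim=320` value are only illustrative assumptions, not code from MuseTalk or diffusers. It shows that a cross-attention key/value projection built for 768-dimensional context cannot take 384-dimensional Whisper-tiny features as-is, while Whisper-small features would fit.

```python
# Feature sizes as quoted in this issue.
WHISPER_D_MODEL = {"tiny": 384, "small": 768}
SD_V1_4_CROSS_ATTENTION_DIM = 768


def kv_projection_shape(inner_dim, cross_attention_dim):
    """Weight shape (out_features, in_features) of a key/value projection
    in a cross-attention layer. Hypothetical helper for illustration only."""
    return (inner_dim, cross_attention_dim)


def can_reuse_cross_attention(whisper_size,
                              cross_attention_dim=SD_V1_4_CROSS_ATTENTION_DIM):
    """True if the Whisper feature dim matches the dim the pretrained
    cross-attention projections expect as input."""
    return WHISPER_D_MODEL[whisper_size] == cross_attention_dim


if __name__ == "__main__":
    inner_dim = 320  # illustrative attention width, an assumption
    for size in ("tiny", "small"):
        needed = kv_projection_shape(inner_dim, WHISPER_D_MODEL[size])
        pretrained = kv_projection_shape(inner_dim, SD_V1_4_CROSS_ATTENTION_DIM)
        print(size, "needs", needed, "pretrained has", pretrained,
              "reusable:", can_reuse_cross_attention(size))
```

Note that matching shapes only means the weights would load; whether projections trained on text embeddings are a useful starting point for audio features is a separate question.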

huyduong7101 · Aug 07 '24 07:08