Quality drops with non-512x512 width and height
In infer_audio2vid_acc.py, if I change -W and -H from 512x512 to another size, such as (256, 256), (384, 384), or (1024, 1024), the lip motion is degraded to varying degrees. The worst case is 1024x1024, where the motion of the whole face is destroyed and the face becomes unrecognizable.
-W 384 -H 384: https://github.com/user-attachments/assets/b4e85c38-760a-4f10-b87a-826dd4c774d8
-W 256 -H 256: https://github.com/user-attachments/assets/137c934b-0c62-4a89-81f1-70b15a7b54c3
-W 1024 -H 1024: https://github.com/user-attachments/assets/2d02e707-51c4-4eb5-8d4b-9a0575774021
The model was trained on a 512x512 dataset.
You can generate at 512x512 and upscale the video after generation is complete.
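For anyone who wants to try the upscale-afterwards route, here is a rough sketch. It assumes ffmpeg is installed and on your PATH. The helper names (`build_upscale_command`, `upscale`) and the file names are made up for illustration and are not part of this repo.

```python
import subprocess


def build_upscale_command(src, dst, width=1024, height=1024):
    """Build an ffmpeg command that rescales the video and copies the audio."""
    return [
        "ffmpeg",
        "-y",              # overwrite dst if it already exists
        "-i", src,         # the 512x512 video produced by the model
        "-vf", f"scale={width}:{height}:flags=lanczos",  # Lanczos resampling
        "-c:a", "copy",    # keep the audio track unchanged
        dst,
    ]


def upscale(src, dst, width=1024, height=1024):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_upscale_command(src, dst, width, height), check=True)


# Example with hypothetical file names:
# upscale("output_512.mp4", "output_1024.mp4")
```

Plain resampling like this only enlarges the frames and does not add detail, so a dedicated super-resolution tool may look sharper.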
Then what do the width and height options do? Are they there for cropping the video before or after generation?
Unfortunately, I'm not a developer, so I can't explain any further. Someone who understands the code will have to answer that.