EchoMimic quality drops with non-512x512 width and height

If I change -W and -H in infer_audio2vid_acc.py to anything other than 512x512, such as (384, 384), (1024, 1024), or (256, 256), the lip motion is degraded to varying degrees. The worst case is 1024x1024, where the motion of the whole face is destroyed and the face becomes unrecognizable.

-W 384 -H 384: https://github.com/user-attachments/assets/b4e85c38-760a-4f10-b87a-826dd4c774d8

-W 256 -H 256: https://github.com/user-attachments/assets/137c934b-0c62-4a89-81f1-70b15a7b54c3

-W 1024 -H 1024: https://github.com/user-attachments/assets/2d02e707-51c4-4eb5-8d4b-9a0575774021

Oct 03 '24 03:10 RockySong

The model was trained on a 512x512 dataset.

You can upscale the video after generation is complete.
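
One possible way to do that upscaling step is with ffmpeg's scale filter. This is only a minimal sketch, assuming ffmpeg is installed and on PATH; the helper names `build_upscale_cmd` and `upscale` and the file names are hypothetical and not part of EchoMimic.

```python
import subprocess


def build_upscale_cmd(src, dst, width=1024, height=1024):
    """Build an ffmpeg command that rescales src to width x height.

    Hypothetical helper for illustration; not part of EchoMimic.
    """
    return [
        "ffmpeg", "-y",                                   # overwrite dst if it exists
        "-i", src,                                        # input: the 512x512 generated video
        "-vf", f"scale={width}:{height}:flags=lanczos",   # Lanczos resampling
        "-c:a", "copy",                                   # keep the audio track unchanged
        dst,
    ]


def upscale(src, dst, width=1024, height=1024):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_upscale_cmd(src, dst, width, height), check=True)
```

For example, `upscale("output_512.mp4", "output_1024.mp4")` would write a 1024x1024 copy of a video generated at the default 512x512. Plain resampling like this only enlarges the frames and does not add detail.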

Oct 03 '24 18:10 nitinmukesh

Then what do the width and height options do? Are they there for cropping the video before/after generation?

Oct 26 '24 12:10 TanvirHafiz

> Then what do the width and height options do? Are they there for cropping the video before/after generation?

Unfortunately, I'm not a developer, so I can't explain any further. Someone who understands the code can explain.

Oct 26 '24 13:10 nitinmukesh