Talking-Face_PC-AVS
Why embed the audio features?
Hi, thanks for sharing this great work!
I understand the main pipeline, i.e., encoding the speech content features, identity features, and pose features separately and then feeding them to the generator to produce the driven results. But I am a bit confused after reading the inference code. https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L473-L484
As can be seen, the mel-spectrogram is first encoded by the audio encoder at Line 473 and is then ready to be fused with the pose feature at Line 483. However, in the merge_mouthpose() function:
https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L454-L461
I found that the audio features are embedded once more. What is the intuition behind that? In my view, netE.mouth_embed
should be used to embed the mouth features extracted from the video, NOT from the audio. Please correct me if anything is wrong. Thanks in advance.
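For reference, here is a rough sketch of the flow as I currently read it (the module names, layer types, and feature sizes are all hypothetical stand-ins, not the actual implementation); the line I am asking about is the extra `mouth_embed` step applied to the audio feature:

```python
# Minimal sketch of my understanding of the inference flow.
# All names/shapes here are placeholders for illustration only.
import torch
import torch.nn as nn

class ToySketch(nn.Module):
    def __init__(self, mel_dim=80, feat_dim=512, pose_dim=12):
        super().__init__()
        # Stand-in for the audio (speech content) encoder.
        self.audio_encoder = nn.Sequential(nn.Linear(mel_dim, feat_dim), nn.ReLU())
        # Stand-in for netE.mouth_embed: the extra embedding I am asking about.
        self.mouth_embed = nn.Linear(feat_dim, feat_dim)
        # Stand-in for the generator's input stage.
        self.generator_in = nn.Linear(feat_dim + pose_dim, feat_dim)

    def forward(self, mel, pose_feature):
        audio_feature = self.audio_encoder(mel)          # encode the mel-spectrogram (cf. Line 473)
        mouth_feature = self.mouth_embed(audio_feature)  # <-- why embed the audio feature again here?
        fused = torch.cat([mouth_feature, pose_feature], dim=-1)  # fuse with the pose feature (cf. Line 483)
        return self.generator_in(fused)

sketch = ToySketch()
out = sketch(torch.randn(1, 80), torch.randn(1, 12))
print(out.shape)  # torch.Size([1, 512])
```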