Talking-Face_PC-AVS
Why embed the audio features?
Hi, thanks for sharing this great work!
I understand the main pipeline, i.e., encoding the speech content features, identity features, and pose features separately and then feeding them to the generator to produce the driven results. But I am a bit confused after reading the inference code. https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L473-L484
As can be seen, the mel-spectrogram is first encoded by the audio encoder at Line 473 and is then ready to be fused with the pose feature at Line 483. However, in the merge_mouthpose() function:
https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L454-L461
I found that the audio features are embedded once more. What is the intuition behind that? In my view, netE.mouth_embed
should be used to embed the mouth features extracted from the video, NOT from the audio. Please correct me if anything is wrong. Thanks in advance.
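For reference, here is a rough sketch of the flow as I currently read it (the module names, layer types, and feature sizes are all hypothetical stand-ins, not the actual implementation); the line I am asking about is the extra `mouth_embed` step applied to the audio feature:

```python
# Minimal sketch of my understanding of the inference flow.
# All names/shapes here are placeholders for illustration only.
import torch
import torch.nn as nn

class ToySketch(nn.Module):
    def __init__(self, mel_dim=80, feat_dim=512, pose_dim=12):
        super().__init__()
        # Stand-in for the audio (speech content) encoder.
        self.audio_encoder = nn.Sequential(nn.Linear(mel_dim, feat_dim), nn.ReLU())
        # Stand-in for netE.mouth_embed: the extra embedding I am asking about.
        self.mouth_embed = nn.Linear(feat_dim, feat_dim)
        # Stand-in for the generator's input stage.
        self.generator_in = nn.Linear(feat_dim + pose_dim, feat_dim)

    def forward(self, mel, pose_feature):
        audio_feature = self.audio_encoder(mel)          # encode the mel-spectrogram (cf. Line 473)
        mouth_feature = self.mouth_embed(audio_feature)  # <-- why embed the audio feature again here?
        fused = torch.cat([mouth_feature, pose_feature], dim=-1)  # fuse with the pose feature (cf. Line 483)
        return self.generator_in(fused)

sketch = ToySketch()
out = sketch(torch.randn(1, 80), torch.randn(1, 12))
print(out.shape)  # torch.Size([1, 512])
```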