video-retalking: Confusion about why only the first frame of the video is used to crop the face region for stabilized image generation?


Open • NNNNAI opened this issue 1 year ago • 1 comment

Thanks for sharing your amazing work.

In the first step, you crop the face in each frame following the FFHQ style. But I notice that you use the coordinates from the first frame that contains a face to crop all the frames. The related code is linked here: https://github.com/OpenTalker/video-retalking/blob/d32e8e58248255e2d243eeaf3cba545dbe505ca8/utils/ffhq_preprocess.py#L118-L138

Why not detect keypoints separately for each frame and then crop each frame accordingly? If the person in the input video moves, for example by shaking their head, and the head leaves the crop region computed from the first frame, the result becomes very unsatisfactory. What is the reason for using only the coordinates of the first frame?

Thanks for your time, have a nice day~

Jan 08 '24 12:01 NNNNAI

Thanks for your attention. We did try that.

Detecting keypoints separately for each frame and then cropping produces jitter in the cropped video. That cropping strategy also differs from the one used when the rest of our network was trained, which leads to performance degradation.

Jan 08 '24 13:01 kunncheng
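
For readers who want to see the jitter effect described in the reply, here is a minimal sketch in plain Python. It does not reproduce `utils/ffhq_preprocess.py`: the helper names (`bbox`, `first_frame_boxes`, `per_frame_boxes`, `mean_jitter`, `simulate_landmarks`) are hypothetical, the crop is a simple axis-aligned box rather than an FFHQ-style alignment, and the landmarks are simulated as a static face plus a few pixels of detector noise. Under those assumptions it compares a crop box fixed from the first frame that contains a face against a box recomputed for every frame.

```python
import random


def bbox(points):
    """Axis-aligned box (x0, y0, x1, y1) around a list of (x, y) landmarks."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def first_frame_boxes(landmarks_per_frame):
    """Reuse the box of the first frame that has a detected face (None = no face)."""
    first = next(lm for lm in landmarks_per_frame if lm is not None)
    fixed = bbox(first)
    return [fixed] * len(landmarks_per_frame)


def per_frame_boxes(landmarks_per_frame):
    """Recompute the box for every frame; frames without a face reuse the latest box."""
    latest = bbox(next(lm for lm in landmarks_per_frame if lm is not None))
    boxes = []
    for lm in landmarks_per_frame:
        if lm is not None:
            latest = bbox(lm)
        boxes.append(latest)
    return boxes


def mean_jitter(boxes):
    """Average per-coordinate change of the box between consecutive frames, in pixels."""
    steps = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / 4.0
        for prev, cur in zip(boxes, boxes[1:])
    ]
    return sum(steps) / len(steps)


def simulate_landmarks(num_frames=50, noise_px=2.0, seed=0):
    """Hypothetical data: a static face whose detected landmarks wobble by +/- noise_px."""
    rng = random.Random(seed)
    base = [(100.0, 120.0), (160.0, 120.0), (130.0, 150.0),
            (110.0, 180.0), (150.0, 180.0)]
    return [
        [(x + rng.uniform(-noise_px, noise_px),
          y + rng.uniform(-noise_px, noise_px)) for x, y in base]
        for _ in range(num_frames)
    ]


if __name__ == "__main__":
    frames = simulate_landmarks()
    print("first-frame crop jitter:", mean_jitter(first_frame_boxes(frames)))
    print("per-frame crop jitter:  ", mean_jitter(per_frame_boxes(frames)))
```

Even though the simulated face never moves, the per-frame box shifts from frame to frame because of detector noise, while the first-frame box stays constant. The trade-off raised in the question still applies: a fixed box cannot follow a head that leaves the region computed from the first frame.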
