Wav2Lip [Help needed!] There's clear box region around the mouth when using personal video

Dear author,

Thanks for sharing the excellent work.

When I use my own video, there is a clearly visible box region around the mouth in the output, as shown below:

What could be the reason for this, and could you please give me some guidance on how to solve it?

Many thanks for the help.

Best.

Sep 19 '22 03:09 liuquande

have the same problem here

Oct 03 '22 12:10 alexanderj1988

The problem is that the model works at 96x96 resolution, so it downscales the face square and then upscales the result to fit your source video. There's no solution; you can only train a hi-res model )

Nov 05 '22 11:11 NikitaKononov

@liuquande I trained the model on AVSpeech and hit the same problem. Did you use the pretrained model?

Jan 14 '23 03:01 Curisan

This happens because the detected face crop is resized to 96x96 before being fed to the network, and the lip-synced output face, which is also 96x96, is resized back to the original dimensions of the face. If the original face is larger than 96x96, up-sampling the output (probably by interpolation) can blur it, so the pasted box stands out against the sharper frame around it. The bigger the difference between the original face size and the 96x96 input size, the more conspicuous those edges will be. I see two straightforward ways to deal with it:

1. Downsample the input frames using the --resize_factor argument. This reduces the video resolution, but it also reduces the face dimensions and mitigates the square effect.
2. (Requires modifying code.) Compute a face mask for each frame so that only the lip-synced face itself is pasted, not its surroundings. Several libraries can do this in different ways, such as MediaPipe and face-parsing.

I implemented (2) and it completely eliminated the square while keeping the lip-synced face intact.
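To make the resolution argument concrete, here is a toy sketch of the shrink-to-96-and-enlarge round trip. It is not Wav2Lip's code: it assumes block averaging for the downscale and pixel repetition for the upscale (the real pipeline resizes with OpenCV), uses a random texture in place of a face, and the name roundtrip_error is made up for illustration.

```python
import numpy as np

NET_SIZE = 96  # resolution the thread says the model works at


def roundtrip_error(face_size, seed=0):
    """Mean absolute error after shrinking a square crop to 96x96 and enlarging it back.

    Toy version: face_size must be a multiple of 96.
    """
    if face_size % NET_SIZE != 0:
        raise ValueError("face_size must be a multiple of 96 in this sketch")
    k = face_size // NET_SIZE
    rng = np.random.default_rng(seed)
    face = rng.random((face_size, face_size))  # random texture standing in for a face
    # Downscale by averaging each k x k block into one pixel.
    small = face.reshape(NET_SIZE, k, NET_SIZE, k).mean(axis=(1, 3))
    # Upscale by repeating each pixel k times along both axes.
    restored = np.repeat(np.repeat(small, k, axis=0), k, axis=1)
    return float(np.abs(face - restored).mean())


for size in (96, 192, 384):
    print(size, round(roundtrip_error(size), 3))
```

In this toy the error is zero for a 96 px face and grows as the face gets larger. That is also why option 1 can help: with --resize_factor 2, a 384 px face reaches the model as a 192 px face.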

May 21 '23 11:05 YinonDouchanClarity

Could you please share the modified code? I'm a beginner, and this issue has been bothering me for a long time. Thank you very much!

Jun 07 '23 17:06 stevin-dong

It's a bit problematic for me to send the code. However, I can send the modifications. Take them as rough guidelines rather than exact. Read about how to use MediaPipe.

Install MediaPipe:

pip install mediapipe==0.10.0

Download face landmarker model and put it in the "weights" folder:

wget -O weights/face_landmarker_v2_with_blendshapes.task -q https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task

Add MediaPipe imports:

from mediapipe.tasks import python
from mediapipe.tasks.python import vision
import mediapipe as mp

Add face mask args:

	parser.add_argument('--face_landmarks_detector_path', default='weights/face_landmarker_v2_with_blendshapes.task',
						type=str, help='Path to face landmarks detector')
	parser.add_argument('--with_face_mask', action='store_true',
						help='Blend output into original frame using a face mask rather than directly blending the face box. This prevents a lower resolution square artifact around lower face')

In main(), in the loop starting with "for p, f, c in ...", change the line f[y1:y2, x1:x2] = p to:

				mask = face_mask_from_image(p, face_landmarks_detector)
				f[y1:y2, x1:x2] = f[y1:y2, x1:x2] * (1 - mask[..., None]) + p * mask[..., None]
			else:
				f[y1:y2, x1:x2] = p
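
The fragment above starts inside a branch whose opening if line is not shown. Presumably it tests the --with_face_mask flag, but that is an assumption. A minimal self-contained sketch of the whole replacement follows; paste_prediction is a hypothetical helper, and mask=None stands in for the flag being off.

```python
import numpy as np


def paste_prediction(frame, pred, box, mask=None):
    """Paste `pred` into `frame` at `box`; blend through `mask` when one is given.

    frame: (H, W, 3) uint8 image, modified in place
    pred:  (y2 - y1, x2 - x1, 3) uint8 lip-synced face, already resized to the box
    box:   (y1, y2, x1, x2), the same order as the coords in the loop
    mask:  optional (y2 - y1, x2 - x1) uint8 array of 0s and 1s
    """
    y1, y2, x1, x2 = box
    if mask is not None:
        m = mask[..., None]  # (h, w) -> (h, w, 1) so it broadcasts over colour channels
        # Keep the original pixels where m == 0, take the prediction where m == 1.
        frame[y1:y2, x1:x2] = frame[y1:y2, x1:x2] * (1 - m) + pred * m
    else:
        # Original behaviour: overwrite the whole box.
        frame[y1:y2, x1:x2] = pred
    return frame


# Tiny synthetic check: a black frame, a grey "prediction", a 2x2 mask in the middle.
frame = np.zeros((6, 6, 3), dtype=np.uint8)
pred = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
masked = paste_prediction(frame.copy(), pred, (1, 5, 1, 5), mask)
boxed = paste_prediction(frame.copy(), pred, (1, 5, 1, 5))
```

With the mask, only the four centre pixels of the box take the predicted values. Without it, the whole 4x4 box is overwritten, which is what produces the visible square.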

Add face_mask_from_image function:

def face_mask_from_image(image, face_landmarks_detector):
	"""
	Calculate face mask from image. This is done by detecting face landmarks
	and filling their convex hull.

	Args:
		image: numpy array of an image
		face_landmarks_detector: MediaPipe face landmarks detector
	Returns:
		A uint8 numpy array with the same height and width as the input image, containing a binary mask of the face in the image
	"""
	# initialize mask
	mask = np.zeros((image.shape[0], image.shape[1]), dtype=np.uint8)

	# detect face landmarks
	mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=image)
	detection = face_landmarks_detector.detect(mp_image)

	if len(detection.face_landmarks) == 0:
		# no face detected - set mask to all of the image
		mask[:] = 1
		return mask

	# extract landmarks coordinates
	face_coords = np.array([[lm.x * image.shape[1], lm.y * image.shape[0]] for lm in detection.face_landmarks[0]])

	# calculate convex hull from face coordinates
	convex_hull = cv2.convexHull(face_coords.astype(np.float32))

	# apply convex hull to mask
	return cv2.fillPoly(mask, pts=[convex_hull.squeeze().astype(np.int32)], color=1)

Jun 07 '23 17:06 YinonDouchanClarity

Thank you very much for your help!

Jun 08 '23 04:06 stevin-dong

I tried this code in inference.py, but it fails with an error: face_landmarks_detector is not defined. Can you show how to create this object with MediaPipe?

Jun 16 '23 05:06 liumaokun2022

I tried this code in inference.py, but it fails with an error: face_landmarks_detector is not defined. Can you show how to create this object with MediaPipe?

Jun 27 '23 06:06 sailorsale

				mask = face_mask_from_image(p, face_landmarks_detector)
				f[y1:y2, x1:x2] = f[y1:y2, x1:x2] * (1 - mask[..., None]) + p * mask[..., None]
			else:
				f[y1:y2, x1:x2] = p

This code is incomplete: there is an else, so why is there no if before it?

Oct 17 '23 06:10 dizhenx

Thanks, it works!

Dec 03 '23 03:12 AIFSH

Hi. For those wondering how to create face_landmarks_detector, here is the code:

BaseOptions = mp.tasks.BaseOptions
FaceLandmarker = mp.tasks.vision.FaceLandmarker
FaceLandmarkerOptions = mp.tasks.vision.FaceLandmarkerOptions
VisionRunningMode = mp.tasks.vision.RunningMode

options = FaceLandmarkerOptions(
	base_options=BaseOptions(model_asset_path=args.face_landmarks_detector_path),
	running_mode=VisionRunningMode.IMAGE)

face_landmarks_detector = FaceLandmarker.create_from_options(options)

But this won't necessarily solve the problem, because the edge of the pasted face may not match the surrounding frame. The only real solution may be to train on high-resolution images.

Dec 06 '23 06:12 Crestina2001

Using --resize_factor does not seem to help when my videos are already 720p.

Dec 06 '23 06:12 Crestina2001

hit the same problem

Dec 13 '23 15:12 EricKong1985
