
Cannot use Mac M4 GPU, only uses CPU, 1.8 FPS only

yuanmouren1hao opened this issue · 8 comments

yuanmouren1hao · Jun 20 '25 07:06

[image attachment]

yuanmouren1hao · Jun 20 '25 08:06

Can you try my repo and see: https://github.com/manumaan/Deep-Live-Cam-Mac

I have updated the core logic to utilize the Mac Metal GPU.

manumaan · Jun 24 '25 05:06

Same on M1 Pro: 1 FPS.

hdd99009 · Jun 26 '25 09:06

Can you try my repo and see: https://github.com/manumaan/Deep-Live-Cam-Mac

I have updated the core logic to utilize the Mac Metal GPU.

I tried this. It went from 1 FPS to 2-3 FPS, so it is 100% faster, but still no GPU usage.

hdd99009 · Jun 27 '25 07:06

The reason is that onnxruntime does not support Metal Performance Shaders (MPS): https://github.com/microsoft/onnxruntime/issues/21271. It supports CoreML, which for this project ends up running on the Apple Neural Engine (ANE) and the CPU. To utilize the GPU, you'll want to run the various models using MPS, e.g., with PyTorch.
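
Before following the steps, a quick sanity check (a minimal sketch) to confirm PyTorch can actually see the Metal backend:

import torch

# The MPS route described below needs a PyTorch build with Metal support.
if torch.backends.mps.is_available():
    print("MPS available:", torch.ones(1, device="mps"))
else:
    print("MPS not available; check your PyTorch install and macOS version.")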

Instructions for modifying DLC to run with MPS using PyTorch
  1. Create a fresh clone of the project: git clone https://github.com/hacksider/Deep-Live-Cam.git && cd Deep-Live-Cam
  2. Run brew install tcl-tk && brew install pyenv
  3. Install Python 3.10.16:
env \
  PATH="$(brew --prefix tcl-tk)/bin:$PATH" \
  LDFLAGS="-L$(brew --prefix tcl-tk)/lib" \
  CPPFLAGS="-I$(brew --prefix tcl-tk)/include" \
  PKG_CONFIG_PATH="$(brew --prefix tcl-tk)/lib/pkgconfig" \
  CFLAGS="-I$(brew --prefix tcl-tk)/include" \
  PYTHON_BUILD_HOMEBREW_OPENSSL_FORMULA="openssl@3" \
  PYTHON_CONFIGURE_OPTS="--with-tcltk-includes='-I$(brew --prefix tcl-tk)/include' --with-tcltk-libs='-L$(brew --prefix tcl-tk)/lib -ltcl8.6 -ltk8.6'" \
  pyenv install 3.10.16
pyenv shell 3.10.16
  4. Create a virtual environment: python -m venv venv && . venv/bin/activate
  5. Install dependencies (quote the specifiers containing > or < so the shell does not treat them as redirects): pip install "numpy>=1.23.5,<2" "typing-extensions>=4.8.0" opencv-python==4.10.0.84 cv2_enumerate_cameras==1.1.15 onnx==1.18.0 insightface==0.7.3 psutil==5.9.8 tk==0.1.0 customtkinter==5.2.2 pillow==11.1.0 torch==2.8.0 torchvision==0.23.0 onnxruntime==1.22.1 tensorflow==2.20.0 opennsfw2==0.10.2 protobuf==6.32.0 onnx2torch==1.5.15
  6. Set up gfpgan and basicsr (the second command patches basicsr's degradations.py to account for an import that newer torchvision versions no longer provide):
pip install gfpgan
python -c "import sys; import fileinput; [(sys.stdout.write('from torchvision.transforms.functional import rgb_to_grayscale\n') if line == 'from torchvision.transforms.functional_tensor import rgb_to_grayscale\n' else sys.stdout.write(line)) for line in fileinput.FileInput('./venv/lib/python3.10/site-packages/basicsr/data/degradations.py', inplace=True, backup='.bak')];"
  7. [optional] Run the app, note the FPS, and use Activity Monitor to confirm that no GPU is being utilized: python run.py --execution-provider coreml
  8. Enable GPU for the inswapper_128 model by replacing the contents of venv/lib/python3.10/site-packages/insightface/model_zoo/inswapper.py with:
import time
import numpy as np
import onnxruntime
import cv2
import onnx
from onnx import numpy_helper
from ..utils import face_align
import torch
from onnx2torch import convert


class INSwapper():
    def __init__(self, model_file=None, session=None):
        self.model_file = model_file
        self.session = session
        model = onnx.load(self.model_file)
        graph = model.graph
        self.emap = numpy_helper.to_array(graph.initializer[-1])
        self.input_mean = 0.0
        self.input_std = 255.0
        #print('input mean and std:', model_file, self.input_mean, self.input_std)
        if self.session is None:
            self.session = onnxruntime.InferenceSession(self.model_file, None)
        inputs = self.session.get_inputs()
        self.input_names = []
        for inp in inputs:
            self.input_names.append(inp.name)
        outputs = self.session.get_outputs()
        output_names = []
        for out in outputs:
            output_names.append(out.name)
        self.output_names = output_names
        assert len(self.output_names)==1
        output_shape = outputs[0].shape
        input_cfg = inputs[0]
        input_shape = input_cfg.shape
        self.input_shape = input_shape
        print('inswapper-shape:', self.input_shape)
        self.input_size = tuple(input_shape[2:4][::-1])

        # Convert the ONNX graph to an equivalent PyTorch module and move it onto the Metal (MPS) device
        self.torch = convert(self.model_file).to("mps")

    def forward(self, img, latent):
        img = (img - self.input_mean) / self.input_std
        # pred = self.session.run(self.output_names, {self.input_names[0]: img, self.input_names[1]: latent})[0]
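        # Run inference through the converted PyTorch module on the MPS device instead of the ONNX session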
        pred = self.torch(torch.from_numpy(img).to('mps'), torch.from_numpy(latent).to('mps')).detach().cpu().numpy()
        return pred

    def get(self, img, target_face, source_face, paste_back=True):
        aimg, M = face_align.norm_crop2(img, target_face.kps, self.input_size[0])
        blob = cv2.dnn.blobFromImage(aimg, 1.0 / self.input_std, self.input_size,
                                      (self.input_mean, self.input_mean, self.input_mean), swapRB=True)
        latent = source_face.normed_embedding.reshape((1,-1))
        latent = np.dot(latent, self.emap)
        latent /= np.linalg.norm(latent)
        # pred = self.session.run(self.output_names, {self.input_names[0]: blob, self.input_names[1]: latent})[0]
        # print(pred.shape, self.input_names, self.output_names)
        # Same as forward(): run the swap through the PyTorch module on MPS, then move the result back to the CPU
        pred = self.torch(torch.from_numpy(blob).to('mps'), torch.from_numpy(latent).to('mps')).detach().cpu().numpy()
        img_fake = pred.transpose((0,2,3,1))[0]
        bgr_fake = np.clip(255 * img_fake, 0, 255).astype(np.uint8)[:,:,::-1]
        if not paste_back:
            return bgr_fake, M
        else:
            target_img = img
            fake_diff = bgr_fake.astype(np.float32) - aimg.astype(np.float32)
            fake_diff = np.abs(fake_diff).mean(axis=2)
            fake_diff[:2,:] = 0
            fake_diff[-2:,:] = 0
            fake_diff[:,:2] = 0
            fake_diff[:,-2:] = 0
            IM = cv2.invertAffineTransform(M)
            img_white = np.full((aimg.shape[0],aimg.shape[1]), 255, dtype=np.float32)
            bgr_fake = cv2.warpAffine(bgr_fake, IM, (target_img.shape[1], target_img.shape[0]), borderValue=0.0)
            img_white = cv2.warpAffine(img_white, IM, (target_img.shape[1], target_img.shape[0]), borderValue=0.0)
            fake_diff = cv2.warpAffine(fake_diff, IM, (target_img.shape[1], target_img.shape[0]), borderValue=0.0)
            img_white[img_white>20] = 255
            fthresh = 10
            fake_diff[fake_diff<fthresh] = 0
            fake_diff[fake_diff>=fthresh] = 255
            img_mask = img_white
            mask_h_inds, mask_w_inds = np.where(img_mask==255)
            mask_h = np.max(mask_h_inds) - np.min(mask_h_inds)
            mask_w = np.max(mask_w_inds) - np.min(mask_w_inds)
            mask_size = int(np.sqrt(mask_h*mask_w))
            k = max(mask_size//10, 10)
            #k = max(mask_size//20, 6)
            #k = 6
            kernel = np.ones((k,k),np.uint8)
            img_mask = cv2.erode(img_mask,kernel,iterations = 1)
            kernel = np.ones((2,2),np.uint8)
            fake_diff = cv2.dilate(fake_diff,kernel,iterations = 1)
            k = max(mask_size//20, 5)
            #k = 3
            #k = 3
            kernel_size = (k, k)
            blur_size = tuple(2*i+1 for i in kernel_size)
            img_mask = cv2.GaussianBlur(img_mask, blur_size, 0)
            k = 5
            kernel_size = (k, k)
            blur_size = tuple(2*i+1 for i in kernel_size)
            fake_diff = cv2.GaussianBlur(fake_diff, blur_size, 0)
            img_mask /= 255
            fake_diff /= 255
            #img_mask = fake_diff
            img_mask = np.reshape(img_mask, [img_mask.shape[0],img_mask.shape[1],1])
            fake_merged = img_mask * bgr_fake + (1-img_mask) * target_img.astype(np.float32)
            fake_merged = fake_merged.astype(np.uint8)
            return fake_merged
  9. Run the app with python run.py --execution-provider coreml and check Activity Monitor to see that some of the GPU is now being used.

Similar edits can be used to run the other models on the GPU, but the gain is limited by the overhead of moving data between the CPU and the GPU.

Using this, I was able to go from 0.7 FPS to 5 FPS on my M1 Pro (66% GPU utilization). Of course, it is not a complete solution. For optimal performance, the entire data flow should be rewritten to keep everything on the GPU and/or to take better advantage of parallelism (running the small models on the CPU while running the larger one on the GPU, and optimizing the graph of inswapper_128). These are more substantial changes. An additional avenue to explore is quantization: to start, you can experiment with replacing .to("mps") with .to("mps").half() to run in half precision.
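
For reference, a minimal sketch of what the half-precision variant of the MPS path could look like (untested; the inputs also need to be cast to float16, and the output cast back to float32 before handing it to NumPy/OpenCV):

import numpy as np
import torch
from onnx2torch import convert

# Sketch only: convert the same inswapper_128.onnx used above (adjust the path
# to wherever the model file lives) and run it in fp16 on MPS.
model = convert("inswapper_128.onnx").to("mps").half()

def swap_forward(img: np.ndarray, latent: np.ndarray) -> np.ndarray:
    # Inputs must also be float16 on MPS; cast the output back to float32 for NumPy.
    pred = model(
        torch.from_numpy(img).to("mps").half(),
        torch.from_numpy(latent).to("mps").half(),
    )
    return pred.float().detach().cpu().numpy()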

SanderGi · Sep 09 '25 13:09

Turns out onnxruntime supports specifying CoreML provider options. This means we can use the GPU directly with ONNX (no PyTorch conversion needed).

Instructions for modifying DLC to run on Apple Silicon GPU using CoreML

Follow the instructions in https://github.com/hacksider/Deep-Live-Cam/issues/1495, then open modules/processors/frame/face_swapper.py and replace the get_face_swapper function with the following:

def get_face_swapper() -> Any:
    global FACE_SWAPPER

    with THREAD_LOCK:
        if FACE_SWAPPER is None:
            model_name = "inswapper_128.onnx"
            if "CUDAExecutionProvider" in modules.globals.execution_providers:
                model_name = "inswapper_128_fp16.onnx"
            model_path = os.path.join(models_dir, model_name)
            FACE_SWAPPER = insightface.model_zoo.get_model(
                model_path,
                # providers=modules.globals.execution_providers,
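                # Explicit CoreML provider options: compile the model as an MLProgram and schedule it on the CPU and GPU rather than the ANE.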
                providers=[
                    (
                        (
                            "CoreMLExecutionProvider",
                            {
                                "ModelFormat": "MLProgram",
                                "MLComputeUnits": "CPUAndGPU",
                                "SpecializationStrategy": "FastPrediction",
                                "AllowLowPrecisionAccumulationOnGPU": 1,
                            },
                        )
                        if p == "CoreMLExecutionProvider"
                        else p
                    )
                    for p in modules.globals.execution_providers
                ],
            )
    return FACE_SWAPPER

Then run with python run.py --execution-provider coreml --live-mirror --execution-threads 4 (adjust threads as appropriate based on your device).

Combined with the pipelining approach in https://github.com/hacksider/Deep-Live-Cam/issues/1495, I was able to go from 1 FPS w/ 600 ms latency to 12 FPS w/ 300 ms latency on my M1. On an M4 Pro, the progression is:

  - unmodified: 7 FPS, 300 ms latency
  - PyTorch GPU: 10 FPS, 180 ms latency
  - CoreML GPU: 11 FPS, 160 ms latency
  - pipelining: 17 FPS, 300 ms latency
  - pipelining + PyTorch GPU: 18 FPS, 150 ms latency
  - pipelining + CoreML GPU: 30 FPS, 120 ms latency
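
For anyone who doesn't want to dig through that issue, the pipelining idea is roughly the following (a simplified illustration, not the actual code from #1495; swap_frame stands in for the per-frame face-swap call): keep a couple of frames in flight so the next frame is being swapped while the previous result is displayed.

from collections import deque
from concurrent.futures import ThreadPoolExecutor

import cv2

def run_pipelined(capture, swap_frame, depth=2):
    # Keep up to `depth` frames in flight: display the oldest finished result
    # while newer frames are still being processed on worker threads.
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=depth) as pool:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            in_flight.append(pool.submit(swap_frame, frame))
            if len(in_flight) >= depth:
                cv2.imshow("Deep-Live-Cam", in_flight.popleft().result())
                if cv2.waitKey(1) & 0xFF == ord("q"):
                    break
        # Drain the frames that are still being processed.
        while in_flight:
            cv2.imshow("Deep-Live-Cam", in_flight.popleft().result())
            cv2.waitKey(1)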

Credit to whisper.cpp's source code for making me aware of this option.

SanderGi · Sep 16 '25 00:09

Also seeing very poor performance on macOS here. Would love to see some optimizations merged in. Has anyone raised a pull request?

RSully · Oct 31 '25 02:10

@RSully, I believe the optimizations in this thread just got mostly merged in through this commit: https://github.com/hacksider/Deep-Live-Cam/commit/b82fdc3f31a91e9b92f5344057048ffecacedb85. There are more optimizations in the works in this thread: https://github.com/hacksider/Deep-Live-Cam/discussions/1553, but I personally will not have time to polish up a PR until some time in December. Help is definitely welcome.

SanderGi · Oct 31 '25 04:10