Kun Cheng
Thanks for your attention. We did try that. Detecting keypoints separately on each frame and then cropping around them results in _jitter_ in the cropped video. And this cropping strategy is...
What is the output of `torch.cuda.is_available()` in your environment?
You should run `torch.cuda.is_available()` in Python, not in the terminal:

```
import torch
print(torch.cuda.is_available())
```
That looks fine. What is the full name of your GPU? And what is the GPU utilization when executing Step 6?
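If it's easier, you can query the device name and total memory directly from Python with standard `torch.cuda` calls (GPU utilization itself is simplest to watch with `nvidia-smi` in a separate terminal):

```
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name)
    print(f"{props.total_memory / 1024**3:.1f} GB total memory")
```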
You can try reducing these two parameters: `--face_det_batch_size` and `--LNet_batch_size`. Or you can run the code in Colab: [https://colab.research.google.com/github/vinthony/video-retalking/blob/main/quick_demo.ipynb](https://colab.research.google.com/github/vinthony/video-retalking/blob/main/quick_demo.ipynb)
> I met the same error, how to resolve it?

You can try a smaller batch size:

```
python3 inference.py \
  --face examples/face/1.mp4 \
  --audio examples/audio/1.wav \
  --outfile results/1_1.mp4 \
  --face_det_batch_size 2 ...
```
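For reference, a full invocation with both batch-size flags from above reduced might look like this (the values are illustrative starting points, not tested defaults; lower them further if you still run out of memory):

```
python3 inference.py \
  --face examples/face/1.mp4 \
  --audio examples/audio/1.wav \
  --outfile results/1_1.mp4 \
  --face_det_batch_size 2 \
  --LNet_batch_size 2
```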
The training dataset is in English. The model can generalize to other languages, but performance degrades to some extent. Retraining LNet on a suitable large-scale Chinese video dataset might improve the results.
For LNet's training procedure you can currently refer to [Wav2Lip](https://github.com/Rudrabha/Wav2Lip#train); like Wav2Lip, we train with self-reconstruction on the LRS2 dataset. Transferring the training to a different dataset takes some work: for data collected from the web you first need to align the audio and video, then train a lip-sync discriminator, and finally train the lip-sync network. See [here](https://github.com/Rudrabha/Wav2Lip#training-on-datasets-other-than-lrs2) for details.
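As a rough illustration of the self-reconstruction setup described above, the sketch below shows the shape of a Wav2Lip-style training step: a generator reconstructs the mouth region from an audio window, supervised by an L1 reconstruction loss plus a frozen, pretrained lip-sync expert. All modules here are tiny hypothetical placeholders, not the actual LNet or SyncNet code.

```
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real networks; the actual LNet and the
# Wav2Lip lip-sync expert are convolutional and far larger.
class TinyLipGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_enc = nn.Linear(80 * 16, 128)   # mel window -> embedding
        self.decode = nn.Linear(128, 3 * 96 * 96)  # embedding -> mouth crop

    def forward(self, mel):
        h = self.audio_enc(mel.flatten(1))
        return self.decode(h).view(-1, 3, 96, 96)

class TinySyncExpert(nn.Module):
    """Scores audio/lip agreement; in practice pretrained and frozen."""
    def __init__(self):
        super().__init__()
        self.face_enc = nn.Linear(3 * 96 * 96, 64)
        self.audio_enc = nn.Linear(80 * 16, 64)

    def forward(self, frames, mel):
        f = nn.functional.normalize(self.face_enc(frames.flatten(1)), dim=1)
        a = nn.functional.normalize(self.audio_enc(mel.flatten(1)), dim=1)
        return (f * a).sum(dim=1)  # cosine similarity per sample

gen, expert = TinyLipGenerator(), TinySyncExpert()
for p in expert.parameters():  # the sync expert stays frozen
    p.requires_grad_(False)
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)

# One self-reconstruction step on dummy data: the target frames come from
# the same clip as the audio, so ground truth needs no extra labels.
mel = torch.randn(8, 80, 16)       # batch of mel-spectrogram windows
target = torch.rand(8, 3, 96, 96)  # matching ground-truth mouth crops

pred = gen(mel)
recon_loss = nn.functional.l1_loss(pred, target)
sync_loss = (1 - expert(pred, mel)).mean()  # push similarity toward 1
loss = recon_loss + 0.03 * sync_loss        # small sync weight, as in Wav2Lip
loss.backward()
opt.step()
print(float(loss))
```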
> @kunncheng The SadTalker project says it was trained on the VoxCeleb1 dataset, and its Chinese lip motion seems somewhat better than video-retalking's. Are there any plans to release models trained on other datasets?

SadTalker animates a single image, whereas this project edits a video; the multi-frame and single-frame tasks differ in difficulty. That gap is exactly what DNet is meant to close: it reduces multi-frame driving to the single-frame case by normalizing the mouth shape. We have also tried training on other datasets, but training either failed to converge or brought no clear performance gain, so there are no such plans for now.
You can try running GFPGAN/GPEN again, or other super-resolution / face restoration methods, to enhance the generated video. It's worth noting that GFPGAN slightly changes the identity.
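As a minimal sketch, here is one way to post-process a generated video frame by frame with GFPGAN's Python API, assuming `gfpgan` and `opencv-python` are installed and the `GFPGANv1.3.pth` weights have been downloaded (paths and settings below are illustrative):

```
import cv2
from gfpgan import GFPGANer

# Illustrative paths; adjust to your environment.
restorer = GFPGANer(model_path='checkpoints/GFPGANv1.3.pth',
                    upscale=1, arch='clean', channel_multiplier=2,
                    bg_upsampler=None)

cap = cv2.VideoCapture('results/1_1.mp4')
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter('results/1_1_enhanced.mp4',
                      cv2.VideoWriter_fourcc(*'mp4v'), fps, (w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # enhance() detects faces, restores them, and pastes them back.
    _, _, restored = restorer.enhance(frame, has_aligned=False,
                                      only_center_face=False,
                                      paste_back=True)
    out.write(restored)

cap.release()
out.release()
```

Note that this drops the audio track; remux it afterwards with ffmpeg from the original output.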