VoicePuppet
Audio-driven video synthesis
- This repository provides a general pipeline to automatically generate a speaking actor from voice input.
- For a quick impression, there's a short video demonstrating it.
The architecture of the network
- The network is composed of two parts. The first, called BFMNet (Basel Face Model network), predicts the 3D face coefficients of each frame aligned to a fixed stride window of the waveform. The second, called PixReferNet, redraws the real face foreground using the rasterized face rendered from the 3D face coefficients of the previous step.
[image: BFMNet component]
[image: PixReferNet component]
Run the prediction pipeline
- Download the pretrained checkpoints and the required models.
Baidu Disk: [ckpt.zip, code: a6pn], [allmodels.zip, code: brfh]
or Google Drive: [ckpt.zip], [allmodels.zip]
- Extract `ckpt.zip` to `ckpt_bfmnet` and `ckpt_pixrefer`, and extract `allmodels.zip` to the current root dir.
- Build the cython extension:
cd utils/cython && python3 setup.py install
- Install the ffmpeg tool if you want to merge the png sequence and the audio file into a video container like mp4 (a sketch follows the inference command below).
- Run the inference:
python3 voicepuppet/pixrefer/infer_bfmvid.py --config_path config/params.yml sample/22.jpg sample/test.aac
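If ffmpeg is installed, merging the generated png sequence with the driving audio can look like the sketch below. The frame rate (25 fps) and the `output/` folder are assumptions here; adjust them to whatever `infer_bfmvid.py` actually writes on your machine.

```sh
# Assumed layout: inference wrote frames as output/0.png, output/1.png, ...
# 25 fps is a guess; match it to the model's actual frame rate.
ffmpeg -framerate 25 -start_number 0 -i output/%d.png -i sample/test.aac \
       -c:v libx264 -pix_fmt yuv420p -c:a aac -shortest result.mp4
```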
Run the training pipeline
Requirements
- tensorflow>=1.14.0
- pytorch>=1.4.0, only for data preparation (face foreground segmentation and matting)
- mxnet>=1.5.1, only for data preparation (face alignment). Tip: you can use other models, such as dlib, to do the same label marking instead.
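A minimal environment sketch, assuming pip and Python 3; the exact builds (CPU vs CUDA) depend on your machine, so treat the pins below as a starting point rather than the project's official setup.

```sh
# Assumed pins matching the requirements above; pick CUDA/CPU builds as needed.
pip3 install "tensorflow>=1.14,<2.0"  # training and inference
pip3 install "torch>=1.4.0"           # data prep only: segmentation and matting
pip3 install "mxnet>=1.5.1"           # data prep only: face alignment
```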
Data preparation
- Check your `config/params.yml` to make sure the dataset folder follows the structure below (the same as the GRID dataset; you can extend the dataset with any common video files arranged in the same folder structure). A quick layout check is sketched after the tree.
|- srcdir/
| |- s10/
| |- video/
| |- mpg_6000/
| |- bbab8n.mpg
| |- bbab9s.mpg
| |- bbac1a.mpg
| |- ...
| |- s8/
| |- video/
| |- mpg_6000/
| |- bbae5n.mpg
| |- bbae6s.mpg
| |- bbae7p.mpg
| |- ...
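A quick way to confirm the layout, assuming `srcdir` is the dataset root shown above:

```sh
# Every video should sit under <speaker>/video/mpg_6000/; list a few to verify.
find srcdir -path '*/video/mpg_6000/*.mpg' | sort | head
```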
- Extract the audio stream from each mpg video file; `todir` is the output folder where you want to store the labels.
python3 datasets/make_data_from_GRID.py --gpu 0 --step 2 srcdir todir
- Face detection and alignment
python3 datasets/make_data_from_GRID.py --gpu 0 --step 3 srcdir todir ./allmodels
- 3D face reconstruction
python3 datasets/make_data_from_GRID.py --gpu 0 --step 4 todir ./allmodels
- The above steps take several hours to finish; afterwards you'll find `*.jpg`, `landmark.txt`, `audio.wav`, and `bfmcoeff.txt` in each output subfolder. Only the labels `audio.wav` and `bfmcoeff.txt` are used for BFMNet training; the others are temp files. A sanity check over the output tree is sketched after it.
|- todir/
| |- s10/
| |- bbab8n/
| |- landmark.txt
| |- audio.wav
| |- bfmcoeff.txt
| |- 0.jpg
| |- 1.jpg
| |- ...
| |- bbab9s/
| |- ...
| |- s8/
| |- bbae5n/
| |- landmark.txt
| |- audio.wav
| |- bfmcoeff.txt
| |- 0.jpg
| |- 1.jpg
| |- ...
| |- bbae6s/
| |- ...
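A small sanity check over the output tree, assuming `todir` as above; it flags any clip folder that is missing one of the label files:

```sh
# Report clip folders that lack any of the expected label files.
for d in todir/*/*/; do
  for f in landmark.txt audio.wav bfmcoeff.txt; do
    [ -e "$d$f" ] || echo "missing $f in $d"
  done
done
```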
- Face (human foreground) segmentation and matting for PixReferNet training. Before invoking the python script, make sure the width and height of the video are equal (1:1 aspect ratio); a check is sketched after this step. In general, 3-5 minutes of video is enough to train the PixReferNet network, and the trained model will only work for that specific person.
python3 datasets/make_data_from_GRID.py --gpu 0 --step 6 src_dir to_dvp_dir ./allmodels
`src_dir` has the same folder structure as [tip 1 in Data preparation]. When the above step finishes, you will find `*.jpg` files in the subfolders.
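To verify the 1:1 aspect ratio mentioned above, ffprobe can print the stream dimensions; `input.mp4` is a placeholder for your own video, and the crop line is just one way to square a video, not part of this repository's tooling.

```sh
# Width and height of the first video stream; the two numbers must be equal.
ffprobe -v error -select_streams v:0 -show_entries stream=width,height \
        -of csv=p=0 input.mp4
# If they differ, a center crop to a square is one option:
ffmpeg -i input.mp4 -vf "crop='min(iw,ih)':'min(iw,ih)'" -c:a copy square.mp4
```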
Train BFMNet
- Prepare the train and eval txt lists; check that the `root_path` parameter in `config/params.yml` points to the output folder of [tip 1 in Data preparation].
python3 datasets/makelist_bfm.py --config_path config/params.yml
- Train the model
python3 voicepuppet/bfmnet/train_bfmnet.py --config_path config/params.yml
- Watch the evaluation images written every 1000 steps to `log/eval_bfmnet`; the upper row is the target sequence and the lower row is the evaluated sequence. A one-liner to find the newest image follows below.
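To peek at the latest evaluation image without a file browser, a one-liner like the following works; the filename pattern inside `log/eval_bfmnet` is an assumption.

```sh
# Print the most recently written evaluation image in the eval folder.
ls -t log/eval_bfmnet | head -n 1
```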
Train PixReferNet
- Prepare the train and eval txt lists; check that the `root_path` parameter in `config/params.yml` points to the output folder of [tip 6 in Data preparation].
python3 datasets/makelist_pixrefer.py --config_path config/params.yml
- Train the model
python3 voicepuppet/pixrefer/train_pixrefer.py --config_path config/params.yml
- Use tensorboard to watch the training process
tensorboard --logdir=log/summary_pixrefer
Acknowledgement
- The face alignment model is based on Deepinx's work; it's more stable than dlib.
- The 3D face reconstruction model is based on Microsoft's work.
- The image segmentation model is based on gasparian's work.
- The image matting model is based on foamliu's work.