VoicePuppet
Audio-driven video synthesis
- This repository provides a general pipeline to automatically generate a speaking actor from voice input.
- For a quick impression, there's a short video demonstrating it.
The architecture of the network
- The network is composed of two parts. The first, called BFMNet (Basel Face Model network), predicts the 3D face coefficients of each frame aligned to a fixed stride window of the waveform. The second, called PixReferNet, redraws the real face foreground using the rasterized face rendered from the 3D face coefficients of the previous step.
[image: BFMNet component]
[image: PixReferNet component]
Run the prediction pipeline
- Download the pretrained checkpoints and the required models.
Baidu Disk: [ckpt.zip, code: a6pn], [allmodels.zip, code: brfh]
or Google Drive: [ckpt.zip], [allmodels.zip]
- Extract `ckpt.zip` to `ckpt_bfmnet` and `ckpt_pixrefer`, and extract `allmodels.zip` to the current root dir.
- Build the cython extension:
cd utils/cython && python3 setup.py install
- Install the ffmpeg tool if you want to merge the png sequence and the audio file into a video container like mp4 (a sketch follows the inference command below).
- Run the inference:
python3 voicepuppet/pixrefer/infer_bfmvid.py --config_path config/params.yml sample/22.jpg sample/test.aac
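If ffmpeg is installed, merging the generated png sequence with the driving audio can look like the sketch below. The frame rate (25 fps) and the `output/` folder are assumptions here; adjust them to whatever `infer_bfmvid.py` actually writes on your machine.

```sh
# Assumed layout: inference wrote frames as output/0.png, output/1.png, ...
# 25 fps is a guess; match it to the model's actual frame rate.
ffmpeg -framerate 25 -start_number 0 -i output/%d.png -i sample/test.aac \
       -c:v libx264 -pix_fmt yuv420p -c:a aac -shortest result.mp4
```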
Run the training pipeline
Requirements
- tensorflow>=1.14.0
- pytorch>=1.4.0, only for data preparation (face foreground segmentation and matting)
- mxnet>=1.5.1, only for data preparation (face alignment). Tip: you can use other models, such as dlib, to do the same label marking instead.
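A minimal environment sketch, assuming pip and Python 3; the exact builds (CPU vs CUDA) depend on your machine, so treat the pins below as a starting point rather than the project's official setup.

```sh
# Assumed pins matching the requirements above; pick CUDA/CPU builds as needed.
pip3 install "tensorflow>=1.14,<2.0"  # training and inference
pip3 install "torch>=1.4.0"           # data prep only: segmentation and matting
pip3 install "mxnet>=1.5.1"           # data prep only: face alignment
```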
Data preparation
- Check your `config/params.yml` to make sure the dataset folder follows the structure below (the same as the GRID dataset; you can extend the dataset with any common video files arranged in the same folder structure). A quick layout check is sketched after the tree.
|- srcdir/
| |- s10/
| |- video/
| |- mpg_6000/
| |- bbab8n.mpg
| |- bbab9s.mpg
| |- bbac1a.mpg
| |- ...
| |- s8/
| |- video/
| |- mpg_6000/
| |- bbae5n.mpg
| |- bbae6s.mpg
| |- bbae7p.mpg
| |- ...
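A quick way to confirm the layout, assuming `srcdir` is the dataset root shown above:

```sh
# Every video should sit under <speaker>/video/mpg_6000/; list a few to verify.
find srcdir -path '*/video/mpg_6000/*.mpg' | sort | head
```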
- Extract the audio stream from each mpg video file; `todir` is the output folder where you want to store the labels.
python3 datasets/make_data_from_GRID.py --gpu 0 --step 2 srcdir todir
- Face detection and alignment
python3 datasets/make_data_from_GRID.py --gpu 0 --step 3 srcdir todir ./allmodels
- 3D face reconstruction
python3 datasets/make_data_from_GRID.py --gpu 0 --step 4 todir ./allmodels
- The above steps take several hours to finish; afterwards you'll find `*.jpg`, `landmark.txt`, `audio.wav`, and `bfmcoeff.txt` in each output subfolder. Only the labels `audio.wav` and `bfmcoeff.txt` are used for BFMNet training; the others are temp files. A sanity check over the output tree is sketched after it.
|- todir/
| |- s10/
| |- bbab8n/
| |- landmark.txt
| |- audio.wav
| |- bfmcoeff.txt
| |- 0.jpg
| |- 1.jpg
| |- ...
| |- bbab9s/
| |- ...
| |- s8/
| |- bbae5n/
| |- landmark.txt
| |- audio.wav
| |- bfmcoeff.txt
| |- 0.jpg
| |- 1.jpg
| |- ...
| |- bbae6s/
| |- ...
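A small sanity check over the output tree, assuming `todir` as above; it flags any clip folder that is missing one of the label files:

```sh
# Report clip folders that lack any of the expected label files.
for d in todir/*/*/; do
  for f in landmark.txt audio.wav bfmcoeff.txt; do
    [ -e "$d$f" ] || echo "missing $f in $d"
  done
done
```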
- Face (human foreground) segmentation and matting for PixReferNet training. Before invoking the python script, make sure the width and height of the video are equal (1:1 aspect ratio); a check is sketched after this step. In general, 3-5 minutes of video is enough to train the PixReferNet network, and the trained model will only work for that specific person.
python3 datasets/make_data_from_GRID.py --gpu 0 --step 6 src_dir to_dvp_dir ./allmodels
`src_dir` has the same folder structure as [tip 1 in Data preparation]. When the above step finishes, you will find `*.jpg` files in the subfolders.
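To verify the 1:1 aspect ratio mentioned above, ffprobe can print the stream dimensions; `input.mp4` is a placeholder for your own video, and the crop line is just one way to square a video, not part of this repository's tooling.

```sh
# Width and height of the first video stream; the two numbers must be equal.
ffprobe -v error -select_streams v:0 -show_entries stream=width,height \
        -of csv=p=0 input.mp4
# If they differ, a center crop to a square is one option:
ffmpeg -i input.mp4 -vf "crop='min(iw,ih)':'min(iw,ih)'" -c:a copy square.mp4
```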
Train BFMNet
- Prepare the train and eval txt lists; check that the `root_path` parameter in `config/params.yml` points to the output folder of [tip 1 in Data preparation].
python3 datasets/makelist_bfm.py --config_path config/params.yml
- Train the model
python3 voicepuppet/bfmnet/train_bfmnet.py --config_path config/params.yml
- Watch the evaluation images written every 1000 steps to `log/eval_bfmnet`; the upper row is the target sequence and the lower row is the evaluated sequence. A one-liner to find the newest image follows below.
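To peek at the latest evaluation image without a file browser, a one-liner like the following works; the filename pattern inside `log/eval_bfmnet` is an assumption.

```sh
# Print the most recently written evaluation image in the eval folder.
ls -t log/eval_bfmnet | head -n 1
```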
Train PixReferNet
- Prepare the train and eval txt lists; check that the `root_path` parameter in `config/params.yml` points to the output folder of [tip 6 in Data preparation].
python3 datasets/makelist_pixrefer.py --config_path config/params.yml
- Train the model
python3 voicepuppet/pixrefer/train_pixrefer.py --config_path config/params.yml
- Use tensorboard to watch the training process
tensorboard --logdir=log/summary_pixrefer
Acknowledgement
- The face alignment model is based on Deepinx's work; it's more stable than dlib.
- The 3D face reconstruction model is based on Microsoft's work.
- The image segmentation model is based on gasparian's work.
- The image matting model is based on foamliu's work.