
Cross-Modal Perceptionist

CVPR 2022 "Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?"

Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann, University of Southern California

[Paper] [Project page] [Voxceleb-3D Data]

[TODO]: evaluation code; training code

We study cross-modal learning and analyze the correlation between voices and 3D face geometry. Unlike previous methods, which study the correlation between voices and faces only in the 2D domain, we choose a 3D representation that can better validate the supporting physiological evidence: voices correlate with skeletal and articulator structures, which in turn potentially affect facial geometry.

Comparison of recovered 3D face meshes with the baseline.

Consistency for the same identity using different utterances.

Demo: Preprocessed fbank

We test on Ubuntu 16.04 LTS with an NVIDIA 2080 Ti (only GPU execution is supported) and use Anaconda for installing packages.

Install packages

  1. conda create --name CMP python=3.8

  2. Install a PyTorch build compatible with your machine. We test on PyTorch v1.9 (it should be compatible with other 1.x versions)

  3. Install the other dependencies: opencv-python, scipy, PIL (Pillow), Cython, pyaudio

    Or use the environment.yml we provide instead:

    • conda env create -f environment.yml
    • conda activate CMP
  4. Build the rendering toolkit (C++ and Cython) for overlaying 3D meshes on images

    cd Sim3DR
    bash build_sim3dr.sh
    cd ..
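
To verify the build, the compiled renderer should be importable from Python. A quick sanity check (assuming the build exposes a top-level Sim3DR module, as in related 3DDFA-V2-style code; adjust the import if the name differs):

    # Sanity check: confirm the compiled extension and PyTorch are usable.
    # The `Sim3DR` module name is an assumption based on the build script.
    import torch

    try:
        import Sim3DR  # compiled C++/Cython rendering toolkit
        print("Sim3DR import OK")
    except ImportError as e:
        print("Sim3DR not built correctly:", e)

    print("PyTorch", torch.__version__, "| CUDA available:", torch.cuda.is_available())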
    

Download pretrained models and 3DMM configuration data

  1. Download from [here] (~160 MB) and unzip under the root folder. This will create 'pretrained_models' and 'train.configs' under the root folder.
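
To confirm the download unpacked correctly, the checkpoints should deserialize as ordinary PyTorch files. A minimal check (file names are discovered at runtime rather than assumed):

    # List the unpacked checkpoints and confirm each one deserializes.
    # Assumes standard PyTorch .pth/.pt serialization.
    from pathlib import Path
    import torch

    for ckpt in sorted(Path("pretrained_models").glob("*.pt*")):
        state = torch.load(ckpt, map_location="cpu")
        entries = len(state) if isinstance(state, dict) else "n/a"
        print(ckpt.name, "->", entries, "top-level entries")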

Read the preprocessed fbank for inference

  1. python demo.py (This will fetch the preprocessed MFCC and use them as network inputs)
  2. Results will be generated under data/results/ (pre-generated references are under data/results_reference)

More preprocessed MFCC and 3D mesh (3DMM params) pairs can be downloaded: [Voxceleb-3D Data].
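
At a high level, the inference in demo.py follows a simple pattern: load a precomputed fbank/MFCC tensor and regress 3DMM parameters from it. The sketch below illustrates that pattern only; the encoder is a stand-in stub, not the paper's network, and the 62-dimensional output follows common 3DMM parameterizations rather than this repo's exact layout.

    # Illustrative fbank/MFCC -> 3DMM-parameter flow. The stub network and
    # tensor shapes are placeholders; see demo.py for the real model.
    import numpy as np
    import torch
    import torch.nn as nn

    class VoiceEncoderStub(nn.Module):
        """Placeholder mapping (batch, frames, n_mels) features to 3DMM params."""
        def __init__(self, n_mels=64, n_params=62):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(),
                                     nn.Linear(128, n_params))
        def forward(self, x):
            return self.net(x).mean(dim=1)  # average-pool over time frames

    mfcc = np.random.randn(300, 64).astype("float32")  # stands in for a loaded .npy
    x = torch.from_numpy(mfcc).unsqueeze(0)            # (1, frames, n_mels)

    model = VoiceEncoderStub().eval()
    with torch.no_grad():
        params = model(x)                              # (1, 62) 3DMM coefficients
    print(params.shape)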

Demo: Use device mic input

  1. Complete setup steps 1-5 above. In addition, download the face-type meshes and extract them under ./face_types

  2. python demo_mic.py. The demo records 5 seconds of audio from your device microphone and predicts the face mesh; a minimal recording sketch follows below.
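
The 5-second capture can be reproduced with pyaudio (already in the dependency list). A minimal sketch, assuming a 16 kHz mono stream; check demo_mic.py for the rate it actually uses:

    # Record ~5 seconds of mono audio with pyaudio and save 16-bit PCM.
    # The 16 kHz sample rate is an assumption, not taken from demo_mic.py.
    import wave
    import pyaudio

    RATE, CHUNK, SECONDS = 16000, 1024, 5

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(RATE * SECONDS // CHUNK)]
    stream.stop_stream()
    stream.close()
    pa.terminate()

    with wave.open("recording.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))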

We perform unsupervised gender classification based on the mean male and female shapes, computing statistics between the predicted face and each mean shape. We also compute the distance of the prediction to four face types (Regular, Slim, Skinny, Wide) and indicate which type the voice is closest to; a sketch of this nearest-mean-shape comparison follows below.

  1. Results will be generated under data/results
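
The face-type decision reduces to a nearest-mean-shape rule: compare the predicted vertices against each reference mesh and pick the closest one. A hedged sketch of that comparison (the random arrays stand in for real meshes loaded from ./face_types):

    # Nearest-mean-shape classification over the four face types.
    # Random arrays are illustrative stand-ins for the reference meshes.
    import numpy as np

    def mean_vertex_distance(pred, ref):
        """Average Euclidean distance between corresponding (N, 3) vertices."""
        return np.linalg.norm(pred - ref, axis=1).mean()

    rng = np.random.default_rng(0)
    pred_mesh = rng.normal(size=(1000, 3))  # predicted face vertices
    face_types = {name: rng.normal(size=(1000, 3))
                  for name in ("Regular", "Slim", "Skinny", "Wide")}

    dists = {name: mean_vertex_distance(pred_mesh, ref)
             for name, ref in face_types.items()}
    print("closest face type:", min(dists, key=dists.get))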

Citation

If you find our work useful, please consider citing us:

@inproceedings{wu2022cross,
title={Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?},
author={Wu, Cho-Ying and Hsu, Chin-Cheng and Neumann, Ulrich},
booktitle={CVPR},
year={2022}
}

This project is developed on top of [SynergyNet], [3DDFA-V2], and [reconstruction-faces-from-voice].