
Is it realtime audio 2 face?

Open kingkong135 opened this issue 1 year ago • 5 comments

Hello,

Firstly, I want to extend my sincere thanks for the great work on this repository.

I have a question regarding the functionality: Is the audio-to-face feature designed to work in real-time?

kingkong135 avatar Jan 04 '24 03:01 kingkong135

Depends on how much compute you throw at it and how fast your GPUs are. You can try it out on whatever compute you have available ;)

alexanderrichard avatar Jan 04 '24 03:01 alexanderrichard

I believe that if you don't use the rendering portion, you can just run this in realtime locally on consumer devices. Incidentally, please do this for the community: https://github.com/facebookresearch/audio2photoreal/issues/4

yosun avatar Jan 05 '24 00:01 yosun

If anyone has issues with the environment, you can use Docker:

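# start a detached CUDA container named a2p
# (add --gpus all to expose the host GPUs; this needs the NVIDIA Container Toolkit)
docker run -dit --name a2p nvidia/cuda:11.6.1-devel-ubuntu20.04
# open a shell inside the container; the remaining commands run there
docker exec -it a2p bash
apt update
apt install vim git wget gcc ffmpeg libsm6 libxext6 -y

# install miniconda (the installer is interactive; the path below assumes the default prefix for root)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
/root/miniconda3/bin/conda init bash

# install repo ...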

kingkong135 avatar Jan 05 '24 07:01 kingkong135

From the demo description: "4) Then, sit back and wait for the rendering to happen! This may take a while (e.g. 30 minutes)". Not sure if this helps answer the question, but for a 6 s audio clip on a V100, I got the following times for a single sample:

100% 100/100 [00:17<00:00,  5.71it/s]
created 3 samples
100% 100/100 [00:07<00:00, 14.13it/s]
created 3 samples
100% 120/120 [02:36<00:00,  1.31s/it]

Not sure what the 3rd step is (I assume the avatar renderer is more performant). Anyway, while the first two networks are close to real time, the last process is roughly 30x slower than real time on a modest GPU.
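For anyone who wants to redo the arithmetic on their own hardware, here is a minimal sketch that turns the progress lines above into a real-time factor (processing time divided by audio length). The helper names `elapsed_seconds` and `real_time_factor` are made up for illustration and are not part of the repo. It assumes tqdm-style lines with an `MM:SS` elapsed field and the 6 s clip length mentioned above.

```python
import re

# Progress lines copied from the comment above (tqdm-style output).
LOG_LINES = [
    "100% 100/100 [00:17<00:00,  5.71it/s]",
    "100% 100/100 [00:07<00:00, 14.13it/s]",
    "100% 120/120 [02:36<00:00,  1.31s/it]",
]
AUDIO_SECONDS = 6.0  # clip length reported in the comment


def elapsed_seconds(line: str) -> int:
    """Return the elapsed time from a tqdm-style '[MM:SS<...' field.

    Hypothetical helper: only the MM:SS form seen in this thread is handled.
    """
    match = re.search(r"\[(\d+):(\d{2})<", line)
    if match is None:
        raise ValueError(f"no elapsed-time field in: {line!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def real_time_factor(line: str, audio_seconds: float = AUDIO_SECONDS) -> float:
    """Processing time divided by audio duration; 1.0 means real time."""
    return elapsed_seconds(line) / audio_seconds


if __name__ == "__main__":
    for step, line in enumerate(LOG_LINES, start=1):
        rtf = real_time_factor(line)
        print(f"step {step}: {elapsed_seconds(line)} s, {rtf:.1f}x real time")
```

With these numbers the third step takes 156 s for 6 s of audio, i.e. 26x real time, which is in line with the rough 30x estimate.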

wandrzej avatar Jan 06 '24 14:01 wandrzej

Indeed, providing a bone dump to SMPL or Unity biped animation bones could eliminate the time-consuming third step and make this an actual realtime technology.

https://github.com/facebookresearch/audio2photoreal/issues/4

yosun avatar Jan 06 '24 23:01 yosun