RAD-NeRF
Bad video quality after training based on my video.
Dear ashawkey,
Thanks for your great project.
I followed the process in the readme exactly; the original video is 4 minutes in total (25 fps).
I trained 200,000 iters for the head plus an additional 50,000 iters for fine-tuning the lips (250,000 iters in total), but I ended up with the synthetic video below. Do you have any suggestions? How can I get a synthetic video of similar quality to the demo Obama video you provided? Thanks a lot!
https://user-images.githubusercontent.com/45660925/209437878-28e8a7cf-2192-41e6-a59b-54185c1e39da.mp4
As you can see, the eyes look very strange, and the speaking lips also look very strange.
@ruanjiyang Hi,
- It seems the eyes are not well learned. In this case, you could try fixing the eye movement with `--fix_eye 0.25`.
- Lip sync for non-English datasets is usually worse due to the ASR model.
- For the torso, it seems some of the semantic segmentation is wrong. Training a torso model may help.
Dear ashawkey,
Thanks for your feedback; let me try again.
I have tried the Chinese version of wav2vec2, see the following line:
parser.add_argument('--model', type=str, default='ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt')
I found that the audio_dim for this model is 21128, which is much larger than for the 'cpierse/wav2vec2-large-xlsr-53-esperanto' model, which is only 44.
Is there anything wrong? Should I use such a large audio_dim for 'ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt'?
Thanks.
This is caused by the large number of Chinese character classes. I'm afraid this will be too large for the MLP to work well, but you could try. In fact, character labels are not very suitable for guiding the lips, since what we actually need is the sound (phonemes).
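For reference, the audio_dim here is just the CTC vocabulary size of the wav2vec2 checkpoint, so it can be compared without running the full pipeline. A small sketch using the standard transformers API (not part of this repo):

```python
# Compare the CTC vocabulary size (== logits dim, i.e. audio_dim) of two wav2vec2 checkpoints.
# Assumes the `transformers` library; configs are downloaded from the Hugging Face hub.
from transformers import AutoConfig

for name in [
    "cpierse/wav2vec2-large-xlsr-53-esperanto",
    "ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt",
]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, "->", cfg.vocab_size)  # per the numbers reported above: ~44 vs ~21128 classes per frame
```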
Many thanks for your contribution! Great work!
I have the same issue.
The dataset is around 5 minutes of data (25 fps), talking in Mandarin.
- Expressiveness: it seems the lips can open and close based on the voice, but the lip shape is not very expressive. I tried fine-tuning the lips with more iters, but the LPIPS loss doesn't improve. Do I need more training data, or should I change the audio feature extraction method? Any comments?
- Open during silence: when there is no voice, the mouth usually appears to be open. How can I close the lips during silence?
@Erickrus Hi, could you check the performance on the self-driven test set? Which ASR model are you using? Fine-tuning the lips mainly aims to improve sharpness and may not help enhance lip sync.
Hi @Erickrus, the latest Chinese DeepSpeech ASR model, deepspeech-0.9.3-models-zh-CN.pbmm, might work. I'm trying it.
log_ngp.txt after --finetune_lips step
++> Evaluate at epoch 37 ...
PSNR = 26.028605
LPIPS (alex) = 0.082468
Performance on the self-driven test set:
- The performance is better than with TTS audio. Lip sync still shows some inconsistencies (not too many) => not very responsive to the voice, which suggests the features are not perfectly aligned with the voice in time.
- Some movements still look like simple open/close compared to GT.
- In some cases the mouth doesn't close on b, p sounds.
ASR model (by default): cpierse/wav2vec2-large-xlsr-53-esperanto
# try to visualize the extracted audio features
import numpy as np
from PIL import Image

data = np.load('aud.npy')  # placeholder path: the extracted ASR feature file, shape [N, 16, audio_dim]
data = np.reshape(data, [data.shape[0], data.shape[1] * data.shape[2]])  # e.g. [837, 16*44]
data = (data - np.min(data)) / (np.max(data) - np.min(data))  # normalize to [0, 1]
im = Image.fromarray((data * 255.).astype(np.uint8))
im
It seems the features are not distinguishable from character to character (compared to a mel spectrogram).
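For a side-by-side comparison, a mel spectrogram of the same clip can be plotted next to the logits image above. A rough sketch with librosa and matplotlib (not part of the repo's pipeline; 'aud.wav' is a placeholder path):

```python
# Rough sketch: visualize a mel spectrogram of the same audio for comparison with the ASR logits.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("aud.wav", sr=16000)                    # load audio at 16 kHz
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)  # 80 mel bands
mel_db = librosa.power_to_db(mel, ref=np.max)                # log scale
img = librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(img)
plt.show()
```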
For ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt, maybe one could merge the logits based on pinyin codes.
Yes, the current audio processing pipeline is quite problematic for Chinese...
In my experiments, using 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn' works better for Chinese (3503 to 64). By contrast, the 'ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt' model makes the mouth static (21128 to 64).
And this Chinese ASR project is quite useful (it runs in real time): https://github.com/chenkui164/FastASR
@a312863063 Hi, how do you merge the original logits into a low-dimension vector?
Hi, you can see how it maps a predicted vector of any dimension to the 64-dimensional features here. If the input dimension is too high, or the predicted vector is not accurate, the result will not be very good.
I just directly passed the ASR prediction results to AudioNet. Maybe you could make some changes to AudioNet to adapt it to the new ASR. Good luck!
Is there any improvement when switching to 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn'?
I tried but failed. So how should I change the encoder_conv module of AudioNet? The audio dim_in of 'wav2vec2-large-xlsr-53-chinese-zh-cn' is 3503, which is far more than 44. https://github.com/ashawkey/RAD-NeRF/blob/32a5aba2d102b62a2c0a7adbf4e1e6e7564e8e44/nerf/network.py#L46
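For anyone attempting this, the key point is that only the first layer's input size has to match the new ASR dimension; with a dim_in as large as 3503 it may also help to project the logits down before the temporal convolutions. Below is a hedged sketch of an AudioNet-style encoder; it is not the repository's actual module, just an illustration of the shape bookkeeping:

```python
# Hedged sketch of an AudioNet-style encoder for a large ASR logits dimension.
# This is NOT the repository's actual network; it only illustrates that the
# first layer's input size is what has to match the new audio_dim.
import torch
import torch.nn as nn

class AudioEncoderSketch(nn.Module):
    def __init__(self, dim_in=3503, dim_aud=64, win_size=16):
        super().__init__()
        # project the huge per-frame logits down before the temporal convs
        self.proj = nn.Linear(dim_in, 128)
        self.encoder_conv = nn.Sequential(
            nn.Conv1d(128, 64, kernel_size=3, stride=2, padding=1),  # 16 -> 8
            nn.LeakyReLU(0.02, inplace=True),
            nn.Conv1d(64, 64, kernel_size=3, stride=2, padding=1),   # 8 -> 4
            nn.LeakyReLU(0.02, inplace=True),
            nn.Conv1d(64, dim_aud, kernel_size=4, stride=4),         # 4 -> 1
        )

    def forward(self, x):                         # x: [B, win_size, dim_in]
        x = self.proj(x)                          # [B, win_size, 128]
        x = x.permute(0, 2, 1)                    # [B, 128, win_size]
        return self.encoder_conv(x).squeeze(-1)   # [B, dim_aud]

feat = torch.randn(4, 16, 3503)                   # a window of 16 ASR frames
print(AudioEncoderSketch()(feat).shape)           # torch.Size([4, 64])
```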
ASR result of model 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn': 大家豪我 c l瑞 月 就 c塞 今日姑远临防 防 连控机止林 开 li良 c场 西 门 发布会 在音月八日 下午三时的发布会 上 姑院 联防连控机制 将介绍第 史版防控发案 地有关情况 国家级控局 相关司局负责 同治和中国 集控中心专家 将初起 逸月期日 院临 防 连控机制以 举 办 西文发布会 介绍了农 村 地区异情流行 期间结 合病毒便意情 况 意情流 行强度 医疗 资源复合 和社会运转 情况综合评 估事时 依法采取离时 性的防控所施 皆少职元 聚集 降低一人院流动 建今感染 者段时期巨 增队社会运行 和医疗 资源等的充击 春杰 吉将莱林 怨在卖 回家 的人能够抱着评 安庸着见 康 拆着幸 福鞋 着 快 乐 漏 cá 温 馨带着田 蜜 先着 才 运麦鲁加 门进请开 心 二年二三年会 是个美好的心 开端
The composited video is like this (NOT SO GOOD, with a lot of AMBIGUITY and WRONG PRONUNCIATION):
ASR result of model paraformer from FastASR: 大家好我是瑞瑞就在今日国务院联防联控机制连开两场新闻发布会就在一月八日下午三时的发布会上国务院联防联控机制将介绍第十版防控方案的有关情况国家疾控局相关司局负责同志和中国疾控中心专家将出席一月七日国务院联防联控机制已举办新闻发布会介绍了农村地区疫情防控有关情况就在昨日春运正式开启不少小伙伴已踏上返乡行程返乡途中如何做好防护返乡初期要注意什么返乡后出现症状怎么办该方案明确要加强监测预警优化检测策略调整传染源管理方式等并提出在疫情流行期间结合病毒变异情况疫情流行强度医疗资源负荷和社会运转情况综合评估适时依法采取临时性的防控措施减少人员聚集降低人员流动减轻感染者短时期剧增对社会运行和医疗资源等的冲击春节即将来临愿在外回家的人能够抱着平安拥着健康揣着幸福携着快乐搂着温馨带着甜蜜牵着财运迈入家门尽情开心二零二三年会是个美好的新开端
I'm optimizing this to see if ASR accuracy affects lip synthesis...
Did you figure out whether ASR accuracy affects lip synthesis? I have tried several Chinese ASR models, such as 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn' and 'TencentGameMate/chinese-wav2vec2-large', but the result has not improved significantly. What about your trials?
Have you tried the deepspeech-0.9.3-models-zh-CN.pbmm model mentioned above? I found that the pbmm file format is not compatible with the currently used deepspeech model.
@flyingshan Hi, I tried this pbmm model and found the same problem... I also tried the Chinese version of wav2vec, 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn', but I didn't achieve the same performance as what @a312863063 shows above, probably because my training video is not suitable.
Please note that .pbmm is not the same as .pb; you have to convert it manually from the checkpoints. Of course, you can also rewrite the deepspeech feature extraction part to be compatible with the .pbmm format.
You can look into deepspeech.cc
@ashawkey Hi, sorry for bothering you again... I've trained on three different videos and tried three ASR models, including the default wav2vec, deepspeech 0.6.0, and jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn. However, I got reconstructions with totally static faces. I guess the problem is not caused by the ASR model. Please give me some suggestions. Thank you!
This is one of my training videos (about 4 min): man_1.zip
This is the reconstruction using jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn: https://user-images.githubusercontent.com/38695396/219374822-147cd71c-b979-4dbb-9bca-aacc0891db09.mp4
@JuneoXIE The training video looks good, and I think the default wav2vec model should be able to work (at least not totally static). Could you provide the exact command line you use?
Hi, thank you for the response! I double-checked the training parameters and found that I had mistakenly set the frame-extraction rate to 30 fps while my input video had been converted to 25 fps. The reconstruction with static lips was caused by the misaligned training data...
The reconstruction using default wav2vec is good! https://user-images.githubusercontent.com/38695396/220499119-cb13a778-d6cb-42b0-9768-8cf5329ed80f.mp4
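As an aside, this kind of fps mismatch is easy to catch before preprocessing by checking the video's actual frame rate, e.g. with OpenCV (the path below is a placeholder):

```python
# Quick sanity check: confirm the video really is 25 fps before extracting frames.
# 'data/man_1/man_1.mp4' is a placeholder path.
import cv2

cap = cv2.VideoCapture("data/man_1/man_1.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
n_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
cap.release()
print(f"fps = {fps:.2f}, frames = {int(n_frames)}, duration ~ {n_frames / fps:.1f}s")
```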
@JuneoXIE Hello, we also used the model 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn', but the results were not satisfactory. I see that you had similar problems. How is your Chinese training going now? Looking forward to your reply.
Here is an idea for everyone: the features extracted by the ASR model are probabilities over characters rather than over "sounds", and since Chinese has many characters and the ASR model misrecognizes them easily, the extracted features are weak. Converting the "characters" recognized by the ASR model into pinyin, which is more closely related to the speech, or even into initials and finals, extracts more effective features for Chinese. My implementation: code. In my experiments this gave an improvement over the original; I hope it helps.
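A minimal sketch of this pinyin-merging idea, assuming pypinyin and the transformers tokenizer API (this is not the linked implementation, just an illustration):

```python
# Hedged sketch: merge per-character wav2vec2 CTC logits into pinyin classes,
# so the driving feature depends on pronunciation rather than on which of many
# homophonous characters the ASR happens to pick. Not the linked implementation.
import numpy as np
from pypinyin import lazy_pinyin
from transformers import Wav2Vec2Processor

name = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"
processor = Wav2Vec2Processor.from_pretrained(name)
vocab = processor.tokenizer.get_vocab()              # token string -> id

# Build a mapping: character id -> pinyin class (special tokens keep their own class).
pinyin_of = {}
for tok, idx in vocab.items():
    if len(tok) == 1 and "\u4e00" <= tok <= "\u9fff":  # CJK character
        pinyin_of[idx] = lazy_pinyin(tok)[0]            # e.g. "的" -> "de"
    else:
        pinyin_of[idx] = tok                            # <pad>, |, letters, ...
classes = sorted(set(pinyin_of.values()))
class_id = {p: i for i, p in enumerate(classes)}

def merge_logits(logits):
    """logits: [T, vocab_size] -> merged: [T, n_classes] (max over same-pinyin characters)."""
    merged = np.full((logits.shape[0], len(classes)), -1e9, dtype=np.float32)
    for idx, p in pinyin_of.items():
        j = class_id[p]
        merged[:, j] = np.maximum(merged[:, j], logits[:, idx])
    return merged

print(len(vocab), "->", len(classes), "classes")  # thousands of characters collapse to far fewer pinyin classes
```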
I tried the method you provided, but the results did not improve. Also, this method introduces new errors for polyphonic characters (characters with multiple pronunciations). How did your experiments go? Is there anything I misunderstood?
I haven't found a way to solve the polyphonic-character problem either. In my experiments, driving with phonemes this way gave better sync, but in theory the method depends heavily on ASR accuracy, and the ASR accuracy of jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn is not very high; when some speech is recognized inaccurately, the results may degrade.
@flyingshan Could you provide a demo video?
Sorry, I experimented on a video I shot myself and don't have the subject's permission, so it's not convenient to share it.
I also got the blinking-eyes result...