RAD-NeRF
Bad video quality after training based on my video.
Dear ashawkey,
Thanks for your great project.
I followed the process in the readme exactly; the original video is 4 minutes in total (25 fps).
I trained 200,000 iters for the head plus an additional 50,000 iters for fine-tuning the lips (250,000 iters in total), but I ended up with the synthetic video below. Do you have any suggestions? How can I get a synthetic video of similar quality to the demo Obama video you provided? Thanks a lot!
https://user-images.githubusercontent.com/45660925/209437878-28e8a7cf-2192-41e6-a59b-54185c1e39da.mp4
As you can see, the eyes look very strange, and the speaking lips also look very strange.
@ruanjiyang Hi,
- It seems the eyes are not well learned. In this case, you could try fixing the eye movement with `--fix_eye 0.25`.
- Lip sync for non-English datasets is usually worse due to the ASR model.
- For the torso, it seems some of the semantic segmentation is wrong. Training a torso model may help.
Dear ashawkey,
Thanks for your feedback; let me try again.
I have tried the Chinese version of wav2vec2, see the following line:
parser.add_argument('--model', type=str, default='ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt')
I found that the audio_dim for this model is 21128, which is much larger than for the 'cpierse/wav2vec2-large-xlsr-53-esperanto' model, which is only 44.
Is there anything wrong? Should I use such a large audio_dim for 'ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt'?
Thanks.
This is caused by the large number of Chinese character classes. I'm afraid this will be too large for the MLP to work well, but you could try. In fact, character labels are not very suitable for guiding the lips, since what we actually need is the sound (phonemes).
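For reference, the audio_dim here is just the CTC vocabulary size of the wav2vec2 checkpoint, so it can be compared without running the full pipeline. A small sketch using the standard transformers API (not part of this repo):

```python
# Compare the CTC vocabulary size (== logits dim, i.e. audio_dim) of two wav2vec2 checkpoints.
# Assumes the `transformers` library; configs are downloaded from the Hugging Face hub.
from transformers import AutoConfig

for name in [
    "cpierse/wav2vec2-large-xlsr-53-esperanto",
    "ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt",
]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, "->", cfg.vocab_size)  # per the numbers reported above: ~44 vs ~21128 classes per frame
```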
Many thanks for your contribution! Great work!
I have the same issue.
The dataset is around 5 minutes of data (25 fps), talking in Mandarin.
- Expressiveness: it seems the lips can open and close based on the voice, but the lip shape is not very expressive. I tried fine-tuning the lips with more iters, but the LPIPS loss doesn't improve. Do I need more training data, or should I change the audio feature extraction method? Any comments?
- Open during silence: when there is no voice, the mouth usually appears to be open. How can I close the lips during silence?
@Erickrus Hi, could you check the performance on the self-driven test set? Which ASR model are you using? Fine-tuning the lips mainly aims to improve sharpness and may not help enhance lip sync.
Hi @Erickrus, the latest Chinese DeepSpeech ASR model, deepspeech-0.9.3-models-zh-CN.pbmm, might work. I'm trying it.
log_ngp.txt after --finetune_lips step
++> Evaluate at epoch 37 ...
PSNR = 26.028605
LPIPS (alex) = 0.082468
Performance on the self-driven test set:
- The performance is better than with TTS audio. Lip sync still shows some inconsistencies (not too many) => not very responsive to the voice, which suggests the features are not perfectly aligned with the voice in time.
- Some movements still look like simple open/close compared to GT.
- In some cases the mouth doesn't close on b, p sounds.
ASR model (by default): cpierse/wav2vec2-large-xlsr-53-esperanto
# try to visualize the extracted audio features
import numpy as np
from PIL import Image

data = np.load('aud.npy')  # placeholder path: the extracted ASR feature file, shape [N, 16, audio_dim]
data = np.reshape(data, [data.shape[0], data.shape[1] * data.shape[2]])  # e.g. [837, 16*44]
data = (data - np.min(data)) / (np.max(data) - np.min(data))  # normalize to [0, 1]
im = Image.fromarray((data * 255.).astype(np.uint8))
im
It seems the features are not distinguishable from character to character (compared to a mel spectrogram).
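For a side-by-side comparison, a mel spectrogram of the same clip can be plotted next to the logits image above. A rough sketch with librosa and matplotlib (not part of the repo's pipeline; 'aud.wav' is a placeholder path):

```python
# Rough sketch: visualize a mel spectrogram of the same audio for comparison with the ASR logits.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("aud.wav", sr=16000)                    # load audio at 16 kHz
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)  # 80 mel bands
mel_db = librosa.power_to_db(mel, ref=np.max)                # log scale
img = librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(img)
plt.show()
```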
For ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt, maybe one could merge the logits based on pinyin codes.
Yes, the current audio processing pipeline is quite problematic for Chinese...
In my experiments, using 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn' works better for Chinese (3503 to 64). By contrast, the 'ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt' model makes the mouth static (21128 to 64).
And this Chinese ASR project is quite useful (it runs in real time): https://github.com/chenkui164/FastASR
@a312863063 Hi, how do you merge the original logits into a low-dimension vector?
Hi, you can see how it maps a predicted vector of any dimension to the 64-dimensional features here. If the input dimension is too high, or the predicted vector is not accurate, the result will not be very good.
I just directly passed the ASR prediction results to AudioNet. Maybe you could make some changes to AudioNet to adapt it to the new ASR. Good luck!
Is there any improvement when switching to 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn'?
I tried but failed. So how should I change the encoder_conv module of AudioNet? The audio dim_in of 'wav2vec2-large-xlsr-53-chinese-zh-cn' is 3503, which is far more than 44. https://github.com/ashawkey/RAD-NeRF/blob/32a5aba2d102b62a2c0a7adbf4e1e6e7564e8e44/nerf/network.py#L46
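For anyone attempting this, the key point is that only the first layer's input size has to match the new ASR dimension; with a dim_in as large as 3503 it may also help to project the logits down before the temporal convolutions. Below is a hedged sketch of an AudioNet-style encoder; it is not the repository's actual module, just an illustration of the shape bookkeeping:

```python
# Hedged sketch of an AudioNet-style encoder for a large ASR logits dimension.
# This is NOT the repository's actual network; it only illustrates that the
# first layer's input size is what has to match the new audio_dim.
import torch
import torch.nn as nn

class AudioEncoderSketch(nn.Module):
    def __init__(self, dim_in=3503, dim_aud=64, win_size=16):
        super().__init__()
        # project the huge per-frame logits down before the temporal convs
        self.proj = nn.Linear(dim_in, 128)
        self.encoder_conv = nn.Sequential(
            nn.Conv1d(128, 64, kernel_size=3, stride=2, padding=1),  # 16 -> 8
            nn.LeakyReLU(0.02, inplace=True),
            nn.Conv1d(64, 64, kernel_size=3, stride=2, padding=1),   # 8 -> 4
            nn.LeakyReLU(0.02, inplace=True),
            nn.Conv1d(64, dim_aud, kernel_size=4, stride=4),         # 4 -> 1
        )

    def forward(self, x):                         # x: [B, win_size, dim_in]
        x = self.proj(x)                          # [B, win_size, 128]
        x = x.permute(0, 2, 1)                    # [B, 128, win_size]
        return self.encoder_conv(x).squeeze(-1)   # [B, dim_aud]

feat = torch.randn(4, 16, 3503)                   # a window of 16 ASR frames
print(AudioEncoderSketch()(feat).shape)           # torch.Size([4, 64])
```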
ASR result of model 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn': 大家豪我 c l瑞 月 就 c塞 今日姑远临防 防 连控机止林 开 li良 c场 西 门 发布会 在音月八日 下午三时的发布会 上 姑院 联防连控机制 将介绍第 史版防控发案 地有关情况 国家级控局 相关司局负责 同治和中国 集控中心专家 将初起 逸月期日 院临 防 连控机制以 举 办 西文发布会 介绍了农 村 地区异情流行 期间结 合病毒便意情 况 意情流 行强度 医疗 资源复合 和社会运转 情况综合评 估事时 依法采取离时 性的防控所施 皆少职元 聚集 降低一人院流动 建今感染 者段时期巨 增队社会运行 和医疗 资源等的充击 春杰 吉将莱林 怨在卖 回家 的人能够抱着评 安庸着见 康 拆着幸 福鞋 着 快 乐 漏 cá 温 馨带着田 蜜 先着 才 运麦鲁加 门进请开 心 二年二三年会 是个美好的心 开端
The composited video is like this (NOT SO GOOD, with a lot of AMBIGUITY and WRONG PRONUNCIATION):
ASR result of model paraformer from FastASR: 大家好我是瑞瑞就在今日国务院联防联控机制连开两场新闻发布会就在一月八日下午三时的发布会上国务院联防联控机制将介绍第十版防控方案的有关情况国家疾控局相关司局负责同志和中国疾控中心专家将出席一月七日国务院联防联控机制已举办新闻发布会介绍了农村地区疫情防控有关情况就在昨日春运正式开启不少小伙伴已踏上返乡行程返乡途中如何做好防护返乡初期要注意什么返乡后出现症状怎么办该方案明确要加强监测预警优化检测策略调整传染源管理方式等并提出在疫情流行期间结合病毒变异情况疫情流行强度医疗资源负荷和社会运转情况综合评估适时依法采取临时性的防控措施减少人员聚集降低人员流动减轻感染者短时期剧增对社会运行和医疗资源等的冲击春节即将来临愿在外回家的人能够抱着平安拥着健康揣着幸福携着快乐搂着温馨带着甜蜜牵着财运迈入家门尽情开心二零二三年会是个美好的新开端
I'm optimizing this to see if ASR accuracy affects lip synthesis...
Did you figure out whether ASR accuracy affects lip synthesis? I have tried several Chinese ASR models, such as 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn' and 'TencentGameMate/chinese-wav2vec2-large', but the result has not improved significantly. What about your trials?
Have you tried the deepspeech-0.9.3-models-zh-CN.pbmm model mentioned above? I found that the pbmm file format is not compatible with the currently used deepspeech model.
@flyingshan Hi, I tried this pbmm model and found the same problem... I also tried the Chinese version of wav2vec, 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn', but I didn't achieve the same performance as what @a312863063 shows above, probably because my training video is not suitable.
Please note that .pbmm is not the same as .pb; you have to convert it manually from the checkpoints. Of course, you can also rewrite the deepspeech feature extraction part to be compatible with the .pbmm format.
You can look into deepspeech.cc
@ashawkey Hi, sorry for bothering you again... I've trained on three different videos and tried three ASR models, including the default wav2vec, deepspeech 0.6.0, and jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn. However, I got reconstructions with totally static faces. I guess the problem is not caused by the ASR model. Please give me some suggestions. Thank you!
This is one of my training videos (about 4 min): man_1.zip
This is the reconstruction using jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn: https://user-images.githubusercontent.com/38695396/219374822-147cd71c-b979-4dbb-9bca-aacc0891db09.mp4
@JuneoXIE The training video looks good, and I think the default wav2vec model should be able to work (at least not totally static). Could you provide the exact command line you use?
Hi, thank you for the response! I double-checked the training parameters and found that I had mistakenly set the frame-extraction rate to 30 fps while my input video had been converted to 25 fps. The reconstruction with static lips was caused by the misaligned training data...
The reconstruction using default wav2vec is good! https://user-images.githubusercontent.com/38695396/220499119-cb13a778-d6cb-42b0-9768-8cf5329ed80f.mp4
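As an aside, this kind of fps mismatch is easy to catch before preprocessing by checking the video's actual frame rate, e.g. with OpenCV (the path below is a placeholder):

```python
# Quick sanity check: confirm the video really is 25 fps before extracting frames.
# 'data/man_1/man_1.mp4' is a placeholder path.
import cv2

cap = cv2.VideoCapture("data/man_1/man_1.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
n_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
cap.release()
print(f"fps = {fps:.2f}, frames = {int(n_frames)}, duration ~ {n_frames / fps:.1f}s")
```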
@JuneoXIE Hello, we also used the model 'jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn', but the results were not satisfactory. I see that you had similar problems. How is your Chinese training going now? Looking forward to your reply.
Here is an idea for everyone: the features extracted by the ASR model are probabilities over characters rather than over "sounds", and since Chinese has many characters and the ASR model misrecognizes them easily, the extracted features are weak. Converting the "characters" recognized by the ASR model into pinyin, which is more closely related to the speech, or even into initials and finals, extracts more effective features for Chinese. My implementation: code. In my experiments this gave an improvement over the original; I hope it helps.
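A minimal sketch of this pinyin-merging idea, assuming pypinyin and the transformers tokenizer API (this is not the linked implementation, just an illustration):

```python
# Hedged sketch: merge per-character wav2vec2 CTC logits into pinyin classes,
# so the driving feature depends on pronunciation rather than on which of many
# homophonous characters the ASR happens to pick. Not the linked implementation.
import numpy as np
from pypinyin import lazy_pinyin
from transformers import Wav2Vec2Processor

name = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"
processor = Wav2Vec2Processor.from_pretrained(name)
vocab = processor.tokenizer.get_vocab()              # token string -> id

# Build a mapping: character id -> pinyin class (special tokens keep their own class).
pinyin_of = {}
for tok, idx in vocab.items():
    if len(tok) == 1 and "\u4e00" <= tok <= "\u9fff":  # CJK character
        pinyin_of[idx] = lazy_pinyin(tok)[0]            # e.g. "的" -> "de"
    else:
        pinyin_of[idx] = tok                            # <pad>, |, letters, ...
classes = sorted(set(pinyin_of.values()))
class_id = {p: i for i, p in enumerate(classes)}

def merge_logits(logits):
    """logits: [T, vocab_size] -> merged: [T, n_classes] (max over same-pinyin characters)."""
    merged = np.full((logits.shape[0], len(classes)), -1e9, dtype=np.float32)
    for idx, p in pinyin_of.items():
        j = class_id[p]
        merged[:, j] = np.maximum(merged[:, j], logits[:, idx])
    return merged

print(len(vocab), "->", len(classes), "classes")  # thousands of characters collapse to far fewer pinyin classes
```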
I tried the method you provided, but the results did not improve. Also, this method introduces new errors for polyphonic characters (characters with multiple pronunciations). How did your experiments go? Is there anything I misunderstood?
I haven't found a way to solve the polyphonic-character problem either. In my experiments, driving with phonemes this way gave better sync, but in theory the method depends heavily on ASR accuracy, and the ASR accuracy of jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn is not very high; when some speech is recognized inaccurately, the results may degrade.
@flyingshan Could you provide a demo video?
Sorry, I experimented on a video I shot myself and don't have the subject's permission, so it's not convenient to share it.
I also got the blinking-eyes result...