DINet_optimized Wav2vec mapping code

Hello,

Could you please add the code for training the wav2vec-to-deepspeech mapping?

Thank you.

Sep 21 '23 06:09 k0ngolab

Hi,

I am currently in the process of replacing wav2vec with a better solution that supports other languages. If it works, I will soon add a new model with the new mapping, along with the training code. Otherwise, I will update the repo with the wav2vec mapping.

Sep 24 '23 22:09 Elsaam2y

I tried retraining the model and syncnet with the latest version of deepspeech, but the results were worse than with the originally trained model: the generalization and the expressiveness of the lip motion were not very convincing. An alternative would be to train a mapping model from the latest version of deepspeech to the original version used with DINet. This would keep the same trained DINet model while keeping inference fast, since the latest version of deepspeech supports GPU and ONNX. I haven't had time to test it yet, but feel free to give it a try and open a PR.
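To make the mapping idea concrete, here is a minimal sketch of the simplest possible version: a per-frame linear map fitted by least squares. It is untested against real deepspeech features. The function names, the feature dimensions (32 and 29), and the synthetic data are all placeholders for illustration, and it assumes the two feature sequences are already aligned frame by frame. A real mapping would probably need a small neural network with temporal context.

```python
import numpy as np

def fit_linear_mapping(new_feats, old_feats):
    """Fit W, b so that new_feats @ W + b approximates old_feats (least squares).

    new_feats: (num_frames, new_dim) features from the newer model (assumed).
    old_feats: (num_frames, old_dim) features from the original model (assumed).
    """
    # Append a column of ones so the bias is learned together with W.
    X = np.hstack([new_feats, np.ones((new_feats.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, old_feats, rcond=None)
    return coef[:-1], coef[-1]

def apply_mapping(new_feats, W, b):
    """Project newer-model features into the original feature space."""
    return new_feats @ W + b

# Synthetic stand-in data; the dimensions are arbitrary placeholders.
rng = np.random.default_rng(0)
new_feats = rng.normal(size=(500, 32))
true_W = rng.normal(size=(32, 29))
old_feats = new_feats @ true_W + 0.5

W, b = fit_linear_mapping(new_feats, old_feats)
mapped = apply_mapping(new_feats, W, b)
print("max abs error:", np.abs(mapped - old_feats).max())
```

With real data, `new_feats` and `old_feats` would come from running both deepspeech versions on the same audio, and the error should be measured on held-out clips rather than on the training frames.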

Nov 18 '23 23:11 Elsaam2y

May I ask which version of deepspeech you ended up using, and how you solved the dimension mismatch problem during training? Thank you.

Apr 01 '24 04:04 tailangjun

I was using 0.9.1. The dimension issue arises mainly with other languages, such as Chinese. I tried learning a mapping from the obtained features to the expected dimensions, but this didn't always work well. Furthermore, deepspeech seems to cause many problems across different languages, which is why I am trying to rely mainly on mel spectrograms at the moment.
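For reference, a log-mel spectrogram depends only on the audio signal, not on any language-specific model, so its output dimension is fixed by the chosen parameters. The sketch below is a plain NumPy illustration, not the repo's actual preprocessing. The sample rate, FFT size, hop length, and number of mel bands are assumptions, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(wav, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Return log-mel features of shape (num_frames, n_mels)."""
    n_frames = 1 + (len(wav) - n_fft) // hop
    # Build an index matrix so each row selects one windowed frame.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wav[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-6)

# One second of a 440 Hz tone as stand-in audio.
wav = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = log_mel(wav)
print(feats.shape)
```

The feature dimension here is always `n_mels`, whatever the spoken language, which avoids the mismatch described above.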

Apr 24 '24 07:04 Elsaam2y

I'm curious what the difference is between the original DS model used in DINet and the 0.9.1 version. Do they output the same result given the same input audio? If so, since the later version of the DS model supports GPU and ONNX, it would already benefit from the speed improvement. Otherwise, maybe it's better to train end-to-end using a language-agnostic feature like HuBERT?
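One way to check this empirically would be to run both versions on the same clip and compare the feature arrays. The sketch below assumes each model's output has already been saved as a 2-D NumPy array of shape (frames, dims). `compare_features` is a hypothetical helper, and the random arrays only stand in for real features.

```python
import numpy as np

def compare_features(a, b, atol=1e-4):
    """Summarize how closely two (frames, dims) feature arrays agree."""
    if a.shape != b.shape:
        # Different shapes already answer the question: outputs are not interchangeable.
        return {"same_shape": False, "shape_a": a.shape, "shape_b": b.shape}
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return {
        "same_shape": True,
        "max_abs_diff": float(np.abs(a - b).max()),
        "mean_cosine": float(np.mean(dots / norms)),
        "allclose": bool(np.allclose(a, b, atol=atol)),
    }

# Stand-in arrays; real ones would be the two models' outputs for one clip.
rng = np.random.default_rng(0)
feats_original = rng.normal(size=(100, 29))
feats_newer = feats_original + 0.01 * rng.normal(size=(100, 29))

report = compare_features(feats_original, feats_newer)
print(report)
```

If `same_shape` is false or `mean_cosine` is far from 1, the two versions are not drop-in replacements for each other and a mapping or retraining would be needed.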

Oct 11 '24 08:10 PengYicong