How to adapt or train AV-HuBERT for other languages?
Thanks for the awesome work! I am wondering if it is possible to make AV-HuBERT work for other languages, e.g., Chinese.
I notice that there is a multilingual version in the paper. Is it compatible with different languages? If not, could you provide any suggestions, assuming a Chinese lip-movement dataset is available?
Thanks!
@cooelf Yes, using AV-HuBERT for other languages should also work. You can take a pre-trained checkpoint (large or base) and fine-tune it on a Chinese lip-reading dataset following the fine-tuning command, and refer to this for how to prepare the data. Alternatively, pre-training a Chinese AV-HuBERT model from scratch is also doable if you have a sufficiently large amount of audio-visual data.
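For concreteness, fine-tuning in this repo is launched through fairseq-hydra-train with Hydra-style overrides. Below is a minimal sketch that follows the shape of the README command; the config name, every path, and the tokenizer are placeholders you would swap for your own Chinese data (e.g., a sentencepiece model trained on your Chinese transcripts), not exact values.

```sh
# Illustrative sketch only -- config name and all paths are placeholders;
# see the repo's fine-tuning docs for the exact config files.
cd avhubert
fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/chinese/data \
  task.label_dir=/path/to/chinese/labels \
  task.tokenizer_bpe_model=/path/to/chinese/tokenizer.model \
  model.w2v_path=/path/to/pretrained_avhubert_checkpoint.pt \
  hydra.run.dir=/path/to/experiment/finetune/ \
  common.user_dir=`pwd`
```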
We mentioned a multilingually pre-trained AV-HuBERT in the paper, but that model was not released as it is not as good as the English-only one on the LRS3 benchmark. JFYI, we did multilingual fine-tuning of AV-HuBERT in our follow-up work, and you can find the model checkpoints in this repo.