
Possible to improve English and German pronunciation?

Open GeorgeS2019 opened this issue 2 years ago • 9 comments

NVIDIA NeMo (ByT5 G2P and G2P-Conformer):

NVIDIA NeMo provides grapheme-to-phoneme models for various languages, including German.

The ByT5 G2P model is based on ByT5, a byte-level T5 Transformer, and can handle out-of-vocabulary (OOV) words and heteronyms (words with the same spelling but different pronunciations).

The G2P-Conformer model is a non-autoregressive CTC model that is faster during inference.

These models allow you to enforce desired pronunciations by providing a phonetic transcript of the input. You can train and evaluate these models using manifest files containing grapheme and phoneme pairs.
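
As a rough illustration of such a manifest, the sketch below writes and reads a JSON-lines file of grapheme/phoneme pairs. The field names `text_graphemes` and `text` are what I understand the NeMo G2P documentation to use, but treat them, the file name, and the hand-written German IPA strings as assumptions to check against your NeMo version.

```python
import json
import os
import tempfile

# Illustrative grapheme/phoneme pairs; the IPA strings are hand-written
# examples, not output from any model.
PAIRS = [
    ("Guten Morgen", "ˈɡuːtn̩ ˈmɔʁɡn̩"),
    ("Danke", "ˈdaŋkə"),
]

def write_manifest(path, pairs):
    """Write one JSON object per line (field names are assumed)."""
    with open(path, "w", encoding="utf-8") as f:
        for graphemes, phonemes in pairs:
            entry = {"text_graphemes": graphemes, "text": phonemes}
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

def read_manifest(path):
    """Read the manifest back into a list of (graphemes, phonemes) tuples."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                entry = json.loads(line)
                pairs.append((entry["text_graphemes"], entry["text"]))
    return pairs

# Round-trip the pairs through a temporary file (hypothetical file name).
manifest_path = os.path.join(tempfile.mkdtemp(), "g2p_manifest.json")
write_manifest(manifest_path, PAIRS)
loaded = read_manifest(manifest_path)
```

One JSON object per line keeps the file easy to append to and to stream during training.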

GeorgeS2019 avatar Oct 14 '23 11:10 GeorgeS2019

image

Is it possible to do this using NeMoOnnxSharp for German?

GeorgeS2019 avatar Mar 21 '24 08:03 GeorgeS2019

It supports both German TTS/ASR. See this https://github.com/kaiidams/NeMoOnnxSharp/blob/ad2ffe375e525bb63c59c9b1cd5154afe70351a0/NeMoOnnxSharp.Example/Program.cs#L39

kaiidams avatar Mar 23 '24 04:03 kaiidams

I have used the code for German.

Here is my feedback:

  • The volume of TTS for German is softer than when using Microsoft Speech.
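
One possible workaround for the softer volume is to peak-normalize the generated samples before playback. This is only a minimal sketch, assuming the audio is available as floats in the range [-1.0, 1.0]; the function name and the 0.9 target are made up for illustration and are not part of NeMoOnnxSharp.

```python
def normalize_peak(samples, target_peak=0.9):
    """Scale samples so the loudest one reaches target_peak (assumed float audio)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

# A quiet signal whose loudest sample is 0.2 gets scaled up to about 0.9.
quiet = [0.0, 0.1, -0.2, 0.05]
louder = normalize_peak(quiet)
```

Peak normalization only changes the overall gain, so it will not fix differences in perceived loudness that come from the voice itself.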

GeorgeS2019 avatar Mar 23 '24 05:03 GeorgeS2019

Second,

I have seen the Mel and MFCC code. I wonder if this code can be repurposed for German audio and, eventually, used to extract German phonemes from German audio.

There is hardly anything like this anywhere on the internet. Even for Wav2Vec2, examples showing how to work with the German language are rare.

Can you do something about this?

GeorgeS2019 avatar Mar 23 '24 05:03 GeorgeS2019

> It supports both German TTS/ASR. See this

I have tried TTS/ASR for German. My interest is in extracting German phonemes from German audio.

GeorgeS2019 avatar Mar 23 '24 05:03 GeorgeS2019

In the case of German words, their pronunciation is not ambiguous. Why do you need a phonemizer? In the case of English, NeMo FastPitch was trained with a phonemizer that converts all but the ambiguous words to phonemes, and FastPitch can handle the ambiguous words in many cases.
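
To make the idea concrete, here is a toy sketch of a dictionary-based phonemizer that converts known, unambiguous words and leaves everything else as plain spelling. It is not the actual NeMo or NeMoOnnxSharp implementation; the dictionary entries, the heteronym list, and the function name are all illustrative assumptions.

```python
import re

# Toy pronunciation dictionary (ARPABET-like); entries are illustrative only.
PRONUNCIATIONS = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}

# Heteronyms: spelled the same, pronounced differently depending on meaning.
HETERONYMS = {"read", "lead", "live"}

def phonemize(text):
    """Replace known, unambiguous words with phonemes; keep the rest as graphemes."""
    tokens = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in HETERONYMS or word not in PRONUNCIATIONS:
            tokens.append(word)  # left as spelling for the TTS model to resolve
        else:
            tokens.append(" ".join(PRONUNCIATIONS[word]))
    return tokens

# "read" is ambiguous and "i" is not in the toy dictionary, so both stay as text.
result = phonemize("Hello world, I read")
```

A model trained on such mixed input sees both phonemes and graphemes, which is why it can often pick a reasonable pronunciation for the words the phonemizer skipped.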

kaiidams avatar Mar 30 '24 02:03 kaiidams

https://github.com/kaiidams/NeMoOnnxSharp/blob/main/NeMoOnnxSharp/TTSTokenizers/EnglishG2p.cs

Is there GermanG2P.cs in NeMoOnnxSharp?

> their pronunciation is not ambiguous.

Please explain. I am not sure I understand how this affects how to proceed.

GeorgeS2019 avatar Mar 30 '24 06:03 GeorgeS2019

FastPitch is a text-to-speech (TTS) model developed by NVIDIA. It's a fully-parallel transformer architecture with prosody control over pitch and individual phoneme duration¹. Here are some key features:

  • Fully-Parallel Architecture: Unlike traditional TTS models that generate speech sequentially, FastPitch generates speech in parallel, which makes it much faster¹.
  • Prosody Control: FastPitch allows for control over the pitch and duration of individual phonemes, which can make the generated speech more expressive and engaging¹.
  • Transformer-Based: FastPitch is based on the Transformer architecture, which is known for its efficiency and scalability¹.
  • Integration with NeMo: FastPitch can be trained or fine-tuned using NVIDIA's NeMo framework, a generative AI framework built for working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS)².

FastPitch is used for generating mel spectrograms from text, which can then be converted to audio using a vocoder¹. It's trained on the LJSpeech dataset sampled at 22050Hz and has been tested on generating female English voices with an American accent¹. Please note that this model works well with vocoders that were trained on 22050Hz data¹.

Source: Conversation with Bing, 3/30/2024
(1) TTS En FastPitch | NVIDIA NGC. https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_fastpitch
(2) GitHub - NVIDIA/NeMo: NeMo: a framework for generative AI. https://github.com/NVIDIA/NeMo
(3) Google Colab. https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_MixerTTS_Training.ipynb
(4) https://arxiv.org/abs/2006.06873

GeorgeS2019 avatar Mar 30 '24 06:03 GeorgeS2019

> Is there GermanG2P.cs in NeMoOnnxSharp?

NeMo's FastPitch uses a phonemizer for English but not for German. NeMoOnnxSharp doesn't contain a German phonemizer.

kaiidams avatar Mar 30 '24 06:03 kaiidams