Possible to improve English and German pronunciation?
NVIDIA NeMo (ByT5 G2P and G2P-Conformer):
NVIDIA NeMo provides grapheme-to-phoneme models for various languages, including German.
The ByT5 G2P model is a byte-level T5 neural network and can handle out-of-vocabulary (OOV) words and heteronyms (words with the same spelling but different pronunciations).
The G2P-Conformer model is a non-autoregressive CTC model that is faster during inference.
These models allow you to enforce desired pronunciations by providing a phonetic transcript of the input. You can train and evaluate these models using manifest files containing grapheme and phoneme pairs.
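As a rough illustration of such a manifest, here is a minimal sketch that builds one JSON object per line, each pairing graphemes with phonemes. The field names `text_graphemes` and `text` are my understanding of the NeMo G2P manifest format and should be checked against the NeMo documentation; the German word pairs, the IPA strings, and the file name are hypothetical examples.

```python
import json

# Hypothetical German grapheme/phoneme pairs. The IPA strings are
# illustrative only and should be verified against a pronunciation dictionary.
PAIRS = [
    ("Guten Morgen", "ˈɡuːtn̩ ˈmɔʁɡn̩"),
    ("Danke schön", "ˈdaŋkə ʃøːn"),
]


def build_manifest_lines(pairs):
    """Return one JSON string per pair: graphemes as input, phonemes as target."""
    lines = []
    for graphemes, phonemes in pairs:
        # Field names are an assumption based on the NeMo G2P docs.
        entry = {"text_graphemes": graphemes, "text": phonemes}
        # ensure_ascii=False keeps the IPA symbols readable in the file.
        lines.append(json.dumps(entry, ensure_ascii=False))
    return lines


def write_manifest(pairs, path):
    """Write the manifest as UTF-8 JSON lines."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(build_manifest_lines(pairs)) + "\n")


if __name__ == "__main__":
    write_manifest(PAIRS, "g2p_train_manifest.json")  # hypothetical file name
```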
Is it possible to do this using NeMoOnnxSharp for German?
It supports both German TTS/ASR. See this https://github.com/kaiidams/NeMoOnnxSharp/blob/ad2ffe375e525bb63c59c9b1cd5154afe70351a0/NeMoOnnxSharp.Example/Program.cs#L39
I have used the code for German.
Here is my feedback:
- The volume of TTS for German is softer than when using Microsoft Speech.
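One possible workaround for the softer output is to rescale the generated samples before playback. The sketch below does simple peak normalization and assumes the audio is available as floats in [-1.0, 1.0]; the function name and the `target_peak` default are hypothetical, and this is not something NeMoOnnxSharp is known to provide.

```python
def normalize_peak(samples, target_peak=0.9):
    """Scale float samples so that the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Peak normalization only matches the maximum amplitude, so perceived loudness can still differ between two engines.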
Second,
I have seen the Mel and MFCC code. I wonder if this code can be repurposed for German audio and eventually used to extract German phonemes from German audio.
There is hardly anything like this anywhere on the internet. Even for Wav2Vec2, it is rarely shown how to work with the German language.
Can you do something about this?
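On the Mel and MFCC question above: these features describe the spectrum of the audio and do not depend on the language, so the same front end can be applied to German recordings. Turning the features into phonemes still needs an acoustic model trained with phoneme labels. As a small sketch of the language-independent part, the code below spaces filter center frequencies evenly on the mel scale using the common HTK-style formula; the mel variant actually used in NeMoOnnxSharp may differ, and the function names are hypothetical.

```python
import math


def hz_to_mel(hz):
    """HTK-style conversion from frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels filters spaced evenly in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]
```

For example, `mel_center_frequencies(80, 0.0, 8000.0)` gives 80 centers that are dense at low frequencies and sparse at high ones, regardless of the language being spoken.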
It supports both German TTS/ASR. See this
I have tried TTS/ASR for German. My interest is the extraction of German phonemes from German audio.
In the case of German words, their pronunciation is not ambiguous. Why do you need a phonemizer? In the case of English, NeMo FastPitch was trained with a phonemizer that translates all but ambiguous words to phonemes, and FastPitch can handle the ambiguous words in many cases.
https://github.com/kaiidams/NeMoOnnxSharp/blob/main/NeMoOnnxSharp/TTSTokenizers/EnglishG2p.cs
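To make the idea concrete, here is a minimal sketch of a dictionary-based phonemizer that converts unambiguous words to phonemes and leaves ambiguous or unknown words as plain characters for the TTS model to resolve. It is not the code from EnglishG2p.cs or NeMo; the tiny dictionary, the heteronym list, and the function name are hypothetical.

```python
import re

# Toy ARPAbet dictionary (hypothetical subset, not a real lexicon).
PRON_DICT = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}

# Words whose pronunciation depends on context stay as graphemes.
HETERONYMS = {"read", "lead", "live"}


def phonemize(text):
    """Return a token list: phonemes for known words, characters otherwise."""
    tokens = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in PRON_DICT and word not in HETERONYMS:
            tokens.extend(PRON_DICT[word])
        else:
            tokens.extend(word)  # fall back to individual characters
        tokens.append(" ")  # word separator
    return tokens[:-1]  # drop the trailing separator
```

For German, where the thread states that no phonemizer is used, the input would take the character path for every word.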
Is there GermanG2P.cs in NeMoOnnxSharp?
their pronunciation is not ambiguous.
Please explain. I'm not sure I understand how this affects how to proceed.
FastPitch is a text-to-speech (TTS) model developed by NVIDIA. It's a fully-parallel transformer architecture with prosody control over pitch and individual phoneme duration¹. Here are some key features:
- Fully-Parallel Architecture: Unlike traditional TTS models that generate speech sequentially, FastPitch generates speech in parallel, which makes it much faster¹.
- Prosody Control: FastPitch allows for control over the pitch and duration of individual phonemes, which can make the generated speech more expressive and engaging¹.
- Transformer-Based: FastPitch is based on the Transformer architecture, which is known for its efficiency and scalability¹.
- Integration with NeMo: FastPitch can be trained or fine-tuned using NVIDIA's NeMo framework, a generative AI framework built for working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS)².
FastPitch is used for generating mel spectrograms from text, which can then be converted to audio using a vocoder¹. The tts_en_fastpitch checkpoint on NGC is trained on the LJSpeech dataset sampled at 22050 Hz and has been tested on generating female English voices with an American accent¹. Note that this model works well with vocoders that were trained on 22050 Hz data¹.
Source: Conversation with Bing, 3/30/2024
(1) TTS En FastPitch | NVIDIA NGC. https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_fastpitch.
(2) GitHub - NVIDIA/NeMo: NeMo: a framework for generative AI. https://github.com/NVIDIA/NeMo.
(3) Google Colab. https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_MixerTTS_Training.ipynb.
(4) FastPitch: Parallel Text-to-speech with Pitch Prediction (arXiv). https://arxiv.org/abs/2006.06873.
Is there GermanG2P.cs in NeMoOnnxSharp?
NeMo's FastPitch uses a phonemizer for English but not for German. NeMoOnnxSharp doesn't contain a German phonemizer.