VITS fails to synthesize intelligible audio from Punjabi dataset using CISAMPA phoneme input
Hi! I'm training the VITS model on a Punjabi single-speaker dataset using phoneme-level transcriptions written in Case Insensitive Speech Assessment Method Phonetic Alphabet (CISAMPA). The training runs without errors, but the synthesized audio is unintelligible — the words do not sound clear or meaningful, and it seems like phonemes are being mispronounced or skipped altogether. This does not happen when I train the model on the default LJSpeech dataset, which works as expected.
My setup:
- Dataset: 20 hours of Punjabi single-speaker data
- Transcriptions: CISAMPA (already phonemized)
- Cleaner: a custom cleaner that simply returns the input string:
def phoneme_cleaners(text):
    # Pass-through: input is already phonemized CISAMPA.
    return text
Config: cleaners=["phoneme_cleaners"]. Phonemes are separated by forward slashes; words are space-separated. No character-level text is used.
Other configs: Default ljs_base.json
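For context, here is roughly what the relevant part of the data section might look like for a custom-phoneme setup, assuming the stock jaywalnut310/vits config layout (field names taken from ljs_base.json; verify against your copy). Note in particular add_blank, which inserts a blank token between every symbol ID, and cleaned_text, which controls whether cleaners run at load time:

```json
"data": {
  "text_cleaners": ["phoneme_cleaners"],
  "cleaned_text": true,
  "add_blank": true
}
```

Whatever the config says, the symbols list the model is built from (text/symbols.py in the stock repo) still has to match your phoneme inventory, or IDs will be wrong regardless of the cleaner.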
My assumption is that, since the input is already phonemized in CISAMPA, no text normalization or further cleaning is required, so the cleaner is a pass-through. But I'm unsure whether VITS expects phonemes in a specific format, or whether there is preprocessing I need to adapt for a non-English phoneme inventory.
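One likely culprit: the stock VITS text frontend maps the input string to IDs one character at a time against its symbols list, so multi-character CISAMPA units like "t_d" or "a:" get fragmented (or silently dropped when a character isn't in the list), even though your cleaner is a correct pass-through. A minimal sketch of the difference, using a hypothetical phoneme subset and function names (not the actual VITS code):

```python
# Illustrative sketch of why per-character symbol lookup mangles
# multi-character phonemes, and a slash-aware tokenizer that keeps
# each CISAMPA phoneme intact. The phoneme subset is a placeholder.

CISAMPA_SUBSET = ["p", "b", "t_d", "d_d", "k", "a:", "i:", "u"]
PAD, WORD_SEP = "_", " "
symbols = [PAD, WORD_SEP] + CISAMPA_SUBSET
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def char_level_sequence(text):
    # Roughly what a per-character lookup does: "t_d" and "a:" are
    # fragmented, and unknown characters are silently dropped.
    return [symbol_to_id[c] for c in text if c in symbol_to_id]

def slash_level_sequence(text):
    # Words are space-separated, phonemes slash-separated, so each
    # CISAMPA phoneme maps to exactly one symbol ID.
    ids = []
    for word in text.split(" "):
        if ids:
            ids.append(symbol_to_id[WORD_SEP])
        ids.extend(symbol_to_id[ph] for ph in word.split("/"))
    return ids

line = "t_d/a:/p k/i:"
print(char_level_sequence(line))   # → [0, 2, 1, 6]  (phonemes mangled)
print(slash_level_sequence(line))  # → [4, 7, 2, 1, 6, 8]
```

If this matches your situation, the fix is to tokenize on "/" before the symbol lookup (and rebuild the symbols list from your CISAMPA inventory), rather than changing the cleaner.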
Any help or guidance on using custom phoneme inventories with VITS would be greatly appreciated. Thanks!
I would recommend not using VITS, because it's a fairly old architecture. You could try Orpheus TTS instead: take the base Hindi model and fine-tune it on your Punjabi data.