VITS fails to synthesize intelligible audio from Punjabi dataset using CISAMPA phoneme input
Hi! I'm training the VITS model on a Punjabi single-speaker dataset using phoneme-level transcriptions written in Case Insensitive Speech Assessment Method Phonetic Alphabet (CISAMPA). The training runs without errors, but the synthesized audio is unintelligible — the words do not sound clear or meaningful, and it seems like phonemes are being mispronounced or skipped altogether. This does not happen when I train the model on the default LJSpeech dataset, which works as expected.
My setup:
- Dataset: 20 hours of Punjabi single-speaker data
- Transcriptions: CISAMPA (already phonemized)
- Cleaner: a custom cleaner that simply returns the input string:
def phoneme_cleaners(text):
    # Pass-through: input is already phonemized CISAMPA.
    return text
Config: cleaners=["phoneme_cleaners"]. Phonemes are separated by forward slashes; words are space-separated. No character-level text is used.
Other configs: Default ljs_base.json
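For context, here is roughly what the relevant part of the data section might look like for a custom-phoneme setup, assuming the stock jaywalnut310/vits config layout (field names taken from ljs_base.json; verify against your copy). Note in particular add_blank, which inserts a blank token between every symbol ID, and cleaned_text, which controls whether cleaners run at load time:

```json
"data": {
  "text_cleaners": ["phoneme_cleaners"],
  "cleaned_text": true,
  "add_blank": true
}
```

Whatever the config says, the symbols list the model is built from (text/symbols.py in the stock repo) still has to match your phoneme inventory, or IDs will be wrong regardless of the cleaner.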
My assumption is that, since the input is already phonemized in CISAMPA, no text normalization or further cleaning is required, so the cleaner is a pass-through. But I'm unsure whether VITS expects phonemes in a specific format, or whether there is preprocessing I need to adapt for a non-English phoneme inventory.
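One likely culprit: the stock VITS text frontend maps the input string to IDs one character at a time against its symbols list, so multi-character CISAMPA units like "t_d" or "a:" get fragmented (or silently dropped when a character isn't in the list), even though your cleaner is a correct pass-through. A minimal sketch of the difference, using a hypothetical phoneme subset and function names (not the actual VITS code):

```python
# Illustrative sketch of why per-character symbol lookup mangles
# multi-character phonemes, and a slash-aware tokenizer that keeps
# each CISAMPA phoneme intact. The phoneme subset is a placeholder.

CISAMPA_SUBSET = ["p", "b", "t_d", "d_d", "k", "a:", "i:", "u"]
PAD, WORD_SEP = "_", " "
symbols = [PAD, WORD_SEP] + CISAMPA_SUBSET
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def char_level_sequence(text):
    # Roughly what a per-character lookup does: "t_d" and "a:" are
    # fragmented, and unknown characters are silently dropped.
    return [symbol_to_id[c] for c in text if c in symbol_to_id]

def slash_level_sequence(text):
    # Words are space-separated, phonemes slash-separated, so each
    # CISAMPA phoneme maps to exactly one symbol ID.
    ids = []
    for word in text.split(" "):
        if ids:
            ids.append(symbol_to_id[WORD_SEP])
        ids.extend(symbol_to_id[ph] for ph in word.split("/"))
    return ids

line = "t_d/a:/p k/i:"
print(char_level_sequence(line))   # → [0, 2, 1, 6]  (phonemes mangled)
print(slash_level_sequence(line))  # → [4, 7, 2, 1, 6, 8]
```

If this matches your situation, the fix is to tokenize on "/" before the symbol lookup (and rebuild the symbols list from your CISAMPA inventory), rather than changing the cleaner.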
Any help or guidance on using custom phoneme inventories with VITS would be greatly appreciated. Thanks!
I would recommend not using VITS, because it's a fairly old architecture. You could try Orpheus TTS instead: take the base Hindi model and fine-tune it on your Punjabi data.