Custom pronunciation for words - any thoughts / recommendations about how best to handle them?

Open nmstoker opened this issue 9 months ago • 2 comments

Hello! This is a really interesting looking project.

Currently there doesn't seem to be any way for users to help the model pronounce custom words correctly. For instance, JPEG is something that speakers just need to know is said as "Jay-Peg" rather than "Jay-Pea-Ee-Gee".

I appreciate this project is at an early stage, but for practical use it's essential to be able to handle correct pronunciation on some sort of override basis, especially since brands and product names often have quirky pronunciations or are completely invented words. It's not just brands: plenty of people's names need custom handling, and quite a few novel computing terms are non-obvious too.

Examples that cause problems in the current models: Cillian, Joaquin, Deirdre, Versace, Tag Heuer, Givenchy, gigabytes, RAM, MPEG etc.

Are there any suggestions on how best to tackle this?

I saw there was #33 which uses a normaliser specifically for numbers. Is there something similar for custom words? I suppose perhaps one could drop in a list of custom words and some sort of mapping to the desired pronunciation, applying that as a stage similar to how it handles abbreviations.
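To make the idea concrete, here is a minimal sketch of such a stage, assuming it runs on the prompt text before it reaches the model. The names `CUSTOM_PRONUNCIATIONS` and `apply_custom_pronunciations` are hypothetical and not part of parler-tts, and the respellings are illustrative guesses that have not been tested against any checkpoint.

```python
import re

# Hypothetical override table: written form -> respelling that the model
# may be more likely to read correctly. Entries are illustrative only.
CUSTOM_PRONUNCIATIONS = {
    "JPEG": "jay-peg",
    "MPEG": "em-peg",
    "Joaquin": "wah-keen",
    "Tag Heuer": "tag hoy-er",
}


def apply_custom_pronunciations(text, lexicon=CUSTOM_PRONUNCIATIONS):
    """Replace whole-word matches of lexicon keys with their respellings."""
    if not lexicon:
        return text
    # Longest keys first so multi-word entries win over shorter overlaps.
    keys = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    # Matching is case-sensitive on purpose, so an acronym such as "RAM"
    # could be overridden without touching the ordinary word "ram".
    return pattern.sub(lambda m: lexicon[m.group(0)], text)


if __name__ == "__main__":
    print(apply_custom_pronunciations("Joaquin sent the JPEG and the MPEG."))
```

Whether a respelling like this actually improves the audio would need checking by ear for each entry, and a flat table like this cannot distinguish words whose pronunciation depends on context.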

In espeak-backed tools, it's sometimes possible to supply custom IPA for a word that overrides the IPA generated by default, but I believe this model doesn't use IPA for controlling pronunciation.

Given how often pronunciations vary, I doubt that simply finetuning to include the words would be a viable approach.

Anyway, would be great to hear what others have to recommend.

Incidentally, certain mainstream terms also get completely garbled: it seems impossible to get Instagram, Linux or Wikipedia spoken properly. That's more of a training data issue, though, and those terms are mainstream enough that you shouldn't need to cover them via custom overrides.

nmstoker, May 12 '24 15:05

Also, maybe best as a separate issue, but heteronyms are worth considering too for practical uses.

These can't be handled by a trivial lookup since they can vary even within the same sentence depending on context:

"They had a row about exactly whose turn it was to row the boat."

  • the first "row" rhymes with "now"
  • the second "row" rhymes with "know"

nmstoker, May 12 '24 15:05

I would also be very excited if IPA could be supported. Although the current models are a major advance, there are still a large number of words which are garbled or mispronounced, and if speech could be modified with IPA it would offer the best of both worlds. Presumably this should be relatively simple to achieve by training the model on a corpus whose text had been preconverted to IPA, e.g. with eng-to-ipa (or an alternative for localised pronunciation): https://pypi.org/project/eng-to-ipa/

dgm3333, Aug 14 '24 21:08

Is there any update on this?

sankar-mukherjee, Oct 15 '24 00:10