snips-issues
[snips-tts][picotts] Numbers with dots are pronounced inconsistently for en-GB
Version
Snips Flow 1.1.2 (0.62.3), default TTS provider picotts, Raspbian 9.8
How to reproduce
On a Pi with Snips Flow installed (audioserver and tts at least), run:
mosquitto_pub -t 'hermes/tts/say' -m '{"text": "I am running Snips version 1.1.2 (0.62.3).", "siteId": "default", "lang": "en"}'
Spoken output
This says: "I am running Snips version first of the first two, zero point six two dot three."
Expected spoken output
I would expect: "I am running Snips version one point one point two, zero point sixty two point three" or something like that.
The issue
The current audio output of numbers with dots is very inconsistent:
- The first version number is pronounced "first of the first..." for no apparent reason.
- In the second version number the first dot is pronounced "dot" and the second dot is pronounced "point". They should be pronounced the same: both "point" or both "dot".
The issue is broader than version numbers. For instance, IP addresses are also pronounced inconsistently: in "192.168.0.1" the first dot is pronounced "point" and the second and third one "dot."
Some background
Apparently Snips is using the en-GB language in picotts by default. That made me wonder: how would picotts handle this with the en-US language?
pico2wave -w /tmp/version.wav -l en-US "I am running Snips version 1.1.2 (0.62.3)."
play /tmp/version.wav
This results in the expected spoken output: no weird "first of the first" and a fully consistent pronunciation of the dot.
With the en-US language the dot is also pronounced consistently in IP addresses.
Workaround
I could rewrite these version number strings before sending them to the TTS in my app Assistant information, but a platform-wide solution is preferable, because version numbers and IP addresses are quite generic strings that should be pronounced correctly.
Hi @koenvervloesem ,
First, wow, I am impressed by this ticket description.
I have tested your apps today and noticed this issue when asking for the platform version.
About the pico issue: unfortunately we are merely an integrator for this TTS.
The latest issue we solved for it was a problem with "-" being mistakenly taken for a command option.
About en-GB: there is another issue with "&" being pronounced "ampersand". With en-US the problem disappears.
I think en-GB was chosen because it sounds a bit better; on the other hand, it seems that en-US has a more elegant way to utter things. (This needs to be discussed internally; I will keep you posted.)
About best practices to support various TTS.
The thing is, depending on the TTS, pronunciation can vary. A good way to proceed is to generate explicit sentences if you want it to be universal for a given language.
I have already observed this behavior with other technologies (a NAO robot using Acapela or Nuance TTS). The best approach is to transform an input such as "192.168.1.10" into something more explicit, such as "My IP address is, 192, point, 168, point, 1, point, 10".
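As a rough illustration of that transformation, here is a minimal sketch in Python. The function name `spell_out_dots` and the regular expression are assumptions made for this example, not part of Snips or picotts, and the output format simply mirrors the "192, point, 168..." style quoted above.

```python
import re

# Matches runs of digits joined by dots, e.g. "1.1.2" or "192.168.0.1".
# A dot that ends a sentence is not followed by a digit, so it is left alone.
DOTTED_NUMBER = re.compile(r"\d+(?:\.\d+)+")


def spell_out_dots(text, separator="point"):
    """Replace each dot inside a dotted number with an explicit spoken word.

    Hypothetical helper: the name and behavior are illustrative only.
    """
    def expand(match):
        parts = match.group(0).split(".")
        return f", {separator}, ".join(parts)

    return DOTTED_NUMBER.sub(expand, text)


print(spell_out_dots("My IP address is 192.168.1.10"))
```

Note that this sketch also rewrites ordinary decimals such as "3.14", so an app would probably want to apply it only to strings it knows to be version numbers or IP addresses.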
Hi @cpoisson, thanks for looking into this. One of my New Year's resolutions was filing more bug reports for projects I care about. So here I am, filling your repositories with issues ;-)
I'm still struggling with best practices for voice applications. For instance, your approach of generating explicit sentences to 'help' the TTS makes sense at first sight, but what about multimodal applications? For instance, when I have a display on my voice assistant (I believe @philipp2310 and @oziee are working on display integrations for Snips, see https://forum.snips.ai/t/snips-gui-or-visual-display-of-assistant/1861) and I ask for an IP address, I don't want to see "192, point, 168, point, 1, point, 10" on the display...
The ampersand issue that I filed (https://github.com/snipsco/snips-issues/issues/85) is actually in the same situation in the context of a multimodal application: when there's a & in a name, I don't want to see it changed to 'and' on the display.
Do you really have something like 192.168.0.1 being spoken by the TTS as "one nine two point one six eight..."?
I would expect the TTS to output "one hundred and ninety two (pause) one hundred sixty eight...", because the TTS sees a "." as the end of a sentence rather than as something it has to pronounce as "point".
Like @cpoisson said, anything out of the ordinary needs to be converted to a sentence that the TTS can understand and say.
My calc app converts numbers to words so the TTS says the right things. As for my screen app, my backend code sends whatever you set it to send to the display.
@koenvervloesem,
IMHO, when dealing with multimodal interaction, each interface needs to format the information according to its own constraints before presenting it to the user.
The constraint here is that the text field in the API is primarily intended for the TTS; it lets the dialog know when to continue or end the session, based on the duration of the TTS output.
What you want is to send one message to multiple outputs, such as the TTS, a screen display and so on. Following SOLID principles, it is not recommended to use one component for multiple purposes, as it will crumble under the colliding requirements of all the interfaces.
The best approach is to separate the messages, one for each interface.
Three possible solutions, given the current state of the Snips platform:
A.
- Publish on a message broker or call a function to provide the message to be displayed (e.g. a screen listening to a topic on MQTT).
- Use hermes protocol to use the TTS and continue your dialog flow (e.g. end or continue)
B.
- Don't use the TTS text field, but beware of the timeout of the dialog session (a bit racy).
C.
- Write your own dialog manager to replace snips-dialogue. This one could be tough: you have to toggle the ASR and hotword on and off yourself, take the output of the ASR and feed it to the NLU to retrieve intents.
- As this is low level, there is a higher risk of breaking changes that could alter the behavior of your implementation.
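Solution A could be sketched roughly as follows. This is only an illustration under assumptions: the `hermes/tts/say` topic and its `text`, `siteId` and `lang` fields are taken from the report at the top of this thread, while the display topic `myapp/display/show`, its payload, and the helper names are hypothetical. The `publish` argument stands for whatever MQTT publish function the app already uses.

```python
import json

TTS_TOPIC = "hermes/tts/say"          # topic used in the report above
DISPLAY_TOPIC = "myapp/display/show"  # hypothetical topic, not part of Hermes


def build_messages(display_text, spoken_text, site_id="default", lang="en"):
    """Build one message per interface: raw text for the screen,
    TTS-friendly text for the speech output."""
    tts_payload = {"text": spoken_text, "siteId": site_id, "lang": lang}
    display_payload = {"text": display_text, "siteId": site_id}
    return [
        (DISPLAY_TOPIC, json.dumps(display_payload)),
        (TTS_TOPIC, json.dumps(tts_payload)),
    ]


def publish_all(publish, messages):
    """Send every (topic, payload) pair through the given publish callable."""
    for topic, payload in messages:
        publish(topic, payload)


# Usage with a stand-in publisher that just prints each message.
messages = build_messages(
    display_text="192.168.0.1",
    spoken_text="192, point, 168, point, 0, point, 1",
)
publish_all(lambda topic, payload: print(topic, payload), messages)
```

With a real broker, `publish` could be the `publish` method of an MQTT client object; inside a dialog session the spoken text would more likely go through the session messages than through a direct TTS call.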
Hi @cpoisson, thanks for your extensive reply! Ok, so it seems to me solution A is the least brittle one. Then it makes sense for Snips apps to 'massage' the text into something that the TTS can speak correctly, while feeding the original, unprocessed text to the display. (But I still would like basic things like version numbers and IP addresses to be pronounced correctly out-of-the-box, like it's done with en-US in picotts.)
Now, @philipp2310 and @oziee, when will you release your display projects, preferably with the same API, so we can integrate our apps into it easily? ;-)
By the way (but now I'm drifting off...), supporting Speech Synthesis Markup Language for the TTS would make it possible to solve these issues more cleanly, as the app creator can then indicate how a specific text fragment should be interpreted (and hence pronounced). A screen display component can then just listen to the same TTS message and then also use this interpretation metadata, e.g. to automatically format a date in another way than default text. This would make multimodal applications possible even without the applications having to support it, and without having to show strange things on the screen like "192, point, 168, point, 1, point, 10".
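To make the SSML idea a bit more concrete, here is a small sketch of what a shared message might look like. It assumes an SSML-aware TTS, which the platform does not offer according to this thread; the `say-as` element and its `interpret-as` attribute come from the SSML specification, but the value `characters` and the helper names are illustrative assumptions. The TTS would consume the markup, while a display could strip it and show the original text.

```python
import xml.etree.ElementTree as ET


def build_ssml(prefix, value, interpret_as):
    """Wrap `value` in a say-as element so a TTS knows how to read it.

    Hypothetical helper; the interpret-as values a TTS accepts vary.
    """
    speak = ET.Element("speak")
    speak.text = prefix
    say_as = ET.SubElement(speak, "say-as", {"interpret-as": interpret_as})
    say_as.text = value
    return ET.tostring(speak, encoding="unicode")


def display_text(ssml):
    """Drop the markup and keep the plain text for a screen."""
    return "".join(ET.fromstring(ssml).itertext())


ssml = build_ssml("My IP address is ", "192.168.0.1", "characters")
print(ssml)                # markup intended for the TTS
print(display_text(ssml))  # plain text intended for the display
```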