Adding support for a Windows SAPI5 implementation

Open king-dahmanus opened this issue 3 years ago • 8 comments

Hey there developers! I found this repo by exploring, and I'd like to make a request: please release a Windows SAPI5 version of the TTS engine, compatible with all the available voices, with the necessary encoders integrated to ensure fast and responsive synthesis. Details below. I am a blind person who uses a screen reader to use the computer. Blind people like me require a responsive speech synthesizer so they can receive the requested information without any unnecessary delays, and quite a large share of them require very fast speech output without weird voice artifacts such as those produced by natural-sounding TTS voices. If I were ignorant of how much hard work it takes, I would ask you to make an NVDA add-on containing the synthesizer along with an option to download the voices, but a more mainstream, Windows-integrated option like SAPI5 would maybe be a little easier? Anyway, I know that this project is meant for Raspberry Pi/command-line usage, but the currently available voices attracted someone like me, who would benefit more from an option suited to, say, daily usage. I look forward to your response. This is just a request from me; if it can't be done, it can't be done. So thanks, and have a good time.

king-dahmanus avatar Nov 14 '21 15:11 king-dahmanus

Hi @king-dahmanus, thanks for your feedback! I would definitely be interested in adding SAPI5 support for Windows in order to make Larynx more accessible to everyone. I'll have to look into what it would take to implement a TTS engine interface.

I've experimented with getting the voices much more responsive in my Glow Speak project, which runs a daemon and caches all of the WAV files it produces (it also uses eSpeak to turn text into phonemes). As you mentioned, though, there are weird artifacts for short phrases, especially single words. I believe this is largely a problem with the datasets I have; none of them feature single word utterances, and many of them have sentences split across multiple utterances (so no pauses at the beginning or end).
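
To make the caching idea concrete, here is a minimal sketch of a WAV cache keyed on the voice and text. It is not Glow Speak's actual code: `cached_synthesize`, the `synthesize` callback, and the on-disk layout are all hypothetical names chosen for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_synthesize(
    text: str,
    voice: str,
    cache_dir: Path,
    synthesize: Callable[[str, str], bytes],
) -> bytes:
    """Return WAV bytes for (voice, text), synthesizing only on a cache miss.

    `synthesize` is a hypothetical stand-in for the real TTS call; it takes
    (text, voice) and returns the bytes of a WAV file.
    """
    # Hash the voice and text together so each pair maps to one cache file.
    key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
    wav_path = cache_dir / f"{key}.wav"

    if wav_path.is_file():
        # Cache hit: skip synthesis entirely.
        return wav_path.read_bytes()

    # Cache miss: synthesize once and store the result for next time.
    wav_bytes = synthesize(text, voice)
    cache_dir.mkdir(parents=True, exist_ok=True)
    wav_path.write_bytes(wav_bytes)
    return wav_bytes
```

A cache like this helps most with phrases a screen reader repeats often (menu names, key echoes), since a second request for the same text never reaches the synthesizer.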

Do you know of any public audio datasets that contain only complete sentences and single spoken words? If not, would you be interested in collaborating to create one?

synesthesiam avatar Nov 14 '21 16:11 synesthesiam

Hi there. So, about the public datasets, I do not know of anything that exists currently. As for creating one, how would I go about collaborating with you to create the needed datasets? Thanks in advance.

king-dahmanus avatar Nov 14 '21 19:11 king-dahmanus

Do you, or anyone you know, have a pleasant voice, a good microphone, and a lot of patience? :slightly_smiling_face:

I've worked with several people to create text to speech datasets. I use an algorithm to select a (relatively) small set of phonetically diverse sentences from a public domain book or corpus. Here, I would also make sure that we have a diversity of single spoken words.
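
The exact selection algorithm isn't described in this thread, but one common way to approach the problem is a greedy coverage pass: repeatedly pick the sentence that adds the most not-yet-covered sound units. The sketch below is only an illustration under that assumption; `select_diverse` and `letter_bigrams` are hypothetical names, and letter bigrams stand in for real phonemes (which would come from a phonemizer such as eSpeak).

```python
from typing import Callable, Iterable, List, Set


def select_diverse(
    sentences: Iterable[str],
    units_of: Callable[[str], Set[str]],
    max_sentences: int,
) -> List[str]:
    """Greedily pick sentences that add the most not-yet-covered units."""
    candidates = [(sentence, units_of(sentence)) for sentence in sentences]
    covered: Set[str] = set()
    selected: List[str] = []

    while candidates and len(selected) < max_sentences:
        # Find the candidate contributing the most new units (first one wins ties).
        best_index = max(
            range(len(candidates)),
            key=lambda i: len(candidates[i][1] - covered),
        )
        sentence, units = candidates.pop(best_index)
        if not units - covered:
            # Nothing left adds coverage, so stop early.
            break
        selected.append(sentence)
        covered |= units

    return selected


def letter_bigrams(sentence: str) -> Set[str]:
    """Crude stand-in for phoneme pairs: adjacent letters, ignoring non-letters."""
    letters = [c for c in sentence.lower() if c.isalpha()]
    return {a + b for a, b in zip(letters, letters[1:])}
```

For single spoken words, the same function could be run over a word list instead of sentences, with `units_of` returning the phonemes of each word.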

synesthesiam avatar Nov 17 '21 02:11 synesthesiam

Well, I do have a teen voice, and a good-quality microphone with some background static noise that I can't do anything to fix. But anyway, if you want, give me a txt file containing the words or sentences I should speak, and I'll record them and clean them up to the best of my ability. Oh, also tell me the preferred format for the audio files, and I'll make an archive with labeled file names for all the sentences and words spoken in it. Thanks!

king-dahmanus avatar Nov 17 '21 12:11 king-dahmanus

Hey, here's a multi-language dataset I found. It's Common Voice, and it claims to be the largest dataset of its kind. Check it out at https://commonvoice.mozilla.org/en/datasets

king-dahmanus avatar Nov 17 '21 13:11 king-dahmanus

The Common Voice datasets are excellent, but not ideal for a text to speech voice. For text to speech, you want a lot of high quality data from very few speakers (no noise, if possible). For speech to text, however, Common Voice is great -- lots of noisy data from many speakers.

Let me look around a bit more before asking you to do any recording. A lot of the text to speech datasets are derived from LibriVox, and I'm hoping there will be a book there where the author reads out lists of items so we can get isolated spoken words.

synesthesiam avatar Nov 19 '21 16:11 synesthesiam

Right, good luck! I look forward to it, and as I said, if you need me to record anything, please let me know!

king-dahmanus avatar Nov 19 '21 20:11 king-dahmanus

Hey, what's new? Are you working on something yet, Michael? I wanted to tell you something. We could ignore the dataset issue for the moment and concentrate on making this thing support SAPI5 on Windows. Also, the speed I'm talking about isn't a matter of pronouncing words with the right intonation, but rather of being able to speak at very fast speech rates without producing weird artifacts, and of being responsive, with no lag or delay while speaking. Maybe this is already accomplished since it's designed for the Raspberry Pi, but still. Thanks, and have a good time.

king-dahmanus avatar Nov 28 '21 15:11 king-dahmanus