
Contribution Request: Offline speech recognition

Open patrickjquinn opened this issue 8 years ago • 28 comments

Hi guys, I want to add open-source offline speech recognition to this project, initially just for English. It might be worth investigating the work of XNOR.ai and, failing that, building a full, optimised model for PocketSphinx.

Longer term, I'd like this model to be trained by interactions with the platform, and then to have some sort of central repository for the model so it can be synced across all instances of the platform.

Anyone willing to help? Or have any ideas?

patrickjquinn avatar Jan 28 '17 17:01 patrickjquinn

Willing to do what I can to help! :)

h4ckd0tm3 avatar Jan 28 '17 17:01 h4ckd0tm3

Excellent, thanks for the offer. What skills do you possess?

patrickjquinn avatar Jan 28 '17 18:01 patrickjquinn

PHP, HTML, JavaScript, Java, C#....

I graduated from a Secondary Technical College in Engineering in Austria. Working as a sysadmin, with advanced Linux skills.

Never worked with Node.js before, but I have basic knowledge.

h4ckd0tm3 avatar Jan 28 '17 18:01 h4ckd0tm3

Okay, that's perfect. Have you ever worked with PocketSphinx or CMUSphinx before?

patrickjquinn avatar Jan 28 '17 18:01 patrickjquinn

@patrickjquinn Which module is currently providing the speech recognition?

Marak avatar Jan 28 '17 18:01 Marak

@patrickjquinn No, but it looks interesting! And I'm willing to study this shit xD

@Marak As far as I know, annyang.

h4ckd0tm3 avatar Jan 28 '17 19:01 h4ckd0tm3

Ah, the person behind say! I'm using your fantastic module for the RasPi client!

At the moment, it's done using online APIs (Google Cloud Speech and Wit.ai), with node-record-lpcm16 handling audio capture on the clients.
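
For reference, a minimal sketch of that online flow (my illustration, not code from this repo, assuming the older record.start() API of node-record-lpcm16 and the @google-cloud/speech streaming client; newer releases of the recorder use recorder.record({...}).stream() instead):

```js
// Minimal sketch of the current online flow: mic audio from node-record-lpcm16
// piped into Google Cloud Speech's streaming recognizer. Written against the
// older record.start() API; check the installed versions before relying on it.
const record = require('node-record-lpcm16');
const speech = require('@google-cloud/speech');

const client = new speech.SpeechClient();

const request = {
  config: {
    encoding: 'LINEAR16',
    sampleRateHertz: 16000,
    languageCode: 'en-US',
  },
  interimResults: false,
};

// Open a streaming recognize request and log final transcripts as they arrive.
const recognizeStream = client
  .streamingRecognize(request)
  .on('error', console.error)
  .on('data', data => {
    const result = data.results[0];
    if (result && result.alternatives[0]) {
      console.log('Transcript:', result.alternatives[0].transcript);
    }
  });

// Record 16 kHz mono PCM from the default mic and pipe it straight to the API.
record
  .start({ sampleRateHertz: 16000, threshold: 0 })
  .pipe(recognizeStream);
```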

I've experimented with PocketSphinx but found it to be...too unreliable.

Hence my desire to build something more fit-for-task that can be trained dynamically and manually by the community. Open-source STT would be a massive coup for the open-source community working on projects such as this.

Think you might be able to help out?

patrickjquinn avatar Jan 28 '17 19:01 patrickjquinn

@DevelopingUnicorn Excellent :) Well, I'd suggest you try to get https://github.com/cmusphinx/node-pocketsphinx or https://syl22-00.github.io/pocketsphinx.js/ (both JavaScript bindings for PocketSphinx) recognising speech locally; that should be all the research you'll need :) You can contact me via the project's Gitter: https://gitter.im/P-Brain/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge
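
If the JavaScript bindings prove awkward, one low-dependency way to sanity-check local recognition (an alternative to the bindings, assuming PocketSphinx is installed so the pocketsphinx_continuous CLI is on the PATH) is to shell out to it from Node and read its stdout:

```js
// Sanity-check local recognition by spawning the pocketsphinx_continuous CLI.
// -infile decodes a pre-recorded 16 kHz, 16-bit mono WAV; -logfn silences
// PocketSphinx's very chatty logging so stdout is just the hypothesis text.
const { spawn } = require('child_process');

function recognizeWav(wavPath) {
  return new Promise((resolve, reject) => {
    const ps = spawn('pocketsphinx_continuous', [
      '-infile', wavPath,
      '-logfn', '/dev/null',
    ]);

    let transcript = '';
    ps.stdout.on('data', chunk => { transcript += chunk.toString(); });
    ps.on('error', reject);
    ps.on('close', code => {
      if (code !== 0) return reject(new Error(`pocketsphinx exited with ${code}`));
      resolve(transcript.trim());
    });
  });
}

// Usage: recognizeWav('test.wav').then(console.log).catch(console.error);
```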

patrickjquinn avatar Jan 28 '17 21:01 patrickjquinn

To kick this off, I've started a new project, Luther (https://github.com/patrickjquinn/Luther); the name is from Martin Luther King, i.e. free speech. Initially it just contains a giant text file of 450k+ English words. I'll expand this to a giant list of English sentences, popular musicians and slang words sourced from various databases.

patrickjquinn avatar Jan 29 '17 18:01 patrickjquinn

Okay guys, so tomorrow I'm going to populate Luther with a setup guide for PocketSphinx and a precompiled English dictionary for it. I'll also create a basic Node module for recording raw input and isolating the frequencies of human speech, which should allow for easier extraction.
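
As a rough illustration of that filtering step (my own sketch, not the actual module): a crude first-order band-pass over 16-bit mono PCM that keeps roughly the 300-3400 Hz speech band. A real module would want proper biquad filters, but the shape of the processing is the same:

```js
// Crude "isolate the speech band" pass: a first-order high-pass (~300 Hz)
// cascaded with a first-order low-pass (~3400 Hz) over 16-bit little-endian
// mono PCM samples in a Buffer. Returns a new Buffer of the same length.
function bandpassSpeech(pcmBuffer, sampleRate = 16000) {
  const dt = 1 / sampleRate;
  const rcLow = 1 / (2 * Math.PI * 3400);  // low-pass cutoff ~3400 Hz
  const rcHigh = 1 / (2 * Math.PI * 300);  // high-pass cutoff ~300 Hz
  const aLow = dt / (rcLow + dt);
  const aHigh = rcHigh / (rcHigh + dt);

  const out = Buffer.alloc(pcmBuffer.length);
  let lp = 0;            // low-pass filter state
  let hp = 0, prev = 0;  // high-pass filter state

  for (let i = 0; i + 1 < pcmBuffer.length; i += 2) {
    const x = pcmBuffer.readInt16LE(i);
    lp = lp + aLow * (x - lp);        // low-pass the raw sample
    hp = aHigh * (hp + lp - prev);    // high-pass the low-passed signal
    prev = lp;
    const clamped = Math.max(-32768, Math.min(32767, Math.round(hp)));
    out.writeInt16LE(clamped, i);
  }
  return out;
}
```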

Also, when the time comes that we have a solid platform, I'll host it on a beefy box "in the cloud" with 20 or so cores (I have some spare Azure hosting credits) so everyone can access it via a simple API for their projects!

Can anyone who wants to help let me know so I can add them as admins to the Luther project?

patrickjquinn avatar Jan 31 '17 22:01 patrickjquinn

I'll help! Started playing around with PocketSphinx and I'm totally into this! Looking forward to being a part of this project!

h4ckd0tm3 avatar Feb 01 '17 08:02 h4ckd0tm3

Excellent, I'll add you as an admin! Did you make any progress getting it to recognise speech?

patrickjquinn avatar Feb 01 '17 08:02 patrickjquinn

Not yet. Here in Austria we have to do military service at the age of 18, and I have 5 months left, so my time is limited. But I hope I'll get it working by Friday!

h4ckd0tm3 avatar Feb 01 '17 14:02 h4ckd0tm3

No rush! It's for your own benefit, not mine :) I'll have everything mentioned above committed by tonight.

patrickjquinn avatar Feb 01 '17 16:02 patrickjquinn

Does XNOR.ai do image/video recognition too? I also think offline should be the priority of this project.

staberas avatar Feb 12 '17 18:02 staberas

Yes indeed they do, but I don't believe they have released anything yet.

While I also believe it should be a priority, it's a gigantic task and way beyond the capabilities of any one or two people (especially if one of those people is me). Basically, it's not something I can handle alone, hence this contribution request.

patrickjquinn avatar Feb 12 '17 19:02 patrickjquinn

Looking around, I found https://github.com/zzmp/juliusjs, which is a 'fork' of https://github.com/julius-speech/julius, a speech recognition engine that runs on Ubuntu.

staberas avatar Feb 16 '17 11:02 staberas

Any updates on this? Getting local speech recognition to work right can be hard.

Will we have default support for macOS?

Looking forward to project updates. This is awesome work being done here!

Marak avatar Mar 02 '17 00:03 Marak

I think we'll almost certainly be using PocketSphinx for speech recognition unless we can find something better. I attempted to get PocketSphinx and Node.js talking to each other last week, but nobody maintains the Node.js bindings anymore. To answer your question, though: it's almost certain it will be cross-platform compatible as long as all the dependencies support it too.

timstableford avatar Mar 02 '17 08:03 timstableford

Maybe there's some hope from Mozilla's "DeepSpeech" engine? They're claiming a 6.5% error rate at this point. https://github.com/mozilla/DeepSpeech

i-am-malaquias avatar May 28 '18 21:05 i-am-malaquias

Iiiinnnteresting... anyone want to attempt to write a Node wrapper for this?

If we can make this work, then it makes the project (and some of the modular forks I've been working on behind closed doors) more viable vs. other open-source VAs, and we can start more actively maintaining it.

Long terms I’d love to see this or a variant of this as a proper open source Alexa competitor with an open skills ecosystem and companion apps.

patrickjquinn avatar May 28 '18 21:05 patrickjquinn

It sounds like a great idea to me. I think DeepSpeech only processes chunks of audio, though? We'd also need to extract those chunks from a stream, which is quite a big chunk of work to do right.

timstableford avatar May 31 '18 08:05 timstableford

From my research, there is a branch for doing real-time analysis called "streaming-interface" that should be able to accept a raw stream from the mic. It just requires a rebuild using the build instructions.

The other option is to capture the mic stream after a Snowboy keyword is detected and save that stream to a file, then run DeepSpeech over it and extract the text. Less elegant, but it should work.
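
A minimal sketch of that second option, assuming the deepspeech npm bindings (the API shown matches the later 0.x releases; the Model constructor and stt() signature changed a few times, so check the installed version). captureUtterance is a hypothetical stand-in for whatever mic/Snowboy plumbing hands back the recorded audio once the keyword fires:

```js
// "Capture after the hotword, then decode" using the deepspeech npm bindings.
// Older releases took extra constructor arguments and a sample rate in stt().
const DeepSpeech = require('deepspeech');

// Pre-trained acoustic model downloaded from the DeepSpeech releases page.
const model = new DeepSpeech.Model('models/deepspeech.pbmm');

// captureUtterance is a hypothetical helper: it resolves with a Buffer of raw
// 16 kHz, 16-bit mono PCM recorded after the Snowboy keyword fired.
async function onHotword(captureUtterance) {
  const pcm = await captureUtterance();
  const text = model.stt(pcm); // synchronous decode of the whole utterance
  console.log('Heard:', text);
  return text;
}
```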

patrickjquinn avatar May 31 '18 08:05 patrickjquinn

I'm mainly thinking about detecting when a command ends. With the first one, does it work like Google's, where you tell it to start and then it automatically ends on silence? With the second, I think I read that the Node.js bindings for DeepSpeech can accept an audio buffer, so we could at least cut out the filesystem.

timstableford avatar May 31 '18 11:05 timstableford

We would have to build a timeout wrapper ourselves (similar to how the RasPi client does it) that automatically closes the mic stream a few seconds after it hears the keyword.

The alternative is to start a timer once the decibel level on the mic drops below a certain threshold (i.e. silence).
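
A minimal sketch of that alternative (my own illustration, assuming micStream is a stream of 16-bit little-endian mono PCM, e.g. from node-record-lpcm16, and onUtteranceEnd is whatever closes the mic and re-arms hotword detection):

```js
// Compute RMS per PCM chunk and end the utterance once the level has stayed
// below a threshold for SILENCE_MS.
const SILENCE_THRESHOLD = 1000; // RMS level treated as "quiet" (tune per mic)
const SILENCE_MS = 1500;        // how long it must stay quiet before we stop

function rms(chunk) {
  if (chunk.length < 2) return 0;
  let sum = 0;
  for (let i = 0; i + 1 < chunk.length; i += 2) {
    const s = chunk.readInt16LE(i);
    sum += s * s;
  }
  return Math.sqrt(sum / (chunk.length / 2));
}

function watchForSilence(micStream, onUtteranceEnd) {
  let timer = null;
  micStream.on('data', chunk => {
    if (rms(chunk) < SILENCE_THRESHOLD) {
      // Quiet chunk: start the countdown if it isn't already running.
      if (!timer) timer = setTimeout(onUtteranceEnd, SILENCE_MS);
    } else {
      // Speech again: cancel any pending end-of-utterance.
      clearTimeout(timer);
      timer = null;
    }
  });
}
```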

Which do you think? Any other options that might work here?

patrickjquinn avatar May 31 '18 11:05 patrickjquinn

Of those two options I prefer the second, otherwise there'd be a large delay after short commands. I'd really like to do it like in this StackOverflow answer: https://dsp.stackexchange.com/a/17629. The problem is that's a lot of work, and it may need some input normalisation, or the silence threshold to be set dynamically, maybe off an average calculation?
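
One way to set that threshold dynamically, as suggested (a sketch assuming an RMS level per PCM chunk is already computed elsewhere): keep an exponential moving average of the level as a noise-floor estimate and only call a chunk silent when it isn't well above that floor. The constants are guesses to be tuned on real microphones:

```js
// Adaptive silence detection: track a running noise-floor estimate and treat
// anything not well above it as silence. The floor only adapts during silence
// so loud speech doesn't drag the estimate upwards.
class AdaptiveSilenceDetector {
  constructor({ alpha = 0.05, ratio = 2.5 } = {}) {
    this.alpha = alpha;   // smoothing factor for the noise-floor average
    this.ratio = ratio;   // how far above the floor counts as speech
    this.floor = null;    // running noise-floor estimate (RMS)
  }

  // Returns true if this chunk's RMS level looks like silence.
  isSilent(rmsLevel) {
    if (this.floor === null) {
      this.floor = rmsLevel; // seed the floor with the first measurement
      return true;
    }
    const silent = rmsLevel < this.floor * this.ratio;
    if (silent) {
      this.floor = (1 - this.alpha) * this.floor + this.alpha * rmsLevel;
    }
    return silent;
  }
}
```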

timstableford avatar May 31 '18 12:05 timstableford

I like it. I'll dig around to see what pre-packaged options are available for Node to do this, possibly FFmpeg.

Otherwise, if we can get a transcription stream piped out of DeepSpeech, we can basically start a short timer immediately which is reset whenever new text is transcribed. So: 'hotword' -> 'start timer' -> 'user speech' -> 'reset timer' -> 'no user speech' -> 'timer stops mic and re-inits hotword detection'.

That's how I've done it on iOS and it works really well.
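
That flow boils down to a resettable inactivity timer; a small sketch (stopMicAndRearm, detector and recognizer are hypothetical stand-ins for the real teardown, hotword and transcription hooks):

```js
// Start a short timer on the hotword, reset it every time new text arrives,
// and tear down the mic (and re-arm hotword detection) when it finally expires.
class UtteranceTimer {
  constructor(stopMicAndRearm, idleMs = 2000) {
    this.stop = stopMicAndRearm;
    this.idleMs = idleMs;
    this.timer = null;
  }

  start() {   // call on hotword detection
    this.reset();
  }

  reset() {   // call whenever new text is transcribed
    clearTimeout(this.timer);
    this.timer = setTimeout(() => this.stop(), this.idleMs);
  }

  cancel() {
    clearTimeout(this.timer);
    this.timer = null;
  }
}

// Usage (hypothetical hooks):
//   const t = new UtteranceTimer(() => mic.stop());
//   detector.on('hotword', () => t.start());
//   recognizer.on('transcript', text => t.reset());
```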


patrickjquinn avatar May 31 '18 14:05 patrickjquinn

The best approach would probably be having a timer after detecting relative silence, but also including a key shortcut, tap detection, or some such mechanism for the user to manually declare they're done; that could speed things up a bit.

i-am-malaquias avatar May 31 '18 17:05 i-am-malaquias