whisper.unity

How to have it always listening without start/stop buttons?

Open ElectroGamesDev opened this issue 1 year ago • 5 comments

I would like to add voice commands to my game. How can I keep it always listening without having to click start and stop buttons? I tried checking the mic volume using audioClip.GetData, but that seems to break after I run microphoneRecord.StartRecord(). How could this be done? Thanks!

ElectroGamesDev, Feb 02 '24 06:02

Right now, there are two ways you can do that:

  1. Use streaming input. Start it once and listen to the OnSegmentUpdated and OnSegmentFinished events. Check the example scene here for more details.
  2. Use a circular microphone buffer. There is a really good implementation in this PR #52 with built-in command detection.

The third option would be to use a simpler network to detect an activation word (like "Alexa" or "Siri") and only start whisper speech recognition after that. However, there is no built-in solution for word spotting.
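For illustration, the circular-buffer idea from option 2 can be sketched as follows. This is a minimal, language-agnostic sketch written in Python, not the implementation from PR #52 and not whisper.unity's API: the class name `RingBuffer` and its methods `write` and `latest` are hypothetical. The point is that the microphone keeps overwriting the oldest samples, so the most recent few seconds are always available for recognition without any start/stop step.

```python
class RingBuffer:
    """Fixed-size circular buffer that always holds the newest audio samples."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._data = [0.0] * capacity
        self._capacity = capacity
        self._write = 0   # index where the next sample will be stored
        self._count = 0   # number of valid samples, never more than capacity

    def write(self, samples) -> None:
        """Append samples, overwriting the oldest ones once the buffer is full."""
        for sample in samples:
            self._data[self._write] = float(sample)
            self._write = (self._write + 1) % self._capacity
            if self._count < self._capacity:
                self._count += 1

    def latest(self, n: int) -> list:
        """Return up to n of the most recent samples, oldest first."""
        n = max(0, min(n, self._count))
        start = (self._write - n) % self._capacity
        return [self._data[(start + i) % self._capacity] for i in range(n)]


# Example: a 4-sample buffer that receives 6 samples keeps only the last 4.
buf = RingBuffer(4)
buf.write([1, 2, 3])
buf.write([4, 5, 6])
recent = buf.latest(4)
```

With 16 kHz audio, a capacity of 16000 * 5 would hold the last five seconds; a recognizer could then periodically call `latest(...)` on whatever window it wants to transcribe.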

Macoron, Feb 02 '24 12:02

Thanks. I checked both of them out, although I'm encountering issues with both solutions.

With the streaming input solution, the streaming seems to stop after being enabled for ~1 minute. So in OnStreamFinished I tried stopping the recording and then starting the stream and the recording again, so that whenever the streaming stops, it starts back up. But after the initial stop at ~1 minute, this caused it to stop and start constantly. This solution also seemed to freeze the editor every so often. Also, after the first few segments, OnFinishSegment() started taking around 20 seconds to run, even though I only talked for a second and it took only 1-2 seconds when it was first started.

With the second solution, I tried out the Voice Commands Demo PR, but it's very delayed. Sometimes inference took 2 seconds to complete, other times it took 12 seconds, although your test video seems to be nearly instant. I'm sure your PC specs are better than mine, but 12 seconds to run inference on two words doesn't seem right.

ElectroGamesDev, Feb 02 '24 20:02

> With the streaming input solution, the streaming seems to stop after being enabled for ~1 minute. So in OnStreamFinished I tried stopping the recording and then starting the stream and the recording again, so that whenever the streaming stops, it starts back up. But after the initial stop at ~1 minute, this caused it to stop and start constantly.

The streaming example scene should have Loop mode set to true in the MicrophoneRecord script. It allows you to record audio for more than 1 minute (the Max Length Sec parameter). Double-check that it's set to true.

> This solution also seemed to freeze the editor every so often. Also, after the first few segments, OnFinishSegment() started taking around 20 seconds to run, even though I only talked for a second and it took only 1-2 seconds when it was first started.

Which model weights are you using (tiny, base, large, etc.)? Could you share your hardware specs? Are you using CPU or GPU inference?

Macoron, Feb 02 '24 21:02

Ah, I never noticed that Loop option. That should fix the issue.

I'm using the Tiny model. My CPU is a Ryzen 5 2600 and my GPU is a GTX 970 (obviously not the best specs, but they shouldn't give results this unreliable). I'm using whatever the default is; I don't see an option to choose between GPU and CPU.

ElectroGamesDev, Feb 03 '24 05:02

You can try CUDA inference. It might be faster on your hardware, but you would need to install the CUDA Toolkit.

You can also try enabling the "Speed Up" setting in the WhisperManager script. It could give better performance at the cost of slightly reduced quality.

Finally, you can play around with the streaming settings, like StepSec or LengthSec in WhisperManager. Maybe you will find a configuration that works better for your use case.
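One rough way to reason about these settings is a simple queue model. The sketch below is in Python and is not whisper.unity code. It assumes (my assumption, not something confirmed in this thread) that a new chunk of audio becomes ready every StepSec, that each inference pass takes a constant time, and that passes run one after another. The function name `backlog_after` and its parameters are hypothetical. Under those assumptions, if a pass takes longer than the step, the results fall further and further behind live audio.

```python
def backlog_after(steps: int, step_sec: float, inference_sec: float) -> float:
    """Return how many seconds the last result lags behind its audio chunk.

    Simplified model: chunk i is ready at i * step_sec, inference is
    sequential, and every pass takes inference_sec seconds.
    """
    finish = 0.0  # time at which the previous inference pass finished
    for i in range(1, steps + 1):
        arrival = i * step_sec         # chunk i has been fully recorded
        start = max(arrival, finish)   # wait for the audio and for the worker
        finish = start + inference_sec
    return finish - steps * step_sec


# Inference faster than the step: the lag stays at one inference pass.
fast = backlog_after(steps=10, step_sec=3.0, inference_sec=1.0)

# Inference slower than the step: the lag grows with every chunk.
slow = backlog_after(steps=10, step_sec=1.0, inference_sec=3.0)
```

In this model, a larger step or a faster pass (smaller model, GPU inference) keeps the lag bounded. Real timings depend on how the stream actually schedules work, so treat this only as a guide for which direction to tune.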

Macoron, Feb 03 '24 09:02