whisper-ctranslate2 improve microphone detection

I tested the live functionality a bit and noticed that the voice recognition did not record properly in some cases because the volume was too low (although I spoke at normal room volume).

I looked at the original code at https://github.com/Nikorasu/LiveWhisper/blob/main/livewhisper.py and found an alternative way of determining the volume there.

Of course, it could be due to my microphone, but with my improvement, it worked much more stably afterwards.

I am aware that I can adjust the sensitivity with --live_volume_threshold, but that only helped to a limited extent. I printed the values returned by np.sqrt(np.mean(indata**2)) and indata.max() to the console, and indata.max() seemed to return slightly higher and, more importantly, more stable values. So I am submitting this change because I think it is a general improvement.

Oct 25 '23 13:10 RustProfi

Hello

We are currently using:

np.sqrt(np.mean(indata**2))

This is the RMS (root mean square), a widely used approach to measuring the loudness of audio.

indata.max() returns the maximum sample value, which can be, for example, a random noise spike, while RMS, since it uses the mean, should provide a more stable number. I think that the current approach is the better option.
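
To illustrate the difference, here is a minimal sketch that compares both measures on made-up data. The sample values and the helper names rms and peak are assumptions for illustration only; they are not taken from the project code.

```python
import numpy as np

def rms(block: np.ndarray) -> float:
    """Root mean square of a block, same formula as np.sqrt(np.mean(indata**2))."""
    return float(np.sqrt(np.mean(block ** 2)))

def peak(block: np.ndarray) -> float:
    """Largest sample value in the block, same as indata.max()."""
    return float(block.max())

# Hypothetical data: 1000 near-silent samples, then the same block with one click.
quiet = np.full(1000, 0.01, dtype=np.float32)
click = quiet.copy()
click[500] = 0.9  # a single loud sample, e.g. a noise spike

quiet_rms, quiet_peak = rms(quiet), peak(quiet)
click_rms, click_peak = rms(click), peak(click)

# The peak jumps to the value of the single spike,
# while the RMS rises only moderately because it averages over the whole block.
print(f"quiet: rms={quiet_rms:.4f} peak={quiet_peak:.4f}")
print(f"click: rms={click_rms:.4f} peak={click_peak:.4f}")
```

With a fixed threshold, a single spike like this would likely trigger the peak-based check but not necessarily the RMS-based one.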

Do you have a paper or article that documents indata.max() vs RMS?

Oct 26 '23 16:10 jordimas

Hi Jordi,

Fair enough. No, I don't have a paper or article on this.

I tested the code with RMS and indata.max() on a different machine and a different mic and I still have the same problem.

The main problem is that no matter what I say, the first word is missing most of the time. It gets a little better with indata.max() because the speech recognition most likely starts earlier.

Can you make the same observation?

After reading the article you sent me, I agree that RMS is the better option. I clapped my hands a bit to try it out, and the indata.max() version triggered much more often.

I was thinking that this could be improved by constantly buffering half a second or so of audio and, once speech is detected, prepending that buffer to the recording. I will try that and let you know if it gets better.
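
Roughly what I have in mind, as a minimal sketch. The class name PreBufferedRecorder, its parameters and the feed method are hypothetical and not part of the current code, and end-of-speech detection is left out to keep it short.

```python
from collections import deque

class PreBufferedRecorder:
    """Keeps the last few quiet blocks so they can be prepended once speech starts."""

    def __init__(self, threshold: float, prebuffer_blocks: int):
        self.threshold = threshold
        # A bounded deque drops the oldest block automatically when it is full.
        self.prebuffer = deque(maxlen=prebuffer_blocks)
        self.recording = []   # blocks of the current utterance
        self.active = False   # True once speech has been detected

    def feed(self, block, volume: float) -> None:
        if self.active:
            # Already recording: keep every block.
            self.recording.append(block)
        elif volume >= self.threshold:
            # Speech detected: put the buffered audio in front of the first loud block.
            self.recording = list(self.prebuffer) + [block]
            self.prebuffer.clear()
            self.active = True
        else:
            # Still quiet: remember the block in case speech starts right after it.
            self.prebuffer.append(block)

# Usage with placeholder blocks (strings stand in for audio arrays).
rec = PreBufferedRecorder(threshold=0.2, prebuffer_blocks=2)
for block, volume in [("a", 0.0), ("b", 0.0), ("c", 0.1), ("d", 0.5), ("e", 0.0)]:
    rec.feed(block, volume)
print(rec.recording)
```

The number of blocks to keep would have to be derived from the block duration so that the buffer covers about half a second.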

Oct 27 '23 14:10 RustProfi

Hi Jordi, I have tested the prebuffering and it now works better than before (at least from my point of view). Feel free to try it out.

I can carry on with this version myself, but have a look and see whether you want to apply the changes to the main project.

PS: If you don't want to use the prebuffer, you can still remove self.prevblock (it doesn't do anything). I also fixed a bug with the buffers_to_process list by using a deque: the buffers were not processed in the correct order when there was more than one entry.
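
A minimal sketch of the ordering idea, using placeholder strings instead of audio buffers. It only shows how a deque keeps first-in, first-out order and is not the exact code from my change.

```python
from collections import deque

# Buffers are queued in the order in which they were recorded.
buffers_to_process = deque()
for name in ("first", "second", "third"):
    buffers_to_process.append(name)

# popleft() always returns the oldest entry, so the order is preserved.
processed = []
while buffers_to_process:
    processed.append(buffers_to_process.popleft())

print(processed)
```

popleft() on a deque is O(1), whereas list.pop(0) has to shift all remaining elements.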

Nov 02 '23 15:11 RustProfi