candle Whisper microphone example outputs gibberish

I am trying to get the Candle Whisper microphone example to work but to no avail.

First, I encountered an issue related to the microphone being repeatedly reacquired, which causes pretty weird behavior on Linux. It turns out lavafroth on GitHub already submitted a PR (https://github.com/huggingface/candle/pull/1864).

However, even with those changes, I am still getting gibberish. Perhaps the buffers are not managed correctly, but I don't understand the code well enough (yet) to see what might be going on there.

For example, here's the output when I say, "Testing... One, two, three.":

Transcribing audio...
language_token: None
0.0s -- 3.3s:  you
1.0s -- 16.4s:  .
2.0s -- 38.3s:  S***
2.0s -- 38.3s:  you
3.0s -- 77.9s:  this this thing this thing this thing
3.0s -- 77.9s:  testing one two three
4.0s -- 131.2s:  this this thing this thing this thing
4.0s -- 131.2s:  testing 123 testing 123 123 testing 123 123 testing 123 123
4.0s -- 131.2s:  testing 1 2 3
5.0s -- 213.3s:  this this thing this thing this thing
5.0s -- 213.3s:  testing 123 testing 123 123 testing 123 123 testing 123 123
5.0s -- 213.3s:  Testing 123 123 Testing 123 123 Testing 123 123 Testing 123 123
5.0s -- 213.3s:  testing 123123
6.0s -- 298.7s:  1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
6.0s -- 298.7s:  testing 123 123 testing 123 123 testing 123 123
6.0s -- 298.7s:  Testing 123 123 Testing 123 123 Testing 123 123
6.0s -- 298.7s:  testing 123123 testing 123123
6.0s -- 298.7s:  testing 123 123 testing 123 123
6.0s -- 298.7s:  you
7.0s -- 570.1s:  testing 123123
7.0s -- 570.1s:  testing 123 123
7.0s -- 570.1s:  testing 123 123
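
In the log above, the segment end times (3.3s, 16.4s, 38.3s, ... 570.1s) grow much faster than the start times, which would be consistent with the buffer hypothesis: samples may be accumulating and getting re-transcribed instead of being consumed. This has not been confirmed against the example's code. As a minimal sketch of what bounded buffering could look like, assuming mono `f32` samples, here is a hypothetical `ChunkBuffer` (not a type from candle or the example) that removes each window once it has been handed out:

```rust
/// Hypothetical helper, not part of candle: accumulates microphone samples
/// and hands out fixed-size windows. Samples that have been handed out are
/// removed, so the buffer cannot grow without bound between transcriptions.
pub struct ChunkBuffer {
    samples: Vec<f32>,
    window_len: usize,
}

impl ChunkBuffer {
    /// `window_len` is the number of samples per window,
    /// e.g. 16_000 * 5 for five seconds of 16 kHz mono audio.
    pub fn new(window_len: usize) -> Self {
        assert!(window_len > 0, "window_len must be non-zero");
        Self { samples: Vec::new(), window_len }
    }

    /// Append newly captured samples.
    pub fn push(&mut self, new_samples: &[f32]) {
        self.samples.extend_from_slice(new_samples);
    }

    /// Return the oldest full window, if one is available, and drop it
    /// from the buffer. Returns `None` while too few samples are queued.
    pub fn pop_window(&mut self) -> Option<Vec<f32>> {
        if self.samples.len() < self.window_len {
            return None;
        }
        Some(self.samples.drain(..self.window_len).collect())
    }

    /// Number of samples still waiting to be transcribed.
    pub fn pending(&self) -> usize {
        self.samples.len()
    }
}
```

The capture side would call `push` with each batch of samples, and the transcription loop would call `pop_window` and only run the model when it returns `Some`. Each sample is then transcribed at most once, so segment timestamps should advance at roughly the speed of the audio.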

May 13 '24 01:05 krzysztofwos

same here

Jul 11 '24 06:07 louis030195

Same. I can also work around the 'flickering' reacquisition of the microphone by increasing the config from 300 to 5000, but the result is still either no output or gibberish.

The Whisper example that reads from a .wav file works fine.

EDIT: just seeing the PR now

Jul 23 '24 08:07 chris-aeviator

Also, I had to get rid of the quantized model, which was constantly using 100% of my GPU (MacBook Pro M3 Max).

Switching to the normal model works well: under 30% GPU with continuous transcription.

Jul 23 '24 10:07 louis030195

@louis030195 so it's working for you? I'm using the standard model as in the example code and see gibberish output.

Jul 23 '24 10:07 chris-aeviator

I think Whisper outputs garbage by default, for example when nobody is speaking (the no-speech token does not seem to be enough?).

My code: https://github.com/louis030195/screen-pipe/blob/main/screenpipe-audio/src/core.rs

It captures voices well (even a shitty French accent).
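
If garbage on silence is part of the problem, one common workaround is to skip quiet chunks before they ever reach the model. The following is only a sketch of a simple energy gate, assuming mono `f32` samples in the range [-1.0, 1.0]. The function names and the threshold are hypothetical and would need tuning per microphone. This is not how the candle example or the linked code is known to handle it.

```rust
/// Root-mean-square level of a chunk of samples in the range [-1.0, 1.0].
/// Returns 0.0 for an empty chunk.
pub fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}

/// Hypothetical gate: true when the chunk is loud enough to be worth
/// transcribing. `threshold` is an assumed tuning value, e.g. 0.01.
pub fn is_probably_speech(samples: &[f32], threshold: f32) -> bool {
    rms(samples) >= threshold
}
```

A transcription loop could call `is_probably_speech(&chunk, 0.01)` and skip the model when it returns `false`. An energy gate like this cannot tell speech from other loud sounds, so it only reduces the number of silent chunks that reach Whisper.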

Jul 23 '24 10:07 louis030195
