Whisper microphone example outputs gibberish
I have been trying to get the Candle Whisper microphone example to work, without success.
First, I ran into an issue where the microphone is repeatedly reacquired, which causes pretty weird behavior on Linux. It turns out lavafroth on GitHub has already submitted a PR for this (https://github.com/huggingface/candle/pull/1864).
However, even with those changes, the output is gibberish. For example, here's what I get when I say, "Testing... One, two, three."
Perhaps the buffers are not managed correctly, but I don't understand the code well enough (yet) to see what might be going on there.
Transcribing audio...
language_token: None
0.0s -- 3.3s: you
1.0s -- 16.4s: .
2.0s -- 38.3s: S***
2.0s -- 38.3s: you
3.0s -- 77.9s: this this thing this thing this thing
3.0s -- 77.9s: testing one two three
4.0s -- 131.2s: this this thing this thing this thing
4.0s -- 131.2s: testing 123 testing 123 123 testing 123 123 testing 123 123
4.0s -- 131.2s: testing 1 2 3
5.0s -- 213.3s: this this thing this thing this thing
5.0s -- 213.3s: testing 123 testing 123 123 testing 123 123 testing 123 123
5.0s -- 213.3s: Testing 123 123 Testing 123 123 Testing 123 123 Testing 123 123
5.0s -- 213.3s: testing 123123
6.0s -- 298.7s: 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
6.0s -- 298.7s: testing 123 123 testing 123 123 testing 123 123
6.0s -- 298.7s: Testing 123 123 Testing 123 123 Testing 123 123
6.0s -- 298.7s: testing 123123 testing 123123
6.0s -- 298.7s: testing 123 123 testing 123 123
6.0s -- 298.7s: you
7.0s -- 570.1s: testing 123123
7.0s -- 570.1s: testing 123 123
7.0s -- 570.1s: testing 123 123
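One thing worth ruling out on the buffer side: Whisper expects 16 kHz mono audio, while a microphone often delivers interleaved stereo at 44.1 or 48 kHz. The following is a minimal sketch of the conversion, not the code from the Candle example or from the PR. It assumes the capture callback hands over interleaved f32 samples, and the function names downmix_to_mono and resample_linear are hypothetical. Linear interpolation has no anti-aliasing filter, so it is only good enough to check whether a rate or channel mismatch is the cause.

```rust
/// Average interleaved channels down to a single mono channel.
/// Any trailing partial frame is dropped.
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resample mono audio from `from_hz` to `to_hz` with linear interpolation.
/// Assumption: quality is not critical here; there is no low-pass filter.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / ratio).floor() as usize;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            // Position of output sample `i` measured in input samples.
            let pos = i as f64 * ratio;
            let idx = (pos.floor() as usize).min(last);
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}
```

With something like this, each captured chunk would be passed through downmix_to_mono and then resample_linear(&mono, device_rate, 16_000) exactly once before being appended to the buffer that is handed to the model.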
same here
Same. I can also work around the 'flickering' reacquisition of the microphone by increasing the config value from 300 to 5000, but the result is either no output or gibberish.
The Whisper example that reads from a .wav file works fine.
EDIT: just seeing the PR now
Also, I had to get rid of the quantized model, which was constantly taking 100% of my GPU (MacBook Pro M3 Max).
Switching to the normal model works well: under 30% GPU with continuous transcription.
@louis030195 so it's working for you? I'm using the standard model as in the example code and see gibberish output.
I think Whisper outputs garbage by default, for example when nobody is speaking (the no-speech token does not seem to be enough?).
my code: https://github.com/louis030195/screen-pipe/blob/main/screenpipe-audio/src/core.rs
It captures voices well (even a shitty French accent).
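If Whisper really does hallucinate text on silent input, one workaround is to skip chunks that are too quiet before they ever reach the model. Below is a minimal sketch of such an energy gate. It is not taken from the Candle example or from the linked code; it assumes mono f32 samples in the range -1.0 to 1.0, and the names rms and is_probably_silence as well as any threshold value are hypothetical and would need tuning per microphone.

```rust
/// Root-mean-square level of a chunk of samples (0.0 for an empty chunk).
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}

/// Returns true when the chunk is quieter than `threshold`.
/// Assumption: a fixed threshold is good enough; a real voice activity
/// detector would be more robust against background noise.
fn is_probably_silence(samples: &[f32], threshold: f32) -> bool {
    rms(samples) < threshold
}
```

A capture loop could call is_probably_silence(&chunk, 0.01) and only run transcription when it returns false; 0.01 is just a starting guess.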