whisper.cpp
Real-time streaming
[WIP]
With the idea in #137 it is possible to make the encoder several times faster. This is beneficial for the `stream` example, because it already processes the audio in short chunks. The decoding quality seems to drop, but I think not significantly.
With the current parameters, I am able to run the following commands in real time on a MacBook M1 Pro:
# real-time transcription with a step of 1.5 seconds using "medium.en"
./stream -m ./models/ggml-medium.en.bin -t 8 --step 1500 --length 7500
# real-time translation with a step of 2.5 seconds using "large"
./stream -m ./models/ggml-large.bin -t 8 --step 2500 --length 7500 --language bg --translate
This was not possible before.
The next thing to try is to run the tiny model in streaming mode in the browser using WASM, with a step of 1 or 2 seconds. I think there is some chance it could actually work.
The performance gain here is absurd. Is there anything I can pitch in to help finalize this PR, @ggerganov? I could not quite understand the issue with the last commit "...stitch encoder outputs together" 😅
The "stitching" is basically instead of running 10 seconds of audio through the encoder at one pass, run for example 5 x 2 second chunks and combine the results in the cross-attention layer to get effectively what we would have gotten with 10 seconds directly. This would allow to process audio more often and be more real-time.
The PR is missing an option to enable/disable the encoder truncation - I currently hardcoded the values. It's not difficult to finalise, but I want to see how I will use it in the streaming examples - I will probably get a better idea for the API.
@meakbiyik This is now on master. Simply add `-ac 512` to the `./stream` arguments and you will enable the 3x faster Encoder.
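For example, building on the real-time transcription command from above (same model and step/length values as earlier - nothing about them is required for `-ac`):

```
# enable the reduced audio context (3x faster encoder) with -ac 512
./stream -m ./models/ggml-medium.en.bin -t 8 --step 1500 --length 7500 -ac 512
```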
Wow, this is great - thanks a lot @ggerganov!
A quick follow-up question: would you recommend the 2x speed-up or reducing the audio context size? Or can I mix them - what was your experience? I do not quite understand why reducing the audio context should also reduce transcription accuracy, so I cannot be sure 😅
Also, interestingly, I have noticed that lowering the step size improves the transcription considerably - so much so that a low step size + the base model is better than 2x the step size + the small model. Is there anything going on behind the scenes that can explain this phenomenon? Does the `-kc` / keep-context option play any role here?
The 2x speed-up does not seem very useful yet in my experience, so I don't recommend using it. Intuitively, the smaller audio context is worse than the full context because you are analysing less data - and less data means worse results.
The step size observation is strange - as long as your hardware is capable of processing the data in real time, the bigger model should always be better, regardless of step size. Regarding the `-kc` flag - I don't use it for `stream` because errors occur more often when doing real-time streaming, and the `-kc` flag can actually propagate those errors into the future transcription.
Interesting, but why is there less data, particularly if the `--length` parameter is set to less than the context? What I assumed was that `--length` worth of data is used (if available) and the rest is padded with zeros, so if we reduce the audio context so that `--length` fits there snugly, there should be no issues. I feel like I totally misunderstood some of these parameters 😅
On step size, I observed that the transcription is "refined" every time the model reruns on the data it has already seen, and more refinements are better, which makes sense if the model has access to the current context of size `--length`.

The `-kc` part makes sense. I actually plan to create a PR to inject arbitrary contexts as you recommended in some previous PR, but let's see what happens 😄
Yeah, actually you have a good point - for a fixed `--length`, if the context is bigger than that, it shouldn't affect the quality. For example, `-ac 512` corresponds to a little more than 10s of context, so for `--length 10000` or less you should be getting the same quality. Your understanding is correct.
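For reference, the arithmetic behind that 10s figure (this relies on Whisper's standard geometry of 20 ms of audio per encoder position, i.e. 1500 positions for a full 30 s window - not something stated in this thread):

$$
512 \times 20\,\mathrm{ms} = 10240\,\mathrm{ms} \approx 10.24\,\mathrm{s}
$$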
> On step size, I observed that the transcription is "refined" every time the model reruns on the data it has already seen, and more refinements are better, which makes sense if the model has access to the current context of size `--length`.
Yes, correct. For example, if you have a `--length` of 10s, then regardless of whether the step is 1s, 2s, 3s, etc., the final pass - when it processes the full 10s chunk - will give the same result. Actually, I now realise that you must not use `-kc` if the `--step` is smaller than `--length`, because it would use the "partial" transcription as text context for the next step and it would definitely get messy.
The `-kc` option has to be reworked as you suggest, to be able to provide the context from the previous `--length` pass for each step of the current `--length` pass. Feel free to give it a shot and don't hesitate to ask if you have any questions.
Perfect, thanks a lot - all of this makes complete sense! I will try to do that `-kc` thing quite soon.

Buuut I have one final follow-up just to understand it better: what happens if `length > audio_context`? Does the model trim from the end? Or is there some downsampling going on?
Currently, it will trim from the end:
https://github.com/ggerganov/whisper.cpp/blob/f2df9bd7689475e73da6480212c1a0e6aa348979/whisper.cpp#L1103-L1104
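In other words (my paraphrase of the effect of the linked code, reusing the 20 ms-per-position figure from above), the encoder keeps only the first `audio_ctx` positions of the window:

$$
n_{\mathrm{used}} = \min\!\big(\mathrm{audio\_ctx},\; \mathrm{length} / 20\,\mathrm{ms}\big)
$$

So with `-ac 512` and `--length 15000`, roughly the last 4.76 s of each window would be dropped.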
A-ha, lovely. Thanks a lot again!
According to #137, I set `-ac 750`, but the result has lots of noise words like "[buzzer]" / "[static]" / "[AUDIO OUT]" - how can I remove them?
BTW, it works well when using the source default of audio_ctx = 0.
Currently, the only way is to manually replace these strings yourself (for example, using regex).
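A minimal sketch of such a cleanup from the shell (the pattern assumes the tags look like the bracketed ones reported above; `sed -u` is GNU sed's unbuffered mode, which helps in a real-time pipe):

```
# strip bracketed non-speech annotations such as [buzzer], [static], [AUDIO OUT]
./stream -m ./models/ggml-medium.en.bin -t 8 --step 1500 --length 7500 -ac 768 \
  | sed -u -E 's/\[[^]]*\] ?//g'
```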
Btw, `-ac 768` is better than `-ac 750` - you want the number to be a multiple of 64 for better performance.
Yes yes! Much better with `-ac 768` set. And I will replace the strings too. Thanks again!