
Real-time streaming


[WIP]

With the idea in #137 it is possible to reduce the encoder time several-fold. This is beneficial for the stream example, because it already processes the audio in short chunks. The decoding quality seems to drop, but I don't think significantly.

With the current parameters, I am able to run the following commands in real-time on a MacBook M1 Pro:

# real-time transcription with a step of 1.5 seconds using "medium.en"
./stream -m ./models/ggml-medium.en.bin -t 8 --step 1500 --length 7500

# real-time translation with a step of 2.5 seconds using "large"
./stream -m ./models/ggml-large.bin -t 8 --step 2500 --length 7500 --language bg --translate

This was not possible before.

The next thing to try is to run the tiny model in streaming mode in the browser using WASM, with a step of 1 or 2 seconds. I think there is some chance it could actually work.

ggerganov avatar Nov 11 '22 20:11 ggerganov

The performance gain here is absurd. Is there anything I can pitch in to help finalize this PR, @ggerganov? I could not quite understand the issue with the last commit "...stitch encoder outputs together" 😅

meakbiyik avatar Nov 17 '22 15:11 meakbiyik

The "stitching" is basically instead of running 10 seconds of audio through the encoder at one pass, run for example 5 x 2 second chunks and combine the results in the cross-attention layer to get effectively what we would have gotten with 10 seconds directly. This would allow to process audio more often and be more real-time.

The PR is missing an option to enable/disable the encoder truncation - I currently hardcoded the values. It's not difficult to finalise, but I want to see how I will use it in the streaming examples - I will probably get a better idea for the API.

ggerganov avatar Nov 17 '22 19:11 ggerganov

@meakbiyik This is now on master. Simply add -ac 512 to the ./stream arguments and you will enable the 3x faster Encoder.
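For example, reusing the earlier medium.en invocation (all parameters other than -ac are just the ones from the first comment):

# real-time transcription with a step of 1.5 seconds and the faster Encoder
./stream -m ./models/ggml-medium.en.bin -t 8 --step 1500 --length 7500 -ac 512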

ggerganov avatar Nov 20 '22 19:11 ggerganov

Wow, this is great - thanks a lot @ggerganov!

A quick follow-up question: would you recommend the 2x speed-up or reducing the audio context size? Or can I mix them - what was your experience? I do not quite understand why reducing the audio context should also reduce transcription accuracy, so I cannot be sure 😅

Also, interestingly, I have noticed that lowering the step size improves the transcription quite a bit - so much so that a low step size + the base model is better than twice the step size + the small model. Is there anything going on behind the scenes that can explain this phenomenon? Does the option -kc / keep context play any role here?

meakbiyik avatar Nov 20 '22 19:11 meakbiyik

The 2x speed-up does not seem very useful yet in my experience, so I don't recommend using it. Intuitively, the smaller audio context is worse compared to the full context because you are analysing less data, and less data means worse results.

The step size observation is strange - as long as your hardware is capable of processing the data in real time, the bigger model should always be better, regardless of the step size. Regarding the -kc flag - I don't use it for stream because errors occur more often when doing real-time streaming, and the -kc flag can actually propagate those errors into the future transcription.

ggerganov avatar Nov 20 '22 19:11 ggerganov

Interesting - but why is there less data, particularly if the --length parameter is set to less than the context? What I assumed was that --length worth of data is used (if available) and the rest is padded with zeros, so if we reduce the audio context so that --length fits there snugly, there should be no issue. I feel like I totally misunderstood some of these parameters 😅

On step size, I observed that the transcription is "refined" every time the model reruns on the data it has already seen, and more refinements are better, which makes sense if the model has access to the current context of size --length.

The -kc part makes sense. I actually plan to create a PR to inject arbitrary contexts as you recommended in some previous PR, but let's see what happens 😄

meakbiyik avatar Nov 20 '22 19:11 meakbiyik

Yeah, actually you have a good point - for a fixed --length, as long as the audio context is bigger than it, the reduction shouldn't affect the quality. For example, -ac 512 corresponds to a little more than 10s of context, so with --length 10000 or less you should be getting the same quality. Your understanding is correct.
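(For reference, that estimate follows from the Whisper encoder's frame rate of 50 positions per second - 1500 positions per 30-second window - so 512 / 50 = 10.24 s, i.e. a little more than 10 seconds.)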

> On step size, I observed that the transcription is "refined" every time the model reruns on the data it has already seen, and more refinements are better, which makes sense if the model has access to the current context of size --length.

Yes, correct. For example, if you have a --length of 10s, then regardless of whether the step is 1s, 2s, 3s, etc., the final pass, when it processes the full 10s chunk, will give the same result. Actually, I now realise that you must not use -kc if --step is smaller than --length, because it will use the "partial" transcription as the text context for the next step and it will definitely get messy.

The -kc option has to be reworked as you suggest, so that it provides the context from the previous --length pass to each step of the current --length pass. Feel free to give it a shot and don't hesitate to ask if you have any questions.

ggerganov avatar Nov 20 '22 20:11 ggerganov

Perfect, thanks a lot - all of this makes complete sense! I will try to do that -kc thing quite soon.

Buuut I have one final follow-up just to understand it better: what happens if --length > audio_ctx? Does the model trim it from the end? Or is there downsampling going on?

meakbiyik avatar Nov 20 '22 20:11 meakbiyik

Currently, it will trim from the end:

https://github.com/ggerganov/whisper.cpp/blob/f2df9bd7689475e73da6480212c1a0e6aa348979/whisper.cpp#L1103-L1104
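So if the window is longer than what -ac covers, the tail of each window is simply ignored - with -ac 512 (about 10.24 s) and a 15-second window, roughly the last 5 seconds are dropped. A hypothetical invocation that would hit this case (the model and the other parameters here are only placeholders):

# --length 15 s with -ac 512 (~10.24 s): the part of each window beyond ~10 s is trimmed
./stream -m ./models/ggml-base.en.bin -t 4 --step 3000 --length 15000 -ac 512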

ggerganov avatar Nov 20 '22 20:11 ggerganov

A-ha, lovely. Thanks a lot again!

meakbiyik avatar Nov 20 '22 20:11 meakbiyik

Following #137, I set -ac 750, but the result has lots of noise words like "[buzzer]", "[static]" and "[AUDIO OUT]" - how can I remove them? BTW, it works well with the source default of audio_ctx = 0.

xyx361100238 avatar Dec 08 '22 04:12 xyx361100238

Currently, the only way is to manually replace these strings yourself (for example, using a regex). Btw, -ac 768 is better than -ac 750 - you want the number to be a multiple of 64 for better performance.
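(For the multiple-of-64 point: 750 is not a multiple of 64 - the nearest ones are 704 = 11 x 64 and 768 = 12 x 64, hence the suggested 768.)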

ggerganov avatar Dec 08 '22 07:12 ggerganov

Yes, yes! Much better with -ac 768 set. And I will replace the strings too. Thanks again!

xyx361100238 avatar Dec 09 '22 01:12 xyx361100238