
Faster streaming support

Open ameenba opened this issue 1 year ago • 27 comments

Have you tried building the spectrogram and encoder output in smaller chunks and appending? I think the spectrogram can be generated chunk by chunk fairly easily with minimal noise, depending on the chunk size, and the encoder output can also be appended as long as the chunks are sufficiently large.

So the encoder instead takes Bx80xN as its input and outputs Bx(N/2)x(embedding size). If you wanted to send 1 s of audio into the tiny.en model, for example: 1x80x100 -> 1x50x384. This should result in much faster processing for short clips (when the audio clip is <30 s), and allows real-time streaming without much wasted computation (like having to calculate a full x1500 encoding for each chunk of audio).

Some noise may be introduced at various chunk sizes (the spectrogram chunk size can be independent of the encoder chunk size), and some overlap of the spectrogram input/encoder output may help reduce that noise further. This also allows for a better-scheduled deployment, where the decoder, encoder, and spectrogram can run on different threads at the same time to produce the transcription.

Choosing when to decode will be another challenge, since you don't want to decode while a word is still incomplete in the encoding, but there are definitely solutions around that as well.

ameenba avatar Nov 10 '22 15:11 ameenba

Oh my god! I just tested that and it seems to work o.O I reduced the audio context by half and the performance doubled. jfk.wav transcribed correctly!

Very interesting... I have to go now, but I think this is a big breakthrough. Need to double check.

ggerganov avatar Nov 10 '22 15:11 ggerganov

Here is the diff if someone wants to play with it:

git diff
diff --git a/whisper.cpp b/whisper.cpp
index 7078863..df47bff 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -1053,6 +1053,7 @@ static bool whisper_model_load(const std::string & fname, whisper_context & wctx
             return false;
         }
     }
+    model.e_pe->ne[1] /= 2;
 
     fin.close();
 
@@ -1076,7 +1077,7 @@ static bool whisper_encode(
     const auto & mel_inp = wctx.mel;
     const auto & hparams = model.hparams;
 
-    const int n_ctx   = hparams.n_audio_ctx;
+    const int n_ctx   = hparams.n_audio_ctx/2;
     const int n_state = hparams.n_audio_state;
     const int n_head  = hparams.n_audio_head;
     const int n_layer = hparams.n_audio_layer;
@@ -1474,7 +1475,7 @@ static bool whisper_decode(
     const int n_layer = hparams.n_text_layer;
 
     const int N = n_tokens;
-    const int M = hparams.n_audio_ctx;
+    const int M = hparams.n_audio_ctx/2;
 
     struct ggml_init_params params = {
             .mem_size   = wctx.buf_compute.size(),

ggerganov avatar Nov 10 '22 15:11 ggerganov

@ggerganov amazing work on this project 👍 Another demo idea: Typing comments while doing an asciinema cast is so 2021...

gitslav avatar Nov 11 '22 16:11 gitslav