
Long running transcription using webgpu-whisper

Open iamhitarth opened this issue 1 year ago • 1 comments

Question

Noob question - the webgpu-whisper demo does real-time transcription, but it doesn't build out a full transcript from the start, i.e. 2 minutes into transcription, the first few transcribed lines disappear.

Transcript at time x 👇

Cool, let's test this out. We'll see how this works. So turns out that the transcription when I try to access it is actually just empty. And so the only thing that actually comes through is. So yeah, so the output that's getting cut is basically coming from the

Transcript at time x+1 👇

this out, we'll see how this works. So turns out that the transcription when I try to access it is actually just empty. And so the only thing that actually comes through is. So yeah, so the output that's getting cut is basically coming from the work

Note how the "Cool, let's test" is missing from the start of the second transcript.

I'm wondering what it would take to keep building the transcript for a long running meeting without losing any of the previously transcribed stuff?

I tried a naive appending approach and that just results in a transcript full of repetition.

So I'm very curious about what it would take to build out a streaming transcription similar to what something like Deepgram would offer. Would that require a change to the pipeline? Are there models that can take an appended transcript with lots of repetition and trim it down to a clean transcript?

Please let me know if my questions are unclear. Just looking for some direction so that I can potentially put up a PR for this (if needed).

iamhitarth avatar Jun 10 '24 16:06 iamhitarth

Hi there 👋 Indeed, that demo only considers the latest 30 seconds of audio, and was more to showcase the ability of the model to run in real-time with WebGPU. The rest of the pipeline should be implemented by the user, since this is out-of-scope for the transformers.js library (at least for now). I suggest you take a look at this paper, which details a nice way of doing this.

Hope that helps!

xenova avatar Jun 19 '24 09:06 xenova

@xenova Sorry for the necro. I have read the whisper-streaming paper and am currently implementing LocalAgreement with n=2 in my hobby application project. While implementing it, I encountered some difficulties, and it would be nice if you could point me in the right direction.

My application uses WebGPU Whisper for ASR. Its transcription output is passed to either server-side NLLB-200 or WebGPU NLLB-200 for translation (if #1317 can be fixed), and the translation output is then passed to either WebGPU LLM Chat or GPT-4o in the cloud for summarization into Markdown notes. However, due to Whisper's model architecture, the beginning of a sentence is often included in several consecutive audio chunks when streaming, so it gets processed multiple times and shows up in multiple transcription outputs.
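For context, the ASR → translation part of that chain, reduced to the transformers.js `pipeline` API, looks roughly like this (the checkpoint names and language codes are just placeholders for what I am experimenting with, and the summarization step is omitted):

```js
import { pipeline } from '@xenova/transformers';

// Load the two stages once, up front.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-base',            // placeholder Whisper checkpoint
);
const translator = await pipeline(
  'translation',
  'Xenova/nllb-200-distilled-600M', // placeholder NLLB checkpoint
);

// `audio` is a Float32Array of 16 kHz mono samples for the current window.
async function transcribeAndTranslate(audio) {
  const { text } = await transcriber(audio);
  const [out] = await translator(text, {
    src_lang: 'eng_Latn',
    tgt_lang: 'zho_Hant',
  });
  return { text, translation: out.translation_text };
}
```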

Example:

- T=n: In January 2025
- T=n+a: In January 2025, the Chinese company DeepSeek shocked the world.
- T=n+a+b: In January 2025, the Chinese company DeepSeek shocked the world with the release of R1.
- T=n+a+b+c: In January 2025, the Chinese company DeepSeek shocked the world with the release of R1. A highly competitive language model that requires only a fraction of the compute of other leading models.

The output is passed to the translation model four times:

- 在2025年1月
- 在2025年1月, 中國公司DeepSeek以發佈R1震驚全世界
- 在2025年1月, 中國公司Deepseek以推出R1震驚全世界, 該公司也公開發佈了R1型號
- 在2025年1月, 中國公司Deepseek以推出R1震驚全世界, 還是更令人震驚的是Deepseek與其大部分美國同行不同

Which is not very efficient... And then the repeated texts are passed to the summarization LLM, which wastes a lot of unnecessary context and increases the cost of prefilling the prompt.

My plan is to implement LocalAgreement-2 so that it becomes (text displayed in gray in the UI to indicate unconfirmed by LA-2, white to indicate confirmed):

- T=n: In January 2025 (gray)
- T=n+a: In January 2025 (white), the Chinese company DeepSeek shocked the world. (gray)
- T=n+a+b: In January 2025, the Chinese company DeepSeek shocked the world (white) with the release of R1. (gray)
- T=n+a+b+c: In January 2025, the Chinese company DeepSeek shocked the world with the release of R1. (white) A highly competitive language model that requires only a fraction of the compute of other leading models. (gray)

Then, for each audio chunk's transcription, if the white part is a complete sentence terminated by one of {. ! ? ~ ... ......}, the white part is passed to NLLB for translation, with the previous complete sentence as the prompt of NLLB. This way, not only are fewer translations made, but the translation also benefits from the longer context of a complete sentence instead of a partial one, so it captures the semantics better. As a result, the summarization notes from the chat LLM should be more accurate and require fewer tokens for prefilling.
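To make the idea concrete, here is a rough word-level sketch of the LocalAgreement-2 bookkeeping I have in mind (the sentence-terminator set and the whitespace splitting are my own simplifications, not anything from transformers.js):

```js
// Sentence terminators I treat as "complete sentence" markers (my assumption).
const SENTENCE_END = /[.!?~…]\s*$/;

// Longest common prefix of two word arrays.
function longestCommonPrefix(prevWords, currWords) {
  let i = 0;
  while (i < prevWords.length && i < currWords.length && prevWords[i] === currWords[i]) {
    i++;
  }
  return currWords.slice(0, i);
}

// Compare the two most recent hypotheses: the agreed prefix is "confirmed"
// (white in the UI), the rest stays "unconfirmed" (gray) until the next chunk.
function localAgreement2(prevHypothesis, currHypothesis) {
  const prev = prevHypothesis.split(/\s+/).filter(Boolean);
  const curr = currHypothesis.split(/\s+/).filter(Boolean);
  const confirmedWords = longestCommonPrefix(prev, curr);
  const confirmed = confirmedWords.join(' ');
  const unconfirmed = curr.slice(confirmedWords.length).join(' ');
  // Only hand complete sentences to NLLB, so it sees full-sentence context.
  const readyForTranslation = SENTENCE_END.test(confirmed);
  return { confirmed, unconfirmed, readyForTranslation };
}

// Note: with this naive splitting, "2025" !== "2025," so the agreed prefix can
// stop one word earlier unless punctuation is normalized first, which is
// exactly the ambiguity described below.
```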

However, the problem is that the timestamp tokens are always relative to the current audio chunk, and I cannot accurately trim the next audio chunk. Example:

- T=n: `<|0.00|> In January 2025, the<|5.10|>`
- T=n+a: `<|0.00|> In January 2025, the Chinese company DeepSeek shocked the world with the release of R1,<|8.66|>`...

For these two consecutive chunks, the longest common prefix is "In January 2025, the", so I would use the `<|5.10|>` token right after the "the" token to trim the T=n+a+b chunk. This is fine if (a+b) is negligibly small, but that is not always the case, even when the GPU in use can run at a sufficiently high tokens/s. An even worse case:

- T=n: `<|0.00|> In January 2025, the<|5.10|>`
- T=n+a: `<|0.00|> In January 2025, a Chinese company called DeepSeek shocked the world with the release of R1,<|8.66|>`...

In this case, the longest common prefix becomes "In January 2025,", which means the last token of the prefix is now "2025" or "," depending on whether whitespace and punctuation are ignored. Either way, that token is not followed by a timestamp token. If I trim T=n+a+b with the `<|5.10|>` timestamp anyway, it is very likely that the "a" in T=n+a will be lost.
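The workaround I am considering is to only ever trim at the end of a segment that lies entirely inside the confirmed prefix, and otherwise not trim at all, roughly like this (assuming the `chunks` array shape returned by the ASR pipeline with `return_timestamps: true`; whether word-level timestamps via `return_timestamps: 'word'` would be a cleaner fix is something I have not tested):

```js
// Collapse whitespace so segment text can be compared against the prefix.
function normalize(s) {
  return s.replace(/\s+/g, ' ').trim();
}

// `chunks` is assumed to be [{ text, timestamp: [start, end] }, ...] relative
// to the current window; `confirmedText` is the LocalAgreement-2 prefix.
// Returns the offset (in seconds) at which the audio buffer can be trimmed
// without risking loss of unconfirmed speech.
function safeTrimOffset(chunks, confirmedText) {
  const confirmed = normalize(confirmedText);
  let consumed = '';
  let trimAt = 0; // 0 means "do not trim at all"
  for (const { text, timestamp } of chunks) {
    consumed = normalize(consumed + ' ' + text);
    const end = timestamp[1];
    if (end != null && confirmed.startsWith(consumed)) {
      trimAt = end; // whole segment is confirmed: safe cut point
    } else {
      break; // segment only partially confirmed: keep its audio, stop here
    }
  }
  return trimAt; // drop trimAt * sampleRate samples from the front of the buffer
}
```

This is deliberately conservative: in the "In January 2025," case above it would simply not trim, at the cost of re-transcribing some audio, instead of cutting at `<|5.10|>` and losing the "a".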

Here is a video demo to visualize this better, as my paragraph-structuring skills in English may not be as good as I thought: https://github.com/user-attachments/assets/8e219064-279a-44cd-a81c-01f62eb6f0aa

SignOfZeta avatar May 30 '25 05:05 SignOfZeta