WLK with SimulStreaming as backend has severe performance issues after ~1:30 min
Hey there, I am currently testing WLK to see whether it would make sense for us to use it. We're looking for something that can help hearing-impaired peers participate better in (video) conferences.
I have a workstation with 64 GB of RAM, an RTX 4000 Ada (Laptop) GPU with 12 GB of VRAM, and a 14900HX CPU. Unfortunately, we're forced to use Windows in production for clients. We currently plan to run it on-client directly.
I have observed that, no matter whether it's large-v3-turbo or tiny, VRAM usage with simulstreaming stays steady at 4.5-5 GB until around 1:20 min, then jumps to 11-11.5 GB within 20 seconds.
When I use localAgreement instead of simulstreaming with the default settings, VRAM usage is stable and there is a bit more GPU utilization, but the total speed is lower.
With simulstreaming, the lag before the VRAM spikes is 0.2-0.5 s as indicated by the UI. Once the VRAM fills up, it just gets worse and worse.
For localAgreement, the lag is constant, but usually above 1 s.
Any ideas on how to improve this? We're trying to get real-time transcription with the best possible accuracy, running on the client.
Currently I am working with the following command:
wlk --model medium --language de --diarization --diarization-backend sortformer --backend-policy simulstreaming --beams 1 --frame-threshold 20 --audio-max-len 10 --preload-model-count 1 --disable-punctuation-split
However, the same issue occurs with the following command too:
wlk --model medium --language de --backend-policy simulstreaming --beams 1 --frame-threshold 20 --audio-max-len 10 --preload-model-count 1 -l DEBUG
(Turning off diarization didn't help much, and it's a feature that is important to my colleague.)
Thanks in advance and thank all of you for the work you've invested in this!
Also, it tends to repeat a word or phrase endlessly, for minutes on end, even if you don't say anything.
Hi, thank you for the feedback!
Regarding the word-repeat issue, I am already working on it; I've identified the root cause.
For the speed/vram issue, could I ask a few clarifications to better understand the pattern?
- When you say the VRAM usage starts spiking after 1:20, is that consistent? (e.g., always around 1:15-1:30, or does it vary a lot?)
- Does something specific happen around that moment in your setup? For example: a new speaker starts talking, a long silence, a different audio device, or anything else that could trigger a buffer/state reset or a new model load? The next version will no longer require --preload-model-count, so if that is the trigger, it might solve the problem.
- Does the VRAM spike happen with any audio input, or are you doing your tests on a specific pre-recorded audio file?
Thanks again for taking the time to test WLK!
Hi Quentin! Thanks again for your work on this! Really appreciate it!
The VRAM tends to idle at 4.5-5 GB until 1:15-1:30. Usually around 1:20 it starts climbing slowly, closing in on 6 GB. Then it accelerates: 7, 7.5, 8, 8.5, and so on; within 20 seconds it goes from 4.5 to 11 GB. The growth in VRAM usage isn't linear, it suddenly speeds up.
I saw this behaviour when just talking into the mic to check the accuracy etc. It was a more or less nonstop stream of speech at a fairly steady WPM, I guess.
I also tried it with a colleague (with impaired hearing and hearing aids). There I noticed the diarization behaved weirdly, and I observed the same VRAM behaviour. It felt like it switched the "speaker" every time you paused for more than a second. Maybe I have misunderstood the diarization feature so far?
I just tested pure silence: when there is only silence, there is no VRAM issue. I also tested a bit of smalltalk (a few sentences) with a colleague followed by just silence: no issue.
Then I let it sit idle in the background without anyone speaking at all. However, once I said a few more sentences some minutes later, asking my colleague about a technicality, it triggered the VRAM issue! And with the VRAM issue also came the word repetitions, even though I wasn't saying anything anymore.
Attached is a screenshot of what happened earlier with a timeline.
So, to conclude, I believe it has something to do with the amount of tokens/context the model is fed?
I hope this helps! Best regards!
P.S.: I haven't yet tested how it reacts to simulated input (a .wav file etc.); so far we have just tried different scenarios with our workstation's microphone literally eavesdropping. I only found out about WLK on Friday, and setting it up on Windows also took a bit, hence I am not yet super deep into it!
Alright! The situation improves drastically once I add
wlk ... --max-context-tokens 100
or 125 etc.
Interestingly, it did happen again, but only after quite some time of speech (with diarization). When it happened, it looped. So maybe the fix you're working on for the looping will fix this too, to a certain degree?
From what I can tell, the VRAM spike seems to happen because SimulStreaming keeps building up internal state during longer sessions. The KV cache grows with every decoded token across all layers, the audio segments keep piling up until a fairly late cleanup point, and the attention matrices also get stored at every step. After around 1:30 min, all of this seems to add up and push VRAM usage. Does that sound right?
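As a rough back-of-the-envelope for the KV-cache part (a sketch with illustrative dimensions, not WLK's or SimulStreaming's actual internals):

```python
def kv_cache_bytes(tokens, layers=32, heads=20, head_dim=64, dtype_bytes=2):
    """Rough decoder KV-cache size: keys + values for every layer.

    The layer/head/dim values are only illustrative (roughly
    Whisper-large-ish); they are not taken from WLK or SimulStreaming.
    """
    per_token = layers * 2 * heads * head_dim * dtype_bytes  # K and V
    return tokens * per_token

# Growth is linear in decoded tokens, so an uncapped context keeps
# eating VRAM, while capping the context caps the cache:
print(kv_cache_bytes(448) / 2**20)  # ~448-token context -> 70.0 MiB
print(kv_cache_bytes(100) / 2**20)  # --max-context-tokens 100 -> 15.625 MiB
```

The cache alone is only a fraction of the spike, of course; the encoder states and per-step attention matrices mentioned above would add on top. The point is just the linear-in-tokens shape.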
LocalAgreement doesn't show the same jump, probably because it processes audio in separate chunks and resets state between them?
P.S.: I managed to trigger the looping issue without talking. That immediately, within seconds, triggered the VRAM issue.
P.P.S.: It works for 12 minutes on medium with --max-context-tokens 125 and diarization without issue.
Used a podcast episode to test it. I'll try large-v3-turbo next.
P.P.P.S.: large-v3-turbo worked for about 6 minutes before the looping happened.
I have also noticed that diarization without
--disable-punctuation-split
is better because there are no "..."s in between words, but it switches speakers around very quickly. That leads to words commonly being detected (in grey) but then dropped in the UI.
Thanks a lot for all these details
So, localAgreement is much simpler internally; it doesn't interact with the model's internal cache/state the same way AlignAtt does. That's why it doesn't show the same VRAM spike, but it is slower too.
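For intuition, the LocalAgreement policy only commits the prefix that two consecutive hypotheses agree on, so nothing model-internal has to survive between chunks. A minimal sketch of that idea (illustrative, not WLK's actual implementation):

```python
def agreed_prefix(prev_hyp, new_hyp):
    """Longest common prefix of two token lists: the part both
    consecutive hypotheses agree on and that is safe to commit."""
    out = []
    for a, b in zip(prev_hyp, new_hyp):
        if a != b:
            break
        out.append(a)
    return out

# Re-transcribing a growing audio window yields slightly different
# tails each time; only the stable prefix is shown to the user:
prev = ["the", "cat", "sat", "on"]
new = ["the", "cat", "sat", "in", "the"]
print(agreed_prefix(prev, new))  # -> ['the', 'cat', 'sat']
```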
Great catch on --max-context-tokens. By default it depends on the Whisper model and is usually around 448, so reducing it naturally limits KV-cache growth; that's probably why you no longer see the problem.
About diarization: since 0.2.15, --disable-punctuation-split is no longer connected (I forgot to remove it from the README). So the "…" behaviour is basically random noise due to initialisation: early audio frames can vary a lot, and diarization is quite sensitive there. The same goes for Whisper.
You can also try again with 0.2.16.dev0: it includes fixes around model reloading which might help with the late session VRAM spike you’re seeing.
Interesting! Thanks for the insight! I'll give 0.2.16.dev0 a try then...
A different question on the side: do you have any recommendations for merging mic input from a user with incoming audio from, let's say, a video call, so that the user's voice is transcribed as well as the voices of the other participants?
Hi! First of all, thanks for the repo and all the great work, amazing job! Second, I'm experiencing the same issue as described in the ticket, so I'm curious about its current status and solution. :)
I've tried using --max-context-tokens, but I still run into the issue, just slightly later.
I've also been running --diarization simultaneously, and I can see that diarization_lag is constantly increasing. That might be unrelated and might already be reported as a separate issue, but I just wanted to mention it.
You would need a loopback library to record the output of an application or of the computer. On Windows, you could use https://github.com/s0d3s/PyAudioWPatch for instance.
I do not use Windows so I cannot test it, but if you have feedback on it, I'd be happy to hear!
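Once you have both streams (mic via PyAudio, system/call audio via a loopback device), they still need to be mixed into one stream before transcription. A minimal sketch of sample-wise mixing for 16-bit mono PCM chunks (the capture code itself is omitted; see the PyAudioWPatch README for the loopback part):

```python
import struct

def mix_pcm16(chunk_a: bytes, chunk_b: bytes) -> bytes:
    """Mix two 16-bit little-endian mono PCM chunks by summing the
    samples and clipping to the int16 range."""
    n = min(len(chunk_a), len(chunk_b)) // 2  # number of common samples
    a = struct.unpack(f"<{n}h", chunk_a[: 2 * n])
    b = struct.unpack(f"<{n}h", chunk_b[: 2 * n])
    mixed = (max(-32768, min(32767, x + y)) for x, y in zip(a, b))
    return struct.pack(f"<{n}h", *mixed)
```

Both streams must share the same sample rate and channel count first (resample if they differ); the mixed chunks can then be fed to the transcriber as if they came from a single microphone.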
OK, thank you for the feedback. I am surprised about the diarization_lag, it's a lightweight model. Are you using the 0.2.16 version of WLK?
Hi, no, I haven't had time to test 0.2.16 yet, just 0.2.15. I do not get diarization_lag in the log, but I get updates in the interface.
Do you want a separate issue for this? It might not be related.
FYI:
I use restore_from() in _load_model() for SortformerDiarization, with diar_streaming_sortformer_4spk-v2.nemo.
I think this should be the same model that's used in an online system, and that it shouldn't affect this issue.