
CUDA failed with error out of memory

Open zallesov opened this issue 1 year ago • 7 comments

Hello WhisperX developers. Thanks for open-sourcing this code.

Since another similar question was closed without an answer, I will repeat it.

I'm getting a CUDA failed with error out of memory error at the diarization step. It happens only on some files, and they are not especially long: just 11+ minutes.

The setups I've tried:

  • GPU: nvidia-tesla-t4 and nvidia-tesla-l4
  • GPU count: 1 and 2
  • GPU memory: 16 GB per GPU
  • Batch size: 32, 24, 20, and 16*
  • Model: large-v2**
  • Compute types: int8 and float16

* With batch size 16, transcription takes over a minute and times out. We run it on Google's Vertex AI, which has a hard limit on prediction duration.
** Other models often produce weird results, replacing multiple words with the same token, like "7 7 7 7 7..." or "with with with...". Plain Whisper is subject to the same issue. Reducing the batch size and using a smaller model would have been a solution, but given these limitations we cannot go that path.

I've also tried allocating the alignment and diarization models on the CPU, but that had no effect. I've tried adding garbage collection and torch.cuda.empty_cache(), but that did not help either.

Please share any ideas of further improvements I should try. Thank you.

zallesov avatar Jul 26 '23 10:07 zallesov
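For readers landing here: the cleanup routine the original post refers to is, in minimal form, something like the sketch below. This is the generic PyTorch pattern, not a WhisperX-specific API, and as this thread shows it is often insufficient on its own, because any surviving Python reference to a model keeps its allocations alive.

```python
import gc

def release_cuda_cache():
    """Best-effort GPU cleanup: collect dropped Python references, then ask
    PyTorch to return cached CUDA blocks to the driver. Callers must first
    `del` their own references to the model objects, or nothing is freed."""
    collected = gc.collect()  # number of unreachable objects collected
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached blocks back to the driver
    except ImportError:
        pass  # torch absent; nothing cached to release
    return collected
```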

Same issue. del model and gc.collect() don't completely free up the GPU. I am looking into it.

Ntweat avatar Aug 10 '23 05:08 Ntweat

same issue +1

simonkuang avatar Nov 10 '23 18:11 simonkuang

I have also noticed GPU memory not getting freed after each inference. Is there any way to clear GPU memory efficiently after each inference run?

kurianbenoy-sentient avatar Dec 14 '23 03:12 kurianbenoy-sentient
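One reliable way to guarantee memory really is returned after each inference is process isolation: run each file in a child process, so the OS and CUDA driver reclaim everything the moment it exits. A minimal sketch follows; the commented whisperx CLI invocation is illustrative (flags as used elsewhere in this thread), only the runner itself is real code.

```python
import subprocess
import sys

def run_isolated(argv):
    """Run one command in a fresh OS process and return its exit code.
    When the child exits, all of its memory (GPU included) is reclaimed,
    so nothing can leak from one file to the next."""
    return subprocess.run(argv, check=False).returncode

# Hypothetical per-file usage:
# run_isolated(["whisperx", "audio.mp3", "--model", "large-v2",
#               "--batch_size", "16", "--output_format", "srt"])
```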

The way I found to work around the issue is to delete the whisperx module and re-import it for every file. Please find the code attached:

import gc  # needed for the gc.collect() call at the end

def whisperx_trans(audio_file):
    import whisperx  # re-imported on every call so it can be deleted below

    # batch_size, device, and number_tokens are assumed to be defined in the
    # enclosing scope (e.g. device = "cuda").
    model = whisperx.load_model("large-v2", "cuda", asr_options={"suppress_tokens": [-1] + number_tokens})
    audio = whisperx.load_audio(audio_file)
    result = model.transcribe(audio, batch_size=batch_size)
    print(result["segments"])  # before alignment

    model_a, metadata = whisperx.load_align_model(language_code="en", device=device)
    result2 = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
    print(result2["segments"])  # after alignment

    diarize_model = whisperx.DiarizationPipeline(use_auth_token="<API_Key>", device=device)
    diarize_segments = diarize_model(audio_file)

    result4 = whisperx.assign_word_speakers(diarize_segments, result2)
    print(diarize_segments.head())
    print(diarize_segments.columns)
    print(result4["segments"])  # segments are now assigned speaker IDs

    # Drop every reference that can pin GPU memory, then flush the cache.
    del model, model_a, diarize_model, whisperx
    import torch
    torch.cuda.empty_cache()
    del torch
    gc.collect()
    return result4, diarize_segments

Ntweat avatar Dec 14 '23 05:12 Ntweat

Same issue, except it kicks in directly at transcription and affects all my production files. My error message:

Traceback (most recent call last):

  Cell In[33], line 16
    model = whisperx.load_model(my_model, device, compute_type = compute_type, language = my_language)

  File C:\ProgramData\miniconda3\envs\transcription\lib\site-packages\whisperx\asr.py:288 in load_model
    model = model or WhisperModel(whisper_arch,

  File C:\ProgramData\miniconda3\envs\transcription\lib\site-packages\faster_whisper\transcribe.py:130 in __init__
    self.model = ctranslate2.models.Whisper(

RuntimeError: CUDA failed with error out of memory

This happens even with a batch size of 2 (!)

Has anyone found a solution?

tomwagstaff-opml avatar Feb 07 '24 15:02 tomwagstaff-opml
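A note on this traceback: it fails inside the ctranslate2.models.Whisper constructor, i.e. while the weights are being copied to the GPU, before any audio is processed, so batch_size cannot affect it. Below is a back-of-envelope weight-size estimate for large-v2 (roughly 1.55B parameters; CTranslate2 runtime overhead is not counted, so treat these as lower bounds).

```python
def weight_gib(params=1.55e9, bytes_per_param=2):
    """Approximate weight memory in GiB: 2 bytes/param for float16, 1 for int8."""
    return params * bytes_per_param / 2**30

# float16: ~2.9 GiB of weights; int8: ~1.4 GiB. When loading itself OOMs,
# compute_type="int8" (already mentioned earlier in this thread) is the
# usual first lever to try.
```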

Actually it was working fine on files of the same duration across a batch of files; after a few hours it started giving this issue.

gamingflexer avatar Apr 25 '24 21:04 gamingflexer

This is my use case:

No diarization yet. It fails at the transcription stage. Using Catalan => --language ca. Using a GPU (Nvidia).

Running it inside docker with variations of this main command:

docker run --rm --name whisperx-test --gpus all -it -v ${PWD}/data:/app -v ${PWD}/language-models:/.cache ghcr.io/jim60105/whisperx:latest -- --model large --language ca --output_format srt blai-catala-curta.mp3
  • I did a file of 1'15" with --model medium => 🟩 Went well
  • I did the same file of 1'15" with --model large => 🟩 Went well, lasted 39s
  • I did another file of 35'01" with --model large => 🟥 Failed (RuntimeError: CUDA failed with error out of memory)
  • I repeated the same file of 1'15" with --model large => 🟩 Went well, lasted 39s
  • I repeated the same file of 35'01" with --model large => 🟥 Failed (RuntimeError: CUDA failed with error out of memory)
  • For this large file of 35'01" I changed to --model medium => 🟥 My computer powered off (maybe GPU overheating? I don't really know)

I want to use the best model (even large-v3 if possible) and I don't mind if it runs slower (chunking, running it on the CPU, giving it some sort of virtual memory, etc.). I don't mind waiting, but I want "the best transcription".

My questions:

  • Does it try to load the whole mp3 file into memory and transcribe it in one shot? I thought it would "stream" it and transcribe it in portions.
  • Is this what you call "batch size"?
  • If so, is there an overlapping moving window? How do we know that chunking the input won't cut a word in the middle?
  • Is it possible to use some sort of virtual memory? (I prioritize "not hanging" over "speed".)

I am very new to WhisperX and I don't know what to do.

xmontero avatar Sep 07 '24 19:09 xmontero
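To partially answer the questions above (hedged, based on how WhisperX is documented to work): load_audio decodes the entire file to a 16 kHz mono float32 array up front; a VAD model then cuts it into chunks of up to ~30 s at speech pauses, so words are not cut in the middle; and batch_size is how many of those chunks go through the model per forward pass, which makes it the main GPU-memory lever. The decoded array itself lives in host RAM, and its size is easy to estimate:

```python
def audio_array_bytes(minutes, sample_rate=16000, bytes_per_sample=4):
    """Host-RAM footprint of the decoded waveform (float32 mono at 16 kHz)."""
    return int(minutes * 60 * sample_rate * bytes_per_sample)

# The 35'01" file above is only ~134 MB as a decoded waveform, so host RAM is
# rarely the bottleneck; GPU memory scales with batch_size, not file length.
```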