whisperX
CUDA failed with error out of memory
Hello WhisperX developers. Thanks for open-sourcing this code.
Since another similar question was closed without an answer, I will repeat it here.
I'm getting the error `CUDA failed with error out of memory` at the diarization step. It happens only on some files, and they are not particularly long: just over 11 minutes.
The setups I've tried:
- GPU: nvidia-tesla-t4 and nvidia-tesla-l4
- GPU count: 1 and 2
- GPU memory: 16 GB per GPU
- Batch size: 32, 24, 20, and 16*
- Model: large-v2**
- Compute types: int8 and float16
\* With batch size 16, transcription takes over a minute and times out. We run it on Google's Vertex AI, which has a hard limit on prediction duration.
\** Other models often produce weird results, replacing multiple words with the same token, like `7 7 7 7 7...` or `with with with...`. Plain Whisper is also subject to the same issue.
Reducing the batch size and using a smaller model would have been a solution, but given these limitations we cannot go that way.
I've also tried allocating the alignment and diarization models on the CPU, but that had no effect.
I've tried clearing the cache with `torch.cuda.empty_cache()` and forcing garbage collection, but that also did not help.
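For reference, the cleanup I attempted looks roughly like this helper (the function name and the dict-based interface are just for illustration; in my script `empty_cache` is `torch.cuda.empty_cache`):

```python
import gc

def release_models(namespace, names, empty_cache):
    """Drop references to finished models, then clear the CUDA cache.

    `namespace` is a dict such as globals(); `names` lists the variables
    holding the models; `empty_cache` is a callable like
    torch.cuda.empty_cache. Everything here is illustrative.
    """
    for name in names:
        namespace.pop(name, None)  # remove the reference so it can be collected
    gc.collect()                   # collect cyclic garbage holding GPU tensors
    empty_cache()                  # ask the allocator to release cached blocks
```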
Please share any ideas of further improvements I should try. Thank you.
Same issue. `del model` and `gc.collect()` don't completely free up the GPU. I am looking into it.
same issue +1
I have also noticed GPU memory not getting freed after each inference. Is there any way to clear GPU memory efficiently after each inference run?
The way I found to solve the issue is to delete the `whisperx` module and re-import it for every file. Please find the code below:
```python
import gc

def whisperx_trans(audio_file):
    # Re-import whisperx on every call so its module state is rebuilt per file.
    import whisperx

    # `device`, `batch_size`, and `number_tokens` are defined elsewhere in my script.
    model = whisperx.load_model("large-v2", "cuda",
                                asr_options={"suppress_tokens": [-1] + number_tokens})
    audio = whisperx.load_audio(audio_file)
    result = model.transcribe(audio, batch_size=batch_size)
    print(result["segments"])  # before alignment

    model_a, metadata = whisperx.load_align_model(language_code="en", device=device)
    result2 = whisperx.align(result["segments"], model_a, metadata, audio, device,
                             return_char_alignments=False)
    print(result2["segments"])  # after alignment

    diarize_model = whisperx.DiarizationPipeline(use_auth_token="<API_Key>", device=device)
    diarize_segments = diarize_model(audio_file)
    result4 = whisperx.assign_word_speakers(diarize_segments, result2)
    print(diarize_segments)
    print(result4["segments"])  # segments are now assigned speaker IDs
    print(diarize_segments.head())
    print(diarize_segments.columns)

    # Drop every reference, including the modules themselves, then clear the cache.
    del model, model_a, diarize_model
    del whisperx
    import torch
    torch.cuda.empty_cache()
    del torch
    gc.collect()
    return result4, diarize_segments
```
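An alternative that, in my experience, reliably returns GPU memory to the driver is to run each file in a short-lived child process: when the process exits, all of its CUDA allocations die with it, with no `del`/`empty_cache` bookkeeping. A minimal sketch (the `run_isolated` helper is my own, not part of WhisperX):

```python
import multiprocessing as mp

def _call_and_put(queue, func, args):
    # Runs inside the child; all CUDA state created here dies when it exits.
    queue.put(func(*args))

def run_isolated(func, *args, start_method="spawn"):
    """Run func(*args) in a child process and return its result.

    "spawn" gives the child a fresh interpreter with no inherited CUDA
    context; GPU memory is released unconditionally when the child exits.
    """
    ctx = mp.get_context(start_method)
    queue = ctx.Queue()
    proc = ctx.Process(target=_call_and_put, args=(queue, func, args))
    proc.start()
    result = queue.get()  # read before join() to avoid blocking on a full pipe
    proc.join()
    return result
```

Each file then becomes `result4, diarize_segments = run_isolated(whisperx_trans, audio_file)`; note that `func` and its return value must be picklable under the "spawn" start method.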
Same issue, except it kicks in directly at transcription and affects all my production files. My error message:
```
Traceback (most recent call last):
  Cell In[33], line 16
    model = whisperx.load_model(my_model, device, compute_type=compute_type, language=my_language)
  File C:\ProgramData\miniconda3\envs\transcription\lib\site-packages\whisperx\asr.py:288 in load_model
    model = model or WhisperModel(whisper_arch,
  File C:\ProgramData\miniconda3\envs\transcription\lib\site-packages\faster_whisper\transcribe.py:130 in __init__
    self.model = ctranslate2.models.Whisper(
RuntimeError: CUDA failed with error out of memory
```
This happens even with a batch size of 2 (!)
Has anyone found a solution?
It was actually working fine on files of the same duration across a batch of files; after a few hours it started giving this error.
This is my use case:
No diarization yet; it fails at the transcription stage. Using Catalan (`--language ca`). Using a GPU (Nvidia).
Running it inside Docker with variations of this main command:

```shell
docker run --rm --name whisperx-test --gpus all -it -v ${PWD}/data:/app -v ${PWD}/language-models:/.cache ghcr.io/jim60105/whisperx:latest -- --model large --language ca --output_format srt blai-catala-curta.mp3
```
- I did a file of 1'15" with `--model medium` => 🟩 Went well
- I did the same file of 1'15" with `--model large` => 🟩 Went well, lasted 39 s
- I did another file of 35'01" with `--model large` => 🟥 Failed (`RuntimeError: CUDA failed with error out of memory`)
- I repeated the same file of 1'15" with `--model large` => 🟩 Went well, lasted 39 s
- I repeated the same file of 35'01" with `--model large` => 🟥 Failed (`RuntimeError: CUDA failed with error out of memory`)
- For this large file of 35'01" I changed to `--model medium` => 🟥 My computer powered off (maybe GPU overheating? I don't really know)
I want to use the best model (even large-v3 if possible) and I don't mind if it runs slower (chunking, running on the CPU, giving it some sort of virtual memory, etc.). I don't mind waiting, but I want the best transcription.
My questions:
- Does it try to load the whole mp3 file into memory and transcribe it in one shot? I thought it would "stream" the audio and transcribe it in portions.
- Is this what you call "batch size"?
- If so, is there an overlapping moving window? How do we know that chunking the input won't cut a word in the middle?
- Is it possible to use some sort of virtual memory? (I prioritize "not hanging" over "speed".)
I am very new to WhisperX and I don't know what to do.
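To make my third question concrete, this is the kind of overlapping chunking I have in mind. None of this is WhisperX API, just an illustration over the kind of 16 kHz sample array that `whisperx.load_audio` returns; the chunk and overlap lengths are made up:

```python
SAMPLE_RATE = 16_000  # Whisper models consume 16 kHz mono audio

def chunk_audio(audio, chunk_seconds=600, overlap_seconds=5):
    """Split a 1-D sample sequence into overlapping chunks.

    The overlap means a word spoken across a boundary appears whole in at
    least one chunk, so nothing is lost to the cut itself.
    """
    size = chunk_seconds * SAMPLE_RATE
    step = (chunk_seconds - overlap_seconds) * SAMPLE_RATE
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + size])
        if start + size >= len(audio):
            break  # the last chunk already reaches the end of the audio
    return chunks
```

Each chunk could then go through a separate `model.transcribe` call, so peak memory depends on the chunk length rather than the file length; segment timestamps would need the chunk's start offset added back.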