(WSL2) RuntimeError: CUDA failed with error out of memory
I have seen some other talk of memory leaks (#390), but I'm having a more sporadic, shorter term issue.
I've experienced this on both an RTX 4070 with 12GB VRAM and an RTX 3090 with 24GB VRAM.
Traceback (most recent call last):
File "/home/coder/.local/bin/whisper-ctranslate2", line 8, in <module>
sys.exit(main())
File "/home/coder/.local/lib/python3.8/site-packages/src/whisper_ctranslate2/whisper_ctranslate2.py", line 527, in main
result = transcribe.inference(
File "/home/coder/.local/lib/python3.8/site-packages/src/whisper_ctranslate2/transcribe.py", line 164, in inference
for segment in segments:
File "/home/coder/.local/lib/python3.8/site-packages/faster_whisper/transcribe.py", line 384, in generate_segments
) = self.generate_with_fallback(encoder_output, prompt, tokenizer, options)
File "/home/coder/.local/lib/python3.8/site-packages/faster_whisper/transcribe.py", line 584, in generate_with_fallback
result = self.model.generate(
RuntimeError: CUDA failed with error out of memory
Sometimes, while transcribing, I will be using well under half of the GPU's VRAM, and whisper-ctranslate2 will suddenly fail with this error message. I suppose it is attempting to make a single, super large allocation, but it is unclear why.
The traceback is pointing at faster_whisper, which is why I opened the issue here, and not on whisper-ctranslate2, but I suppose the issue could be over there somehow.
I have used Whisper a lot since it was released, and I have used it via the reference implementation, this implementation, and whisper.cpp. This implementation blows the other two out of the water in terms of performance, but it is frustrating to see it fail during transcription for no apparent reason.
Neither of the other two Whisper implementations I've used has ever had this problem, as best as I can recall.
For additional context, I've been doing all of this under WSL2 under Windows 11. The memory usage of even the large-v2 model is nowhere near enough to exhaust the VRAM of a 12GB card, especially when Windows shows that VRAM usage never went above roughly 5GB.
This issue occurs sporadically. I can sometimes get lucky and process many hours of audio without this happening. Other times, a particular file will trip it up nearly continuously.
Today, I was transcribing this video since I like to have better subtitles when watching long form technical content. On one particularly bad attempt, the medium.en model crashed after transcribing the first 2 minutes of the video. On other attempts, I've seen the large-v2 model get through maybe half of the video before crashing. It's not consistent, and I'm not doing anything else of any note on this computer when these issues occur. I was able to use faster_whisper on the small.en model on this video fine. After a few attempts, medium.en made it through without crashing.
As previously stated, I can use the GPU to transcribe this kind of content with any other implementation, and I don't run into the issue. I've also used plenty of other large ML models on this computer just fine, including llama2, SDXL, and more. There's nothing obviously wrong with my computer or software.
I haven't tested any other faster-whisper applications yet, but I've been intending to try whisperX, which looks promising. Unfortunately, this issue is sporadic, as I've said, so unless whisperX happened to crash immediately under testing, it would be very hard to know for sure whether the issue is absent there.
I really don't understand why this is happening, and why no one else is talking about this (which means they may not be encountering it, I suppose). There could be something wrong with my computer, but I have no clue how that would manifest in this one specific way.
What transcription options are you using?
The only options I'm passing to whisper-ctranslate2 are --language en and --model large-v2 (or whichever model I'm using at the time).
Not sure what the issue is. 12GB should be plenty of room to run the large-v2 model with the default settings, even considering the possible memory spikes (which can happen when the Whisper model generates garbage outputs).
I have used Whisper a lot since it was released, and I have used it via the reference implementation, this implementation, and whisper.cpp.
Did you also run the reference implementation in WSL2?
I suggest trying this other faster-whisper CLI which should be easy to set up on the Windows host (no need for WSL2):
https://github.com/Purfview/whisper-standalone-win
No one has reported unusual "out of memory" errors on my repo; so far only in expected scenarios, like a small-VRAM card with a large model.
Btw, you can try int8/int8_float32 compute type to reduce memory usage.
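A minimal sketch of that suggestion, assuming the standard faster_whisper Python API (the audio path is a placeholder):

from faster_whisper import WhisperModel

# int8 / int8_float32 quantize the model weights, which noticeably reduces VRAM usage
# compared to float16 at a small accuracy cost.
model = WhisperModel("large-v2", device="cuda", compute_type="int8_float32")

segments, info = model.transcribe("audio.mp3", language="en")
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))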
So, I've taken a look at whisper-standalone-win now.
On the upside, it does seem to work without the strange CUDA OOM.
However, in one transcription test of a 15 minute video that I ran twice on each program, it is actually 20-30% slower than whisper-ctranslate2 running under WSL2, when whisper-ctranslate2 isn't crashing. I was surprised at the difference. This is running with identical flags between them both (--language en --model large-v2).
I'm also unable to find the source code for whisper-standalone-win, which is odd when it's likely a very thin wrapper around faster-whisper. The fact that the instructions also say to run it as admin feels weird to me. When I tested it, I ran it without admin, and it worked fine. I'm used to doing everything via Linux, so maybe this is just how things are normally done on Windows.
I have researched this CUDA OOM problem a number of times, and I may have actually found a relevant link today: https://github.com/microsoft/WSL/issues/8447#issuecomment-1235512935
It sounds like pin_memory=False is necessary in PyTorch to prevent this error from randomly occurring under WSL2.
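For context, pin_memory is an option on PyTorch's DataLoader; the sketch below is purely illustrative of the setting the linked WSL issue refers to (as noted further down in this thread, faster-whisper itself does not use PyTorch, and the dataset here is a dummy placeholder):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for whatever a PyTorch-based pipeline would feed the GPU.
dataset = TensorDataset(torch.randn(8, 80, 3000))

# pin_memory=True stages batches in page-locked host memory for faster host-to-GPU copies;
# the linked WSL issue suggests keeping it off (the default) under WSL2.
loader = DataLoader(dataset, batch_size=1, pin_memory=False)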
However, in one transcription test of a 15 minute video that I ran twice on each program, it is actually 20-30% slower than whisper-ctranslate2 running under WSL2
Standalone Faster-Whisper's defaults are not the same as whisper-ctranslate2's defaults, plus there is some start-up delay with a frozen exe.
I ran it without admin, and it worked fine.
If you don't copy the exe to weird places, then you don't need admin.
It sounds like pin_memory=False is necessary in Pytorch to prevent this error...
But faster-whisper doesn't use PyTorch.
But faster-whisper doesn't use PyTorch.
But it may do the same memory pinning that PyTorch would do with pin_memory=True.
CTranslate2, which is the backend for faster-whisper, does not allocate pinned memory. However, I don't know whether the NVIDIA libraries cuBLAS and cuDNN internally allocate pinned memory or not. I don't think so, but I'm not sure.
What cuBLAS and cuDNN versions are installed in your WSL?
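For readers unfamiliar with the term, here is what explicitly pinned (page-locked) host memory looks like in PyTorch terms, on a machine with a working CUDA setup; this is purely illustrative and, per the comment above, not something CTranslate2 does:

import torch

# Allocate a 1 MiB host buffer, then pin it (page-lock it) so the GPU can copy from it directly.
buf = torch.empty(1 << 20, dtype=torch.uint8)
pinned = buf.pin_memory()
print(pinned.is_pinned())  # True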
I am encountering the same problem as previously mentioned, and it's becoming increasingly frustrating. The library was operational and without issues a few days ago, but it has since started failing with a CUDA out of memory error.
Environment & Setup Details:
- WSL2 Distributions Tested: Debian, Ubuntu (22.04, 20.04, 18.04)
- The Only Actions Performed: Applied Windows updates and executed apt upgrade commands from time to time.
- CUDA Versions Tested: 11.8, 11.7, 11.6, 11.5
- CuDNN Versions Tested: 8.9.5.30, 8.9.5.29, 8.9.4.25
- Installation Methods: Ranged from system-wide official installations to pip installations.
- Python Versions Tested: 3.8, 3.9, 3.10, 3.11
- Precision Tested: int8 and float16
- Model Used: medium_en
- GPU: RTX 3060 6GB (with over 50% VRAM consistently available)
- RAM: 32GB
Despite the numerous configurations and environments I've tested, the issue remains unresolved. Any assistance or insights would be greatly appreciated.
Maybe it's a WSL2 issue; try reporting it there -> https://github.com/Microsoft/WSL
Btw, why do you run it in WSL2?
Thanks for the advice. It's a part of a larger script I'm coding with a lot of dependencies that are much harder to manage on Windows.
@coder543 I just fixed the issue (tested it 20 times on a 1 hour audio file) by updating WSL2 version to the pre-released 2.0.7.0 by running wsl --update --pre-release from Windows's command prompt
@lashahub Nice. I ended up moving all of this machine learning stuff over to a different computer running Ubuntu.
@coder543 I felt the same pain during the last week trying to find a solution, but since my PC can't natively run Linux (thanks to Intel) I had to find a way out. @guillaumekln @Purfview Do you think it would be appropriate to add a comment about the fix in the readme? I know it's a WSL thing, but of all the code I've been running on CUDA, only faster-whisper was problematic.
Imo, the issue is not significant enough to have a place on the main page. And my repo's users don't use WSL...
@coder543 Could you edit the title and clarify that it's about WSL2, so people can easily find it?
FWIW I'm also getting the same error on Ubuntu (inside Docker) running on a T4 using large-v2 using float16.
Docker image: FROM nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04
from faster_whisper import WhisperModel
# large-v2 on the T4 with float16, as described above
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    binary_file, language="en", beam_size=5, word_timestamps=True, vad_filter=True
)
I could be doing something hilariously wrong but posting just in case others happen to run into the same thing.
FWIW I'm also getting the same error on Ubuntu (inside Docker) running on a T4 using large-v2 using float16.
Getting this same issue on an Ubuntu Docker system. Did you find any solution for it?
You can try adding condition_on_previous_text=False to the transcribe options and check if it can run with that configuration.
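A minimal sketch of that suggestion against the snippet above (the model setup and audio path are placeholders):

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.mp3",
    language="en",
    beam_size=5,
    vad_filter=True,
    # Skip feeding the previous window's text back in as the prompt; this also tends
    # to reduce the runaway decoding that can cause memory spikes.
    condition_on_previous_text=False,
)
for segment in segments:
    print(segment.text)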
I'm seeing this issue, too. I think it's a WSL2 problem, so I posted it on their repo: https://github.com/microsoft/WSL/issues/8447#issuecomment-2334035165
I got this error on every model. Only small.en produced a different error message. I should note that I limited my pagefile size to 2GB yesterday, since it had grown to over 21GB.
Here's what small.en reported - ImportError: DLL load failed while importing _core: The paging file is too small for this operation to complete.
I have the same problem running it under WSL2, but it works if I use whisper-standalone-win (the Windows exe) on the same machine. So it seems to be a WSL2 problem, as some of you already said.
In my case:
Traceback (most recent call last):
File "/root/infer.py", line 6, in <module>
for segment in segments:
File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 1189, in restore_speech_timestamps
for segment in segments:
File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 594, in generate_segments
) = self.generate_with_fallback(encoder_output, prompt, tokenizer, options)
File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 884, in generate_with_fallback
result = self.model.generate(
RuntimeError: CUDA failed with error out of memory
I also got the same problem using the https://github.com/fedirz/faster-whisper-server Docker image on WSL2, with various models and different options.
@coder543 I just fixed the issue (tested it 20 times on a 1 hour audio file) by updating WSL2 version to the pre-released 2.0.7.0 by running wsl --update --pre-release from Windows's command prompt
Two years later, this still worked for me. That updated my WSL to 2.6.1.0 and the "out of memory" errors stopped.
Edit: Never mind -- turns out that I was testing with a shorter audio stream and it managed to complete without running out of memory. Longer audio streams still cause the "out of memory" error. 😞
Running on Linux with an RX 7900 XT (20GB of VRAM), using tiny.en for faster-whisper with an Ollama LLM loaded and 6GB of VRAM to spare, I have the same error. I was trying to resolve it by running
import gc
import torch

# Force Python GC, then release any memory held by PyTorch's CUDA caching allocator.
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
after each request (I took it from some other GitHub issues). It seemed to work a bit better, but I am still struggling to make it work with RealtimeSTT, which has faster-whisper as a dependency.