
(WSL2) RuntimeError: CUDA failed with error out of memory

Open coder543 opened this issue 2 years ago • 24 comments

I have seen some other talk of memory leaks (#390), but I'm having a more sporadic, shorter-term issue.

I've experienced this on both an RTX 4070 with 12GB VRAM and an RTX 3090 with 24GB VRAM.

Traceback (most recent call last):
  File "/home/coder/.local/bin/whisper-ctranslate2", line 8, in <module>
    sys.exit(main())
  File "/home/coder/.local/lib/python3.8/site-packages/src/whisper_ctranslate2/whisper_ctranslate2.py", line 527, in main
    result = transcribe.inference(
  File "/home/coder/.local/lib/python3.8/site-packages/src/whisper_ctranslate2/transcribe.py", line 164, in inference
    for segment in segments:
  File "/home/coder/.local/lib/python3.8/site-packages/faster_whisper/transcribe.py", line 384, in generate_segments
    ) = self.generate_with_fallback(encoder_output, prompt, tokenizer, options)
  File "/home/coder/.local/lib/python3.8/site-packages/faster_whisper/transcribe.py", line 584, in generate_with_fallback
    result = self.model.generate(
RuntimeError: CUDA failed with error out of memory

Sometimes, while transcribing, I will be using well under half of the GPU's VRAM, and whisper-ctranslate2 will suddenly fail with this error message. I suppose it is attempting to make a single, super large allocation, but it is unclear why.

The traceback is pointing at faster_whisper, which is why I opened the issue here, and not on whisper-ctranslate2, but I suppose the issue could be over there somehow.

I have used Whisper a lot since it was released, and I have used it via the reference implementation, this implementation, and whisper.cpp. This implementation blows the other two out of the water in terms of performance, but it is frustrating to see it fail during transcription for no reason at all.

Neither of the other two Whisper implementations that I've used have ever had this problem even a single time, as best as I can recall.

For additional context, I've been doing all of this under WSL2 on Windows 11. The memory usage of even the large-v2 model is nowhere near enough to justify running out of VRAM on a 12GB card, especially when Windows shows that VRAM usage never went above roughly 5GB.

This issue occurs sporadically. I can sometimes get lucky and process many hours of audio without this happening. Other times, a particular file will trip it up nearly continuously.

Today, I was transcribing this video since I like to have better subtitles when watching long-form technical content. On one particularly bad attempt, the medium.en model crashed after transcribing the first 2 minutes of the video. On other attempts, I've seen the large-v2 model get through maybe half of the video before crashing. It's not consistent, and I'm not doing anything else of note on this computer when these issues occur. I was able to run the small.en model on this video with faster_whisper just fine, and after a few attempts, medium.en made it through without crashing.

As previously stated, I can use the GPU to transcribe this kind of content with any other implementation, and I don't run into the issue. I've also used plenty of other large ML models on this computer just fine, including llama2, SDXL, and more. There's nothing obviously wrong with my computer or software.

I haven't tested any other faster-whisper applications yet, but I've been intending to try whisperX, which looks promising. Unfortunately, this issue is sporadic, as I've said, so unless whisperX happened to crash immediately under testing, it would be very hard to know for sure whether the issue is absent there.

I really don't understand why this is happening, and why no one else is talking about this (which means they may not be encountering it, I suppose). There could be something wrong with my computer, but I have no clue how that would manifest in this one specific way.

coder543 avatar Aug 28 '23 23:08 coder543

What transcription options are you using?

guillaumekln avatar Aug 29 '23 06:08 guillaumekln

The only options I'm passing to whisper-ctranslate2 are --language en and --model large-v2 (or whichever model I'm using at the time).

coder543 avatar Aug 29 '23 14:08 coder543

Not sure what the issue is. 12GB should be plenty of room to run the large-v2 model with the default settings, even considering the possible memory spikes (which can happen when the Whisper model generates garbage outputs).

I have used Whisper a lot since it was released, and I have used it via the reference implementation, this implementation, and whisper.cpp.

Did you also run the reference implementation in WSL2?

I suggest trying this other faster-whisper CLI which should be easy to setup on the Windows host (no need for WSL2):

https://github.com/Purfview/whisper-standalone-win

guillaumekln avatar Aug 29 '23 16:08 guillaumekln

No one has reported unusual "out of memory" errors on my repo; so far they've only appeared in the expected scenarios, like a small-VRAM GPU running a large model.

Btw, you can try int8/int8_float32 compute type to reduce memory usage.
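
For reference, this is roughly how a reduced compute type is selected through the faster-whisper Python API; the model size, device, and audio path below are illustrative, not taken from this thread (whisper-ctranslate2 and the standalone build should both expose this as a --compute_type option):

from faster_whisper import WhisperModel

# "int8" (or "int8_float32") stores the model weights quantized to 8 bits,
# which reduces VRAM usage; "large-v2" and device="cuda" are example settings.
model = WhisperModel("large-v2", device="cuda", compute_type="int8")

segments, info = model.transcribe("audio.mp3", language="en")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")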

Purfview avatar Aug 29 '23 20:08 Purfview

So, I've taken a look at whisper-standalone-win now.

On the upside, it does seem to work without the strange CUDA OOM.

However, in one transcription test of a 15 minute video that I ran twice on each program, it is actually 20-30% slower than whisper-ctranslate2 running under WSL2, when whisper-ctranslate2 isn't crashing. I was surprised at the difference. This is running with identical flags between them both (--language en --model large-v2).

I'm also unable to find the source code for whisper-standalone-win, which is odd when it's likely a very thin wrapper around faster-whisper. The fact that the instructions say to run it as admin also feels weird to me. When I tested it, I ran it without admin, and it worked fine. I'm used to doing everything via Linux, so maybe this is just how things are normally done on Windows.

coder543 avatar Aug 31 '23 14:08 coder543

I have researched this CUDA OOM problem a number of times, but I may have actually found a relevant link today: https://github.com/microsoft/WSL/issues/8447#issuecomment-1235512935

It sounds like pin_memory=False is necessary in PyTorch to prevent this error from randomly occurring under WSL2.
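
For context, this is where that flag lives in PyTorch: a minimal sketch of a generic DataLoader, not something faster-whisper itself runs (the dataset here is a stand-in):

import torch
from torch.utils.data import DataLoader, TensorDataset

# pin_memory=True asks PyTorch to stage batches in page-locked (pinned) host
# memory for faster host-to-GPU copies; the linked WSL issue suggests leaving
# it off under WSL2.
dataset = TensorDataset(torch.randn(8, 16))
loader = DataLoader(dataset, batch_size=4, pin_memory=False)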

coder543 avatar Aug 31 '23 14:08 coder543

However, in one transcription test of a 15 minute video that I ran twice on each program, it is actually 20-30% slower than whisper-ctranslate2 running under WSL2

Standalone Faster-Whisper's defaults are not the same as whisper-ctranslate2's defaults, plus there is some start-up delay with the frozen exe.

I ran it without admin, and it worked fine.

If you don't copy the exe to weird places, then you don't need admin.

It sounds like pin_memory=False is necessary in PyTorch to prevent this error...

But faster-whisper doesn't use PyTorch.

Purfview avatar Aug 31 '23 17:08 Purfview

But faster-whisper doesn't use PyTorch.

But it may do the same memory pinning that PyTorch would do with pin_memory=True.

coder543 avatar Aug 31 '23 17:08 coder543

CTranslate2, which is the backend for faster-whisper, does not allocate pinned memory. However, I don't know whether the NVIDIA libraries cuBLAS and cuDNN internally allocate pinned memory or not. I don't think so but I'm not sure.

What cuBLAS and cuDNN versions are installed in your WSL?
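
One quick way to check which cuDNN the process actually sees from inside WSL (the shared-library name here is an assumption and may need a different suffix depending on the installed major version):

import ctypes

# Loads the cuDNN runtime and asks for its version number,
# e.g. 8905 means cuDNN 8.9.5.
cudnn = ctypes.CDLL("libcudnn.so.8")
cudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN version:", cudnn.cudnnGetVersion())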

guillaumekln avatar Sep 01 '23 08:09 guillaumekln

I am encountering the same problem as previously mentioned, and it's becoming increasingly frustrating. The library was operational and without issues a few days ago, but it has since started failing with a CUDA out of memory error.

Environment & Setup Details:

  • WSL2 Distributions Tested: Debian, Ubuntu (22.04, 20.04, 18.04)
  • The Only Actions Performed: Applied Windows updates and ran apt upgrade from time to time.
  • CUDA Versions Tested: 11.8, 11.7, 11.6, 11.5
  • CuDNN Versions Tested: 8.9.5.30, 8.9.5.29, 8.9.4.25
  • Installation Methods: Ranged from system-wide official installations to pip installations.
  • Python Versions Tested: 3.8, 3.9, 3.10, 3.11
  • Precision Tested: int8 and float16
  • Model Used: medium_en
  • GPU: RTX 3060 6GB (with over 50% VRAM consistently available)
  • RAM: 32GB

Despite the numerous configurations and environments I've tested, the issue remains unresolved. Any assistance or insights would be greatly appreciated.

lashahub avatar Oct 31 '23 23:10 lashahub

Maybe it's a WSL2 issue; try reporting it there -> https://github.com/Microsoft/WSL

Btw, why do you run it in WSL2?

Purfview avatar Nov 01 '23 00:11 Purfview

Thanks for the advice. It's a part of a larger script I'm coding with a lot of dependencies that are much harder to manage on Windows.

lashahub avatar Nov 03 '23 00:11 lashahub

@coder543 I just fixed the issue (tested it 20 times on a 1-hour audio file) by updating WSL2 to the pre-release version 2.0.7.0 via wsl --update --pre-release from the Windows command prompt

lashahub avatar Nov 03 '23 01:11 lashahub

@lashahub Nice. I ended up moving all of this machine learning stuff over to a different computer running Ubuntu.

coder543 avatar Nov 03 '23 01:11 coder543

@coder543 I felt the same pain during the last week trying to find a solution, but since my PC can't natively run Linux (thanks to Intel), I had to find a way out. @guillaumekln @Purfview Do you think it would be appropriate to add a note about the fix to the readme? I know it's a WSL thing, but of all the code I've been running on CUDA, only faster-whisper was problematic.

lashahub avatar Nov 03 '23 01:11 lashahub

Imo, the issue is not significant enough to deserve a place on the main page. And my repo's users don't use WSL...

@coder543 Could you edit the topic to clarify that it's about WSL2, so people can find it easily?

Purfview avatar Nov 03 '23 01:11 Purfview

FWIW I'm also getting the same error on Ubuntu (inside Docker) running on a T4 using large-v2 using float16.

Docker image: FROM nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    binary_file, language="en", beam_size=5, word_timestamps=True, vad_filter=True
)

I could be doing something hilariously wrong but posting just in case others happen to run into the same thing.

TedTimbrell avatar Nov 06 '23 18:11 TedTimbrell

FWIW I'm also getting the same error on Ubuntu (inside Docker) running on a T4 using large-v2 using float16.

Docker image: FROM nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    binary_file, language="en", beam_size=5, word_timestamps=True, vad_filter=True
)

I could be doing something hilariously wrong but posting just in case others happen to run into the same thing.

Getting this same issue on an Ubuntu Docker system. Did you find any solution for it?

SharayuChoudhari06 avatar Jun 03 '24 10:06 SharayuChoudhari06

You can try adding condition_on_previous_text=False to the transcribe options and check whether it runs with that configuration.
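
For reference, that option goes into the transcribe() call; the model size, device, and audio path here are illustrative:

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# condition_on_previous_text=False stops the previous window's output from
# being fed back as the decoding prompt, which can curb runaway repetition.
segments, info = model.transcribe(
    "audio.mp3",
    language="en",
    condition_on_previous_text=False,
)
for segment in segments:
    print(segment.text)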

kalmik avatar Jun 03 '24 11:06 kalmik

I'm seeing this issue, too. I think it's a WSL2 problem, so I posted it on their repo: https://github.com/microsoft/WSL/issues/8447#issuecomment-2334035165

erklem avatar Sep 06 '24 13:09 erklem

I got this error on every model. Only small.en produced a different error message - and I should note that I limited my pagefile size to 2GB yesterday, since it had grown to over 21GB.

Here's what small.en reported - ImportError: DLL load failed while importing _core: The paging file is too small for this operation to complete.

danested avatar Oct 16 '24 16:10 danested

I have the same problem running it with WSL2, but it works if I use whisper-standalone-win (the Windows exe) on the same machine. So it seems to be a WSL2 problem, as some of you have already said.

in my case:

Traceback (most recent call last):
  File "/root/infer.py", line 6, in <module>
    for segment in segments:
  File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 1189, in restore_speech_timestamps
    for segment in segments:
  File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 594, in generate_segments
    ) = self.generate_with_fallback(encoder_output, prompt, tokenizer, options)
  File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 884, in generate_with_fallback
    result = self.model.generate(
RuntimeError: CUDA failed with error out of memory

I also got the same problem using the https://github.com/fedirz/faster-whisper-server Docker image on WSL2, with various models and different options.

lluisd avatar Nov 08 '24 13:11 lluisd

@coder543 I just fixed the issue (tested it 20 times on a 1-hour audio file) by updating WSL2 to the pre-release version 2.0.7.0 via wsl --update --pre-release from the Windows command prompt

Two years later, this still worked for me. That updated my WSL to 2.6.1.0 and the “out of memory” errors stopped.

Edit: Never mind -- turns out that I was testing with a shorter audio stream and it managed to complete without running out of memory. Longer audio streams still cause the "out of memory" error. 😞

jonheese avatar Sep 02 '25 21:09 jonheese

Running on Linux with an RX 7900 XT (20GB of VRAM), with an Ollama LLM loaded and tiny.en for faster-whisper, and about 6GB of VRAM to spare, I get the same error. I tried to work around it by running

import gc
import torch

# Free Python garbage and release cached CUDA memory held by PyTorch.
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

after each request

(I took it from some other GitHub issues.) It seemed to work a bit better, but I am still struggling to make it work with RealtimeSTT, which has faster-whisper as a dependency.

tk-1001 avatar Sep 10 '25 14:09 tk-1001