
[Feature Request] Support M1 Mac's GPU

Open 0x1FFFFF opened this issue 1 year ago • 17 comments

If I pass mps to the device option, it crashes. It would be wonderful if the M1 GPU could be supported.

❯ whisperx assets/test.mp3 --device mps --model large-v2 --vad_filter --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --diarize --hf_token token --language en

torchvision is not available - cannot save figures
Performing VAD...
~~ Transcribing VAD chunk: (01:07:46.006 --> 01:07:50.663) ~~
loc("mps_multiply"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/9e200cfa-7d96-11ed-886f-a23c4f261b56/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<1x1280x3000xf16>' and 'tensor<1xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
[1]    97334 abort      whisperx assets/test.mp3 --device mps --model large-v2 --vad_filter       en
/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

0x1FFFFF avatar Mar 04 '23 06:03 0x1FFFFF
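
The crash appears to be an fp16/fp32 type-promotion failure in PyTorch's MPS backend. A minimal repro sketch, under the assumption that the failing op matches the tensor shapes in the log above (it runs fine on CPU, and should abort the same way on affected PyTorch/macOS versions):

import torch

# The MPS graph error above says tensor<1x1280x3000xf16> and tensor<1xf32>
# are "not broadcast compatible"; a mixed-precision multiply on the MPS
# device is the suspected trigger.
a = torch.randn(1, 1280, 3000, dtype=torch.float16, device="mps")
b = torch.tensor([0.5], dtype=torch.float32, device="mps")

# On affected versions this aborts with "LLVM ERROR: Failed to infer result type(s)".
print((a * b).shape)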

Yeah, I got this error too; it seems we need to wait for an update from PyTorch.

wllbll avatar Mar 22 '23 03:03 wllbll

Hi, I believe PyTorch has support for most of the functions now. Plus, it can be run with the env var PYTORCH_ENABLE_MPS_FALLBACK=1 so that the functions that aren't supported yet fall back on the CPU.

I'm running pyannote and other projects with PyTorch compiled with support for mps, so this should also be doable.

skye-repos avatar Jul 04 '23 17:07 skye-repos
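
For reference, a minimal sketch of the availability check and CPU fallback described above; note that the env var must be set before torch is imported, and whisperx's own device handling may still reject "mps":

import os

# Must be set before importing torch so unsupported ops fall back to the CPU.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

# Standard availability check for the Metal backend in recent PyTorch builds.
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.ones(3, device=device) * 2
print(x, x.device)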

@Harith163 That is correct. PyTorch has implemented "mps" (Metal Performance Shaders) for at least a year now, and OpenAI's Whisper supports "mps" as well, but faster-whisper, which WhisperX uses, apparently only supports "cpu" and "cuda".

I myself get "unsupported device mps", here is the error:

--> 128 self.model = ctranslate2.models.Whisper(
    129     model_path,
    130     device=device,
    131     device_index=device_index,
    132     compute_type=compute_type,
    133     intra_threads=cpu_threads,
    134     inter_threads=num_workers,
    135 )

For me, with a Mac M1, "cpu" is extremely slow, to the point that I have not been able to get a proper transcription.

Any workaround for this issue? I believe this is essential for Mac users.

Thanks :)

Herb-sh avatar Oct 15 '23 10:10 Herb-sh
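
For context, the failing call above is the standard faster-whisper entry point. A sketch (model name and audio path are placeholders) that runs on "cpu" but raises ValueError: unsupported device mps as soon as device="mps" is passed through:

from faster_whisper import WhisperModel

# CTranslate2 underneath only accepts "cpu", "cuda", or "auto";
# device="mps" is rejected at construction time.
model = WhisperModel("large-v2", device="cpu", compute_type="int8")

segments, info = model.transcribe("assets/test.mp3")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")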


Whisper.cpp is fast on Apple Silicon ("Plain C/C++ implementation without dependencies" … "optimized via ARM NEON, Accelerate framework, Metal and Core ML"). However, I believe it only supports very rudimentary diarization currently.

Ideally, WhisperX's solutions for diarization, etc., could be made to work in the fashion of Whisper.cpp.

7k50 avatar Oct 18 '23 23:10 7k50

> Ideally, WhisperX's solutions for diarization, etc., could be made to work in the fashion of Whisper.cpp.

That's only in the ideal case: whisper.cpp's creator showed interest in WhisperX's killer features but stated they are not coming any time soon.

I'd rather fix whisperx to work better on M1/Apple Silicon.

purpshell avatar Oct 28 '23 04:10 purpshell

Just wanted to second this. I love whisperx on my PC, but on Mac it is just so slow.

It's resulted in fragmentation: if I want my script to be universal, I have to look elsewhere. I really wish this could be supported.

1Dbcj avatar Dec 20 '23 23:12 1Dbcj

Any progress on this so far?

AdrienLF avatar Apr 09 '24 16:04 AdrienLF

Any news?

thibaudbrg avatar Jul 19 '24 15:07 thibaudbrg

Any news?

colin4k avatar Sep 09 '24 12:09 colin4k

It could be that ctranslate2 needs to be built on your macOS system with WITH_ACCELERATE set to ON. The owner of ctranslate2 says it may be possible here, in reference to this comment.

aredwine3 avatar Sep 11 '24 21:09 aredwine3
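
One way to check what the installed CTranslate2 build actually supports (a sketch using CTranslate2's capability probes; the commented output is an assumption for a stock macOS wheel):

import ctranslate2

# A stock pip wheel on Apple Silicon reports no CUDA devices, and "mps" is
# not an accepted device string at all.
print(ctranslate2.get_cuda_device_count())             # expect 0
print(ctranslate2.get_supported_compute_types("cpu"))  # e.g. {'float32', 'int8', ...}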

Thanks for the tip! ctranslate2 install instructions are here: https://opennmt.net/CTranslate2/installation.html#install-from-sources

On my Mac, I needed to use the following for the cmake step:

cmake -DWITH_ACCELERATE=ON -DOPENMP_RUNTIME=COMP -DWITH_MKL=OFF ..

After compiling and installing ctranslate2 myself in a fresh pyenv with MPS enabled on my M1 Mac, I installed whisperx, which used my existing ctranslate2 install:

Requirement already satisfied: ctranslate2<5,>=4.0 in /Users/ryan/.pyenv/versions/3.12.4/envs/ctranslate-build/lib/python3.12/site-packages (from faster-whisper==1.0.0->whisperx==3.1.1) (4.4.0)

I then ran with --device mps but still got:

Traceback (most recent call last):
  File "/Users/ryan/.pyenv/versions/ctranslate-build/bin/whisperx", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/ryan/.pyenv/versions/3.12.4/envs/ctranslate-build/lib/python3.12/site-packages/whisperx/transcribe.py", line 170, in cli
    model = load_model(model_name, device=device, device_index=device_index, download_root=model_dir, compute_type=compute_type, language=args['language'], asr_options=asr_options, vad_options={"vad_onset": vad_onset, "vad_offset": vad_offset}, task=task, threads=faster_whisper_threads)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/.pyenv/versions/3.12.4/envs/ctranslate-build/lib/python3.12/site-packages/whisperx/asr.py", line 288, in load_model
    model = model or WhisperModel(whisper_arch,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/.pyenv/versions/3.12.4/envs/ctranslate-build/lib/python3.12/site-packages/faster_whisper/transcribe.py", line 133, in __init__
    self.model = ctranslate2.models.Whisper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: unsupported device mps

ryanfb avatar Sep 12 '24 12:09 ryanfb

After looking at how WITH_ACCELERATE and devices are implemented in CTranslate2 itself, it seems that enabling MPS allows the "cpu" device backend to use MPS acceleration; you can't yet use "mps" as a separate device. So after compiling/installing CTranslate2 from source as above, then installing whisperx, invoke whisperx with:

--device cpu --compute_type float32

And it should use MPS acceleration for matrix multiplication.

NB: I only get about a 10-15% reduction in overall processing time in my experimentation on an M1 Max. It's possible that other chips get better results, that the matrix multiplication operations offloaded to MPS aren't bottlenecking the overall process, or that the CPU/MPS handoff negates some of the possible performance gains.

You may also want to instead install CTranslate2 from source with the cmake step:

cmake -DCMAKE_OSX_ARCHITECTURES=arm64 -DWITH_ACCELERATE=ON -DWITH_MKL=OFF -DOPENMP_RUNTIME=COMP -DWITH_RUY=ON ..

This should give you a fallback that lets you do --compute_type int8 matrix multiplication on the CPU without MPS (though I haven't had the time to create a new pyenv to test this yet).

ryanfb avatar Sep 12 '24 20:09 ryanfb
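
The Python-API equivalent of that invocation would be roughly the following sketch (arguments per the whisperx README; the audio path is a placeholder):

import whisperx

# With an Accelerate-enabled CTranslate2 build, the "cpu" device still gets
# Apple's BLAS for the heavy matrix multiplications.
model = whisperx.load_model("large-v2", device="cpu", compute_type="float32")

audio = whisperx.load_audio("assets/test.mp3")
result = model.transcribe(audio, batch_size=8)
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])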


Building CTranslate2 from source as described above, I got this error:

[ 38%] Linking CXX shared library libctranslate2.dylib
Undefined symbols for architecture arm64:
  "std::exception_ptr::__from_native_exception_pointer(void*)", referenced from:
      std::__1::promise<ctranslate2::TranslationResult>::~promise() in buffered_translation_wrapper.cc.o
      std::__1::promise<ctranslate2::EncoderForwardOutput>::~promise() in encoder.cc.o
      std::__1::promise<ctranslate2::GenerationResult>::~promise() in generator.cc.o
      std::__1::promise<ctranslate2::ScoringResult>::~promise() in generator.cc.o
      std::__1::promise<ctranslate2::StorageView>::~promise() in generator.cc.o
      std::__1::promise<ctranslate2::StorageView>::~promise() in wav2vec2.cc.o
      std::__1::promise<ctranslate2::StorageView>::~promise() in whisper.cc.o
      ...
  "___cxa_init_primary_exception", referenced from:
      std::__1::promise<ctranslate2::TranslationResult>::~promise() in buffered_translation_wrapper.cc.o
      std::__1::promise<ctranslate2::EncoderForwardOutput>::~promise() in encoder.cc.o
      std::__1::promise<ctranslate2::GenerationResult>::~promise() in generator.cc.o
      std::__1::promise<ctranslate2::ScoringResult>::~promise() in generator.cc.o
      std::__1::promise<ctranslate2::StorageView>::~promise() in generator.cc.o
      std::__1::promise<ctranslate2::StorageView>::~promise() in wav2vec2.cc.o
      std::__1::promise<ctranslate2::StorageView>::~promise() in whisper.cc.o
      ...
ld: symbol(s) not found for architecture arm64
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [libctranslate2.4.4.0.dylib] Error 1
make[1]: *** [CMakeFiles/ctranslate2.dir/all] Error 2
make: *** [all] Error 2

colin4k avatar Sep 13 '24 06:09 colin4k

I found the diarization function also needs to be changed to support MPS. In transcribe.py, change the following code:

diarize_model = DiarizationPipeline(use_auth_token=hf_token, device=device)

to:

diarize_model = DiarizationPipeline(use_auth_token=hf_token, device='mps')

colin4k avatar Sep 13 '24 12:09 colin4k
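
A sketch of the resulting split, assuming the hardcoded "mps" edit above: the ASR step stays on "cpu" (CTranslate2 has no MPS device), while the pyannote-based diarization runs on PyTorch's MPS backend ("hf_token" and the audio path are placeholders):

import whisperx

audio = whisperx.load_audio("assets/test.mp3")

# ASR via CTranslate2: CPU only on macOS.
model = whisperx.load_model("large-v2", device="cpu", compute_type="float32")
result = model.transcribe(audio)

# Diarization via pyannote/PyTorch, which can target the "mps" device directly.
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_token", device="mps")
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
print(result["segments"][0])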

Was anyone able to run it on Mac? @colin4k, with your latest comment on changing the diarization function, were you able to make it work?

estcap2 avatar Sep 17 '24 14:09 estcap2


I can run it on Mac, but transcription is very slow.

colin4k avatar Sep 18 '24 11:09 colin4k


And after I changed the diarization function code as above, diarization became much faster.

colin4k avatar Sep 18 '24 11:09 colin4k