Diarization: high memory usage, not using the dedicated GPU
Diarization runs very slowly, uses almost 12 GB of memory, and seemingly does not happen on the GPU (GPU-Z and Windows' Task Manager show conflicting info).
- Latest WhisperX repo
- pyannote.audio 3.1.0
- onnxruntime-gpu
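Before looking at the traceback, it helps to confirm which devices PyTorch can actually see from inside the environment. A minimal check (uses only the standard library; torch is probed optionally so it also runs where torch is missing):

```python
import importlib.util

def cuda_report():
    """Summarise what PyTorch can see; degrades gracefully if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch
    report = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        # Lists every CUDA device torch can use; an iGPU-only setup shows none.
        report["devices"] = [torch.cuda.get_device_name(i)
                             for i in range(torch.cuda.device_count())]
    return report

print(cuda_report())
```

If `cuda_available` comes back `False` here, nothing downstream (WhisperX, pyannote) can use the dedicated card either.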
On interrupting the diarization step, the last call in the traceback is the segment of code below. It points to something happening on the CPU, but I'm not sure if it's the main process. Admittedly, I don't understand Python code very well.
```
(whisperx) PS DIRECTORY> whisperx "Return to The Obra Dinn Ep1.opus" --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4 --task transcribe --lang en --diarize --hf_token TOKEN
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
torchvision is not available - cannot save figures
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\Victo\.cache\torch\whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0. Bad things might happen unless you revert torch to 1.x.
>>Performing transcription...
>>Performing alignment...
>>Performing diarization...
Traceback (most recent call last):
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\USER\anaconda3\envs\whisperx\Scripts\whisperx.exe\__main__.py", line 7, in <module>
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\whisperx\transcribe.py", line 220, in cli
    diarize_segments = diarize_model(input_audio_path, min_speakers=min_speakers, max_speakers=max_speakers)
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\whisperx\diarize.py", line 28, in __call__
    segments = self.model(audio_data, min_speakers=min_speakers, max_speakers=max_speakers)
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\pyannote\audio\core\pipeline.py", line 325, in __call__
    return self.apply(file, **kwargs)
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\pyannote\audio\pipelines\speaker_diarization.py", line 514, in apply
    embeddings = self.get_embeddings(
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\pyannote\audio\pipelines\speaker_diarization.py", line 349, in get_embeddings
    embedding_batch: np.ndarray = self._embedding(
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\pyannote\audio\pipelines\speaker_verification.py", line 709, in __call__
    return embeddings.cpu().numpy()
KeyboardInterrupt
```
Extra testing:
It seems that, at least in my particular setup, the diarization model couldn't access the dedicated GPU instead of the integrated one. Setting my system to use the dedicated GPU for everything ensured that it ran on it.
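For anyone hitting the same thing: an alternative to changing the Windows graphics preference is to pass the device explicitly when calling diarization from Python. This is a rough sketch, not a confirmed fix; the `DiarizationPipeline` keyword arguments are my assumptions based on the version I'm running, so check `whisperx/diarize.py` for the actual signature:

```python
import importlib.util

# "cuda" assumes the dedicated card is the device PyTorch sees as CUDA device 0;
# with multiple CUDA devices, "cuda:1" etc. may be what you want instead.
DIARIZE_KWARGS = {"use_auth_token": "HF_TOKEN", "device": "cuda"}  # HF_TOKEN is a placeholder

# Guarded and commented out so the sketch doesn't crash or download models
# on machines without whisperx / a Hugging Face token.
if importlib.util.find_spec("whisperx") is not None:
    import whisperx
    # diarize_model = whisperx.DiarizationPipeline(**DIARIZE_KWARGS)
    # diarize_segments = diarize_model("audio.opus", min_speakers=1, max_speakers=2)
```

Forcing the device this way sidesteps whatever default Windows or onnxruntime picks for the process.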
Memory usage is still high, and diarization takes much longer than before. However, those could very well be issues with the diarization model itself rather than WhisperX's implementation.
Extra extra testing:
Naively, I had updated to the latest version of PyTorch through pip rather than conda. I'm not sure what the difference is under the hood, since it doesn't throw any errors or warnings when running Whisper. However, it causes diarization to take several hours longer and use 3x the memory.
Creating the environment from scratch, making sure to install PyTorch through conda, yielded the expected results.
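One likely explanation for the pip-vs-conda difference is ending up with a CPU-only wheel: a plain `pip install torch` can pull a build without CUDA support, which runs but falls back to the CPU. A quick way to tell which build is installed (standard library only; torch is probed optionally):

```python
import importlib.util

def torch_build_info():
    """Report the installed PyTorch build; CUDA-less builds show cuda=None."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    # torch.version.cuda is None for CPU-only builds and a version string
    # like "11.8" for CUDA-enabled ones.
    return {"version": torch.__version__, "cuda": torch.version.cuda}

print(torch_build_info())
```

If `cuda` is `None` in a supposedly GPU-enabled environment, reinstalling torch from a CUDA-enabled channel should be the first thing to try.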
A side effect of this seems to be that WhisperX can't be used outside of a conda environment, preventing it from being comfortably integrated into tools like Subtitle Edit, which can now use Whisper and its variants to automatically create subtitles.
Every way of installing and running WhisperX from a default Windows prompt has the same problem of not correctly using the GPU for transcription or diarization. In Subtitle Edit, it returns a single period character instead of a proper transcription, unlike vanilla Whisper or even Faster-Whisper.
I know this repo is more of a proof of concept than a tool intended for mass use, but it consistently yields results that I'm happier with than other Whisper forks. It would be useful for it to work outside of a conda environment.
Torch 2.8.0 with CUDA doesn't seem to be available from any server I could find. I tried installing torch 2.6.0 with CUDA instead, but WhisperX refuses to work with that older version. This is really frustrating.