
Diarization too slow

Open MitPitt opened this issue 1 year ago • 19 comments

1 hour 30 minutes of audio has been processing for over 1 hour in the diarization... stage. I'm using an RTX 3090.

I'm guessing --batch_size doesn't affect pyannote. A setting for pyannote's batch size would be very nice to have.

MitPitt avatar May 25 '23 12:05 MitPitt

I'm having the same issue. From what I'm reading, the pyannote/speaker-diarization model is slow, but word-level segmentation may be slowing it down even more. I assume some factors impact this more than others (I think the number of speakers or the number of segments influences it the most, but that's just a guess). Looking at hardware usage during runtime, it looks like it's batching either one segment at a time or one word at a time, which would make sense, since we're chasing word-level timestamps with whisperX. The pyannote model reports a 2.5% real-time factor, which has definitely NOT been my experience, but may be the case if you ran the raw audio through without segmentation. Maybe there's a way to count individual calls to the GPU to verify. I haven't found a workaround yet; let me know if you find something out.

jzeller2011 avatar May 25 '23 15:05 jzeller2011

I have the same issue.

moritzbrantner avatar May 25 '23 23:05 moritzbrantner

https://github.com/m-bain/whisperX/issues/159#issuecomment-1540035916

DigilConfianz avatar May 26 '23 04:05 DigilConfianz

1 hour 30 minutes of audio has been processing for over 1 hour in the diarization... stage. I'm using an RTX 3090.

That's very strange, it should not take that long - I would expect 5-10 mins max. I suspect some bug here.

I'm guessing --batch_size doesn't affect pyannote. A setting for pyannote's batch size would be very nice to have.

I would assume most of the time is spent in the clustering step, which can be recursive and can take a long time if it's not finding satisfactory cluster sizes.

From what I'm reading, the pyannote/speaker-diarization model is slow, but word-level segmentation may be slowing it down even more.

Nah, the ASR and word-level segmentation are run independently of the diarization. The diarization is just a standard pyannote pipeline, so word-level segmentation / whisperX batching shouldn't affect it.

m-bain avatar May 26 '23 20:05 m-bain

@m-bain I'm also having extremely slow diarization. Using CLI.

Just now, to explore further, I also tried setting the --threads parameter to 50 to see if that would do something (I would prefer GPU!), and it now makes use of a variable number of threads, but well above four, which is what it had seemed to be limited to by default. There is still some GPU memory allocated even in the diarization stage, but not a ton. Very naive question: could things be slow because all of us have pyannote using the CPU for some reason? Is there a way to specify that whisperX's pyannote must use the GPU?

For reference, in case it helps:

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
2
>>> torch.version.cuda
'11.7'

geoglrb avatar May 30 '23 01:05 geoglrb

There is an issue regarding pyannote not using the GPU, but it should not occur with whisperx; see pyannote/pyannote-audio#1354 for more on this. It might have something to do with the device index though. Are both of your GPUs the same size? We're currently not passing device_index to the diarization, so we simply call .to('cuda') when loading the diarization model. This might be a problem when multiple GPUs are available.

sorgfresser avatar Jun 02 '23 10:06 sorgfresser
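
For anyone who wants to rule out the CPU fallback geoglrb asked about, here is a minimal sketch of pinning the diarization model to a specific GPU. It assumes a whisperx build whose DiarizationPipeline accepts a device argument and a pyannote.audio version where Pipeline.to() is available; the token string is a placeholder.

import torch
import whisperx
from pyannote.audio import Pipeline

# On a multi-GPU machine, "cuda" alone resolves to cuda:0; pick the index explicitly.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Option A: let whisperx build the pipeline and pass the device through.
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)

# Option B: build the pyannote pipeline yourself and move it to the GPU.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="HF_TOKEN")
pipeline.to(device)  # if this step is skipped, some setups silently stay on the CPU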

I am also seeing extremely long (i.e. overnight) diarization on the command line. The transcription completes, I get two failures in the align stage, then diarization starts and I get the following messages:

Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.2. To apply the upgrade to your files permanently, run python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1. Bad things might happen unless you revert torch to 1.x.

I then left it running overnight and it was still in the same state.

goneill avatar Jun 07 '23 13:06 goneill

Please try my suggestion in https://github.com/m-bain/whisperX/issues/399 and see if it helps you too. I'm getting around 30 sec for diarization of a 30-minute video using the standard embedding model in the pyannote/speaker-diarization pipeline (speechbrain/spkrec-ecapa-voxceleb), and around 15 sec if I change the embedding model to pyannote/embedding.

davidas1 avatar Aug 01 '23 13:08 davidas1
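
For context, the gist of the suggestion in #399 is to hand the diarization step the audio whisperX has already decoded and resampled, instead of the original file path, so pyannote does not decode the file a second time. A rough sketch along the lines of the README example, assuming an installed version whose DiarizationPipeline accepts the loaded array; the file name and token are placeholders.

import whisperx

device = "cuda"
audio = whisperx.load_audio("meeting.mp3")  # decoded once: 16 kHz mono float32 array

model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)  # reuse the loaded array, not the file path
result = whisperx.assign_word_speakers(diarize_segments, result)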

@davidas1 There is a speed improvement when switching to the whisper-loaded audio from the raw audio file, as you suggested. Thanks for that. How do I change the embedding model in code?

DigilConfianz avatar Aug 01 '23 15:08 DigilConfianz

Changing the pyannote pipeline is a bit more involved - I'm using an offline pipeline as described in https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/applying_a_pipeline.ipynb. I had to patch whisperx a bit to allow working with a custom local pipeline. Using this method you can customize the pipeline by editing the config.yaml (change the "embedding" entry to the desired model).

davidas1 avatar Aug 01 '23 15:08 davidas1
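
To make the "edit config.yaml" part concrete: a minimal sketch of running a local, customized pyannote pipeline, assuming you have downloaded the pipeline files per the offline tutorial linked above and changed the embedding entry in your local config.yaml (e.g. from speechbrain/spkrec-ecapa-voxceleb to pyannote/embedding). The local path and file name are placeholders, and wiring the result back into whisperX still needs the small patch davidas1 mentions.

import torch
import whisperx
from pyannote.audio import Pipeline

# Local copy of the pyannote/speaker-diarization pipeline config, with its
# "embedding:" entry edited to point at the desired embedding model.
pipeline = Pipeline.from_pretrained("local_pipeline/config.yaml")
pipeline.to(torch.device("cuda"))

# Feed it audio already loaded by whisperX (16 kHz mono float32 numpy array).
audio = whisperx.load_audio("meeting.mp3")
diarization = pipeline({"waveform": torch.from_numpy(audio).unsqueeze(0), "sample_rate": 16000})

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")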

Please try my suggestion in #399 and see if it helps you too. I'm getting around 30 sec for diarization of a 30-minute video using the standard embedding model in the pyannote/speaker-diarization pipeline (speechbrain/spkrec-ecapa-voxceleb), and around 15 sec if I change the embedding model to pyannote/embedding.

What??? That's crazy! Here are my timings for a 30-minute-long mp3: transcribe time: 69 seconds, align time: 10 seconds, diarization: 24 seconds - around 90 seconds in total, like 3 times longer than yours, and that's excluding the initial model loading.

Could you please suggest something like a checklist for speeding things up? I also updated to get your recent patch and it sped up my diarization massively.

datacurse avatar Aug 03 '23 02:08 datacurse

I wrote that the diarization takes 30 sec, not the entire pipeline - before the change, the diarization took almost 2 minutes. Your timings look great, other than the transcribe step, which is faster on my setup, but that's probably due to the GPU you're using.

davidas1 avatar Aug 03 '23 07:08 davidas1

Oooh, I see, that clears things up. I've got a 4090 though.

datacurse avatar Aug 06 '23 16:08 datacurse

I'm looking for some help or insight into why diarization is so slow for me.

I have a recording that is 1 minute and 14 seconds long with two native English speakers, and diarization takes 11 minutes and 49 seconds (transcription took 6 seconds). I'm running on a Mac mini with an M2 chip and 8 GB of RAM. I assume in this case it's running on the CPU, although I'm not sure with Apple silicon. I'm basically using the default example in the README for transcribing and diarizing a file.

With a longer file (27 minutes and 39 seconds), with multiple speakers, it takes 2 minutes and 47 seconds to transcribe, 1 minute and 6 seconds to align but 12 hours, 48 minutes to diarize!

dantheman0207 avatar Aug 28 '23 13:08 dantheman0207
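
On Apple silicon the slow path is almost certainly the CPU: CUDA is never available there, and whether the MPS backend gets used depends on whether the libraries in your install actually move their models to it (many default to CPU). A quick, whisperX-agnostic check:

import torch

print(torch.cuda.is_available())          # always False on Apple silicon
print(torch.backends.mps.is_available())  # True on M1/M2 with a recent PyTorch build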

Same here. I'm getting 2-3% GPU utilization and 0.9 GB of GPU memory usage.

awhillas avatar Nov 27 '23 21:11 awhillas

Same issue. Almost no GPU utilization, and 1.5 hours of diarization per 60 minutes of audio.

SergeiKarulin avatar Apr 09 '24 15:04 SergeiKarulin

Same issue. Almost no GPU utilization, and 1.5 hours of diarization per 60 minutes of audio.

same here

eplinux avatar Apr 11 '24 17:04 eplinux

I also noticed that there seems to be some throttling affecting GPU utilization on Windows 11. As soon as the terminal window is in the background, the GPU utilization drops dramatically.

eplinux avatar Apr 16 '24 16:04 eplinux

@m-bain Diarization is a key aspect when multiple speakers are having a conversation. I've been exploring different ways to speed up the transcription & diarization pipeline.

I can see lots of different options for speeding up transcription, like CTranslate2, batching, Flash Attention, Distil-Whisper, and compute type (float32/float16),

but I'm finding very limited options for speeding up diarization.

For a 20-minute audio file, with optimizations we are able to get transcriptions in around 35 seconds. But diarizing a 20-minute audio file takes roughly 1 minute via NeMo and around 45 seconds via pyannote.

Could you please share any direction we can follow to speed up the diarization process?

prkumar112451 avatar May 14 '24 14:05 prkumar112451