pyannote-audio
speaker-diarization-3.1 using 0% gpu
I'm using Google Colab and it is utilizing 0% GPU. Sometimes it spikes to 100% for a second and then goes back to 0%. The audio is about 1.5 hours long. Is this normal behaviour? Keep in mind the CPU is at 100% almost all of the time.
Thank you for your issue. You might want to check the FAQ if you haven't done so already.
Feel free to close this issue if you found an answer in the FAQ.
If your issue is a feature request, please read this first and update your request accordingly, if needed.
If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:
- installation
- data preparation
- model download
- etc.
Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).
Companies relying on pyannote.audio in production may contact me via email regarding:
- paid scientific consulting around speaker diarization and speech processing in general;
- custom models and tailored features (via the local tech transfer office).
This is an automated reply, generated by FAQtory
Same here
I somehow fixed this on my M3 by running:
pip uninstall pyannote.audio
conda install -c conda-forge pyannote.core
pip install pyannote.audio
This worked for me after uninstalling onnxruntime and onnxruntime-gpu:
pip install optimum[onnxruntime-gpu]
See this link:
https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/gpu#installation
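To check that the GPU build of onnxruntime is actually the one being picked up after the reinstall, you can list its execution providers (just a quick sanity check, not part of pyannote itself):

import onnxruntime as ort

# The CPU-only package exposes only 'CPUExecutionProvider'; after installing
# onnxruntime-gpu you should also see 'CUDAExecutionProvider' in this list.
print(ort.get_available_providers())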
@KennethTrinh I don't understand how that'll work, because the new version isn't dependent on ONNX.
@KennethTrinh I tested your solution and it didn't work for me.
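For anyone else hitting this, it may be worth ruling out the obvious first: confirm that torch can see the GPU at all, since if it can't, the pipeline silently runs on CPU and 0% GPU usage is expected (a generic check, nothing pyannote-specific):

import torch

# False here means the pipeline's .to(device) falls back to CPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))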
Oops, I was running in a different environment than you guys (an EC2 instance) - apologies! I tried to reproduce in a Colab notebook with a plain old T4 GPU.
Non-blocking code to poll the GPU for usage and memory (I don't have Cloud Shell since I'm poor!) - run this first if you don't have Cloud Shell:
import subprocess
import threading
import time

def run_nvidia_smi():
    # Poll nvidia-smi once per second and append the result to a log file,
    # so GPU usage can be inspected while the diarization cell is running.
    while True:
        try:
            output = subprocess.check_output(
                ['nvidia-smi', '--query-gpu=timestamp,utilization.gpu,utilization.memory', '--format=csv']
            )
            output_str = output.decode('utf-8').strip()
            with open('output.log', 'a') as f:
                f.write(output_str + '\n')
        except subprocess.CalledProcessError as e:
            print(f"Error running nvidia-smi: {e}")
        time.sleep(1)

thread = threading.Thread(target=run_nvidia_smi, daemon=True)  # daemon so it dies with the kernel
thread.start()
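If you'd rather stop the poller cleanly once diarization finishes instead of letting it run forever, here is a minimal variant using threading.Event (my own tweak, not in the original snippet):

import subprocess
import threading

stop_event = threading.Event()

def run_nvidia_smi_stoppable():
    # Same polling loop, but exits when stop_event is set;
    # wait(1) doubles as the one-second sleep between polls.
    while not stop_event.wait(1):
        try:
            output = subprocess.check_output(
                ['nvidia-smi', '--query-gpu=timestamp,utilization.gpu,utilization.memory', '--format=csv']
            )
            with open('output.log', 'a') as f:
                f.write(output.decode('utf-8').strip() + '\n')
        except subprocess.CalledProcessError as e:
            print(f"Error running nvidia-smi: {e}")

thread = threading.Thread(target=run_nvidia_smi_stoppable)
thread.start()
# later, once the diarization cell has finished:
stop_event.set()
thread.join()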
Code to run the diarization - don't forget to define your TOKEN beforehand:
!pip install -q --upgrade pyannote.audio
!pip install -q transformers==4.35.2
!pip install -q datasets

import torch
from pyannote.audio import Pipeline
from datasets import load_dataset

DIARIZATION_MODEL = "pyannote/speaker-diarization-3.1"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stream a single multi-speaker sample rather than downloading the whole set.
concatenated_librispeech = load_dataset("sanchit-gandhi/concatenated_librispeech", split="train", streaming=True)
sample = next(iter(concatenated_librispeech))

diarization_pipeline = Pipeline.from_pretrained(
    DIARIZATION_MODEL,
    use_auth_token=TOKEN,
).to(device)

# Pass an in-memory waveform of shape (channel, time) instead of a file path.
input_tensor = torch.from_numpy(sample['audio']['array']).float().unsqueeze(0)  # (channel, time)
output = diarization_pipeline({"waveform": input_tensor, "sample_rate": sample['audio']['sampling_rate']})
output
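As a side note, the raw output above is a pyannote.core Annotation; if you want readable speaker turns, something like this works:

# Print each speaker turn as start/end times plus the speaker label.
for turn, _, speaker in output.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")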
My logs show that the GPU was indeed used (albeit very little, but my audio was only (347360,) samples, so this may change with your audio). The key difference is that I'm passing in a torch tensor of shape (channel, time), but you guys are just passing the .wav file?
timestamp, utilization.gpu [%], utilization.memory [%]
2023/11/27 19:34:24.171, 1 %, 0 %
2023/11/27 19:34:25.188, 6 %, 1 %
2023/11/27 19:34:26.208, 5 %, 0 %
I found that when passing a filename to the speaker diarization pipeline, I got very poor performance. Upon profiling I discovered this was due to many, many calls to get_torchaudio_info and torchaudio._backend.ffmpeg._load_audio_fileobj. This indicated to me that the file was being reprocessed many times unnecessarily (all of it coming from the "crop" method). I noticed that there were very different codepaths if the incoming file object already had a "waveform" computed, so I did the following:
import torchaudio

# Decode the file once up front and hand the pipeline an in-memory waveform,
# so crop() never has to re-open the file.
waveform, sample_rate = torchaudio.load("segment_0.wav")
audio_file = {"waveform": waveform, "sample_rate": sample_rate}
and then I passed audio_file to my pipeline. This took my runtime from 5m to 13s.
I suspect it would be straightforward to change the code to perform this step internally up front and save the user the trouble... and it would probably also simplify a lot of the downstream code, since it could then assume it always has a waveform.
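Until something like that lands upstream, a tiny wrapper does the job on the caller's side; diarize_file below is a hypothetical helper of my own, not part of pyannote's API:

import torchaudio
from pyannote.audio import Pipeline

def diarize_file(pipeline: Pipeline, path: str):
    # Hypothetical convenience wrapper: load the audio once so the pipeline
    # works on the in-memory waveform instead of re-reading the file.
    waveform, sample_rate = torchaudio.load(path)
    return pipeline({"waveform": waveform, "sample_rate": sample_rate})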
https://github.com/pyannote/pyannote-audio/issues/1557#issuecomment-1922466847 (the comment above) also solved it for me, using an eGPU and CUDA.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.