pyannote-audio
Pyannote 3.1.0 still on CPU only?
I am sorry to open this issue again, but I am still seeing pyannote version 3.1.0 running on CPU only.
I just installed the latest version with:
pip3 install pyannote.audio
And I can confirm I have the latest version installed with:
pip list
And yet, I see my program using just the CPU. I am testing it with an RTX A5000.
Here is my code:
import sys
from pyannote.audio import Pipeline
import torch
fileOutWav = sys.argv[1]
spkrsNo = int(sys.argv[2])
fileDiary = sys.argv[3]
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="xxxxxxxxxxxxx")
pipeline.to(torch.device("cuda"))
# 4. apply pretrained pipeline
diarization = pipeline(fileOutWav, num_speakers=spkrsNo)
# 5. print the result
with open(fileDiary, mode='w') as file_object:
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        # print(f"start={turn.start:.2f}s stop={turn.end:.2f}s speaker_{speaker}")
        print(f"start={turn.start:.2f}s stop={turn.end:.2f}s speaker_{speaker}", file=file_object)
Is there anything wrong with my code? Or any other steps I might have missed?
I am using the latest version of torch on Linux.
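A quick way to check whether the installed torch build can see the GPU at all (a minimal sketch, independent of pyannote):

import torch

# if this assertion fails, pipeline.to(torch.device("cuda")) cannot help
assert torch.cuda.is_available(), "CUDA is not visible to this torch build"
print(torch.cuda.get_device_name(0))                      # should report the RTX A5000
print(torch.cuda.memory_allocated(0) / 1e6, "MB in use")  # grows once the pipeline is moved to the GPU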
Thank you for your issue. You might want to check the FAQ if you haven't done so already.
Feel free to close this issue if you found an answer in the FAQ.
If your issue is a feature request, please read this first and update your request accordingly, if needed.
If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:
- installation
- data preparation
- model download
- etc.
Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).
Companies relying on pyannote.audio in production may contact me via email regarding:
- paid scientific consulting around speaker diarization and speech processing in general;
- custom models and tailored features (via the local tech transfer office).
This is an automated reply, generated by FAQtory
You are using the wrong pretrained pipeline. Switch from pyannote/speaker-diarization to pyannote/speaker-diarization-3.1.
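Concretely, only the pipeline name needs to change (a minimal sketch; the token value is a placeholder):

from pyannote.audio import Pipeline
import torch

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_xxxxxxxx"
)
pipeline.to(torch.device("cuda"))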
Thank you, I tried that and got this error:
pipeline.to(torch.device("cuda"))
2023-11-27T06:25:21.370318815Z AttributeError: 'NoneType' object has no attribute 'to'
Do I need to remove that line? Is that no longer needed?
Looks like you forgot to request access to this new pipeline on HuggingFace model hub.
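In that case Pipeline.from_pretrained returns nothing, which is what the 'NoneType' error above means. A small guard (just a sketch) makes the failure explicit instead of crashing later:

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_xxxxxxxx"
)
if pipeline is None:
    # likely causes: gated model access not requested/accepted, or an invalid token
    raise RuntimeError(
        "Could not load pyannote/speaker-diarization-3.1: "
        "accept the user conditions on Hugging Face and check your token"
    )
pipeline.to(torch.device("cuda"))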
Hi @hbredin, I also tried the latest 3.1.0 version with the 3.1 model. However, it is also extremely slow for me: 5 minutes of audio take around 5 minutes just to diarize.
I am having the same problem here. It is extremely slow.
Tagging this issue as cannot reproduce.
Please provide a minimal reproducible example on Google Colab.
You can also upload your audio file here to get an idea of the expected processing speed on a T4 GPU.
It seems that the problem was in my installation.
I used this as requirements.txt
(found from here):
gradio==3.38.0
--extra-index-url https://download.pytorch.org/whl/cu113
torch==2.0.1
pyannote-audio==3.1.0
And this for the Dockerfile:
FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install -y --no-install-recommends \
git \
git-lfs \
wget \
curl \
# python build dependencies \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
libxml2-dev \
libxmlsec1-dev \
libffi-dev \
liblzma-dev \
# gradio dependencies \
ffmpeg \
ca-certificates \
# fairseq2 dependencies \
libsndfile-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
PATH=/home/user/.local/bin:${PATH}
WORKDIR ${HOME}
RUN git clone https://github.com/yyuu/pyenv.git .pyenv
ENV PATH=${HOME}/.pyenv/shims:${HOME}/.pyenv/bin:${PATH}
ARG PYTHON_VERSION=3.10
RUN pyenv install ${PYTHON_VERSION} && \
pyenv global ${PYTHON_VERSION} && \
pyenv rehash && \
pip install --no-cache-dir -U pip setuptools wheel
COPY --chown=1000 ./requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /tmp/requirements.txt
COPY --chown=1000 . ${HOME}/app
ENV PYTHONPATH=${HOME}/app \
PYTHONUNBUFFERED=1 \
GRADIO_ALLOW_FLAGGING=never \
GRADIO_NUM_PORTS=1 \
GRADIO_SERVER_NAME=0.0.0.0 \
GRADIO_THEME=huggingface \
SYSTEM=spaces \
GRADIO_SERVER_PORT=7860
EXPOSE 7860
WORKDIR ${HOME}/app
CMD ["python", "app.py"]
I do not know whether it is using the GPU or not, but without this it took around 90 minutes to process a 110-minute file. Now it takes around 1-2 minutes.
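One quick way to tell (a minimal check, run inside the container) is to look at which torch build actually got installed:

import torch

print(torch.__version__)          # a "+cpu" suffix would mean a CPU-only wheel
print(torch.version.cuda)         # None on a CPU-only build
print(torch.cuda.is_available())  # must be True for pipeline.to("cuda") to have any effect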
@pourmand1376 thank you for providing your Docker code. Could you also please provide the Python code you used for the diarization with pyannote?
Looks like you forgot to request access to this new pipeline on HuggingFace model hub.
How to do that?
The same way you already did for the old pipeline. By visiting hf.co/pyannote/speaker-diarization-3.1 and agreeing on the terms.
Thanks, I could fix the error I posted above by simply re-accepting the terms at the links below:
https://hf.co/pyannote/segmentation-3.0 https://hf.co/pyannote/speaker-diarization-3.1
After that, my authorization token worked again.
I am still investigating the missing GPU usage... I'll be back as soon as I find out more.
Yes! It looks like the requirements @pourmand1376 posted above fixed the problem! Now I see the GPU being used ;)
My guess is that the decisive line is this one:
--extra-index-url https://download.pytorch.org/whl/cu113
because I tried the other lines individually and they did not do the trick.
@pourmand1376 thank you for providing your Docker code. Could you also please provide the Python code you used for the diarization with pyannote?
Here it is (this is not a minimal example; it also splits the file into per-turn clips and creates a zip file for the user):
import gradio as gr
import os
from dotenv import load_dotenv
from pydub import AudioSegment
from pathlib import Path
import torch
from pyannote.audio import Pipeline
load_dotenv()
HF_API = os.getenv("HF_API")
print(f"HF API Length: {len(HF_API)}")
DESCRIPTION = """
# Speaker Diarization v3.1.0
"""
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1", use_auth_token=HF_API
)
pipeline.to(torch.device("cuda"))
import os
import zipfile
def zip_folder(folder_path):
    folder_name = os.path.basename(folder_path)
    zip_path = f"{folder_name}.zip"
    zip_file = zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED)
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            zip_file.write(os.path.join(root, file))
    zip_file.close()
    return zip_path
import os
import shutil
def rmrf(path):
    if os.path.isfile(path):
        os.remove(path)
    elif os.path.isdir(path):
        shutil.rmtree(path)
def predict(number_of_speakers, audio_source, input_audio_mic, input_audio_file):
    # pick the microphone recording or the uploaded file
    if audio_source == "microphone":
        input_data = input_audio_mic
    else:
        input_data = input_audio_file
    print(input_data)
    # let the pipeline estimate the number of speakers when it is left at 0
    if number_of_speakers == 0:
        diarization = pipeline(input_data)
    else:
        diarization = pipeline(input_data, num_speakers=number_of_speakers)
    text_output = ""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"start={turn.start}s stop={turn.end}s speaker_{speaker}")
        text_output = (
            text_output
            + f"start={turn.start}s stop={turn.end}s speaker_{speaker}"
            + "\n"
        )
    # export one wav clip per speaker turn and zip them up
    song = AudioSegment.from_wav(input_data)
    rmrf("files")
    print(Path("files").absolute())
    Path("files").mkdir(exist_ok=True, parents=True)
    for i, (turn, _, speaker) in enumerate(diarization.itertracks(yield_label=True)):
        try:
            clipped = song[turn.start * 1000 : turn.end * 1000]
            clipped.export(f"files/{i:03}.wav", format="wav", bitrate=16000)
        except Exception as e:
            print(e)
    output_path = zip_folder("files")
    return (text_output, output_path)
def update_audio_ui(audio_source: str) -> tuple[dict, dict]:
    mic = audio_source == "microphone"
    return (
        gr.update(visible=mic, value=None),  # input_audio_mic
        gr.update(visible=not mic, value=None),  # input_audio_file
    )
with gr.Blocks(css="style.css") as demo:
    gr.Markdown(DESCRIPTION)
    with gr.Group():
        with gr.Row():
            number_of_speakers = gr.Number(
                label="Number of Speakers",
                info="Keep it zero, if you want the model to automatically detect the number of speakers",
            )
        with gr.Row() as audio_box:
            audio_source = gr.Radio(
                choices=["file", "microphone"], value="file", interactive=True
            )
            input_audio_mic = gr.Audio(
                label="Input speech",
                type="filepath",
                source="microphone",
                visible=False,
            )
            input_audio_file = gr.Audio(
                label="Input speech",
                type="filepath",
                source="upload",
                visible=True,
            )
    final_audio = gr.Audio(label="Output", visible=False)
    audio_source.change(
        fn=update_audio_ui,
        inputs=audio_source,
        outputs=[input_audio_mic, input_audio_file],
        queue=False,
        api_name=False,
    )
    input_audio_mic.change(lambda x: x, input_audio_mic, final_audio)
    input_audio_file.change(lambda x: x, input_audio_file, final_audio)
    submit = gr.Button("Submit")
    text_output = gr.Textbox(
        label="Transcribed Text",
        value="",
        interactive=False,
        lines=10,
        scale=10,
        max_lines=10,
    )
    file_output = gr.File(label="output")
    submit.click(
        fn=predict,
        inputs=[
            number_of_speakers,
            audio_source,
            input_audio_mic,
            input_audio_file,
        ],
        outputs=[text_output, file_output],
        api_name="predict",
    )

demo.queue(max_size=50).launch()
The requirements.txt and Dockerfile above worked for me too. Specifically, what I did was create a requirements.txt file with the contents:
gradio==3.38.0
--extra-index-url https://download.pytorch.org/whl/cu113
torch==2.0.1
pyannote-audio==3.1.0
Then install it with pip install -r requirements.txt.
Now, I can run some simple code:
In [1]: from pyannote.audio import Pipeline
In [2]: import torch
In [3]: pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
torchvision is not available - cannot save figures
In [4]: pipeline.to(torch.device("cuda"))
Out[4]: <pyannote.audio.pipelines.speaker_diarization.SpeakerDiarization at 0x7f2ce8f143d0>
In [5]: diarization = pipeline("/tmp/tmphgpfklya.wav")
And $ nvidia-smi -l 1 shows the GPU being used.
It took me quite a while to find this solution. Should it be added to the README? Why is this version of torch required for the GPU to be properly utilized?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
What is weirder on my side is that the 3.1 model sometimes runs on GPU and sometimes on CPU, while the 3.0 model always runs on GPU. So I specifically wrote a bit of code to choose between models: I always start with 3.1 because it does the segmentation faster, but if I see within 5 seconds that it is using the CPU instead of the GPU, I cancel that run and re-run it with 3.0. Who knows...
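For what it is worth, here is a small sketch of picking and logging the device explicitly before each run (assuming pipeline is already loaded); it will not explain the intermittent CPU fallback, but it at least makes the chosen device visible without watching nvidia-smi:

import torch

# fall back to CPU explicitly instead of silently, and say so
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
pipeline.to(device)
print(f"running diarization on {device}")
diarization = pipeline("audio.wav")  # "audio.wav" is a placeholder path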