pyannote-audio
Pyannote 3.1.0 still on CPU only?
I am sorry to open this issue again, but I am still seeing pyannote version 3.1.0 running on CPU only.
I just installed the latest version with:
pip3 install pyannote.audio
And I can confirm I have the latest version installed with:
pip list
And yet, I see my program using just the CPU. I am testing it with an RTX A5000.
Here is my code:
import sys
from pyannote.audio import Pipeline
import torch
fileOutWav = sys.argv[1]
spkrsNo = int(sys.argv[2])
fileDiary = sys.argv[3]
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="xxxxxxxxxxxxx")
pipeline.to(torch.device("cuda"))
# 4. apply pretrained pipeline
diarization = pipeline(fileOutWav, num_speakers=spkrsNo)
# 5. print the result
with open(fileDiary, mode='w') as file_object:
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        # print(f"start={turn.start:.2f}s stop={turn.end:.2f}s speaker_{speaker}")
        print(f"start={turn.start:.2f}s stop={turn.end:.2f}s speaker_{speaker}", file=file_object)
Is there anything wrong with my code? Or any other steps I might have missed?
I am using the latest version of torch on Linux.
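A quick way to check whether the installed torch build can see the GPU at all (a minimal sketch, independent of pyannote):

import torch

# if this assertion fails, pipeline.to(torch.device("cuda")) cannot help
assert torch.cuda.is_available(), "CUDA is not visible to this torch build"
print(torch.cuda.get_device_name(0))                      # should report the RTX A5000
print(torch.cuda.memory_allocated(0) / 1e6, "MB in use")  # grows once the pipeline is moved to the GPU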
Thank you for your issue. You might want to check the FAQ if you haven't done so already.
Feel free to close this issue if you found an answer in the FAQ.
If your issue is a feature request, please read this first and update your request accordingly, if needed.
If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:
- installation
- data preparation
- model download
- etc.
Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).
Companies relying on pyannote.audio in production may contact me via email regarding:
- paid scientific consulting around speaker diarization and speech processing in general;
- custom models and tailored features (via the local tech transfer office).
This is an automated reply, generated by FAQtory
You are using the wrong pretrained pipeline. Switch from pyannote/speaker-diarization to pyannote/speaker-diarization-3.1.
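Concretely, only the pipeline name needs to change (a minimal sketch; the token value is a placeholder):

from pyannote.audio import Pipeline
import torch

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_xxxxxxxx"
)
pipeline.to(torch.device("cuda"))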
Thank you, I tried that and got this error:
pipeline.to(torch.device("cuda"))
2023-11-27T06:25:21.370318815Z AttributeError: 'NoneType' object has no attribute 'to'
Do I need to remove that line? Is that no longer needed?
Looks like you forgot to request access to this new pipeline on HuggingFace model hub.
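In that case Pipeline.from_pretrained returns nothing, which is what the 'NoneType' error above means. A small guard (just a sketch) makes the failure explicit instead of crashing later:

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_xxxxxxxx"
)
if pipeline is None:
    # likely causes: gated model access not requested/accepted, or an invalid token
    raise RuntimeError(
        "Could not load pyannote/speaker-diarization-3.1: "
        "accept the user conditions on Hugging Face and check your token"
    )
pipeline.to(torch.device("cuda"))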
Hi @hbredin, I also tried the latest 3.1.0 version with the 3.1 model. However, it is also extremely slow for me: 5 minutes of audio take around 5 minutes just to diarize.
I am having the same problem here. It is extremely slow.
Tagging this issue as cannot reproduce.
Please provide a minimal reproducible example on Google Colab.
You can also upload your audio file here to get an idea of the expected processing speed on a T4 GPU.
It seems that the problem was in my installation.
I used this as requirements.txt
(found from here):
gradio==3.38.0
--extra-index-url https://download.pytorch.org/whl/cu113
torch==2.0.1
pyannote-audio==3.1.0
And this for the Dockerfile:
FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install -y --no-install-recommends \
git \
git-lfs \
wget \
curl \
# python build dependencies \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
libxml2-dev \
libxmlsec1-dev \
libffi-dev \
liblzma-dev \
# gradio dependencies \
ffmpeg \
ca-certificates \
# fairseq2 dependencies \
libsndfile-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
PATH=/home/user/.local/bin:${PATH}
WORKDIR ${HOME}
RUN git clone https://github.com/yyuu/pyenv.git .pyenv
ENV PATH=${HOME}/.pyenv/shims:${HOME}/.pyenv/bin:${PATH}
ARG PYTHON_VERSION=3.10
RUN pyenv install ${PYTHON_VERSION} && \
pyenv global ${PYTHON_VERSION} && \
pyenv rehash && \
pip install --no-cache-dir -U pip setuptools wheel
COPY --chown=1000 ./requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /tmp/requirements.txt
COPY --chown=1000 . ${HOME}/app
ENV PYTHONPATH=${HOME}/app \
PYTHONUNBUFFERED=1 \
GRADIO_ALLOW_FLAGGING=never \
GRADIO_NUM_PORTS=1 \
GRADIO_SERVER_NAME=0.0.0.0 \
GRADIO_THEME=huggingface \
SYSTEM=spaces \
GRADIO_SERVER_PORT=7860
EXPOSE 7860
WORKDIR ${HOME}/app
CMD ["python", "app.py"]
I do not know whether it is using the GPU or not, but without this it took around 90 minutes to process a 110-minute file. Now it takes around 1-2 minutes.
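One quick way to tell (a minimal check, run inside the container) is to look at which torch build actually got installed:

import torch

print(torch.__version__)          # a "+cpu" suffix would mean a CPU-only wheel
print(torch.version.cuda)         # None on a CPU-only build
print(torch.cuda.is_available())  # must be True for pipeline.to("cuda") to have any effect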
@pourmand1376 thank you for providing your Docker code. Could you also please provide the Python code you used for the diarization with pyannote?
Looks like you forgot to request access to this new pipeline on HuggingFace model hub.
How to do that?
The same way you already did for the old pipeline. By visiting hf.co/pyannote/speaker-diarization-3.1 and agreeing on the terms.
Thanks, I could fix the error I posted above by simply re-accepting the terms at the links below:
https://hf.co/pyannote/segmentation-3.0 https://hf.co/pyannote/speaker-diarization-3.1
After that, my authorization token worked again.
I am still investigating the missing GPU usage... I'll be back as soon as I find out more.
Yes! It looks like the requirements @pourmand1376 posted above fixed the problem! Now I see the GPU being used ;)
My guess is that the decisive line is this one:
--extra-index-url https://download.pytorch.org/whl/cu113
because I tried the other lines individually and they did not do the trick.
@pourmand1376 thank you for providing your Docker code. Could you also please provide the Python code you used for the diarization with pyannote?
Here it is (this is not a minimal example; it also splits the file into per-turn clips and creates a zip file for the user):
import gradio as gr
import os
from dotenv import load_dotenv
from pydub import AudioSegment
from pathlib import Path
import torch
from pyannote.audio import Pipeline
load_dotenv()
HF_API = os.getenv("HF_API")
print(f"HF API Length: {len(HF_API)}")
DESCRIPTION = """
# Speaker Diarization v3.1.0
"""
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1", use_auth_token=HF_API
)
pipeline.to(torch.device("cuda"))
import os
import zipfile
def zip_folder(folder_path):
    folder_name = os.path.basename(folder_path)
    zip_path = f"{folder_name}.zip"
    zip_file = zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED)
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            zip_file.write(os.path.join(root, file))
    zip_file.close()
    return zip_path
import os
import shutil
def rmrf(path):
    if os.path.isfile(path):
        os.remove(path)
    elif os.path.isdir(path):
        shutil.rmtree(path)
def predict(number_of_speakers, audio_source, input_audio_mic, input_audio_file):
    # pick the microphone recording or the uploaded file
    if audio_source == "microphone":
        input_data = input_audio_mic
    else:
        input_data = input_audio_file
    print(input_data)
    # let the pipeline estimate the number of speakers when it is left at 0
    if number_of_speakers == 0:
        diarization = pipeline(input_data)
    else:
        diarization = pipeline(input_data, num_speakers=number_of_speakers)
    text_output = ""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"start={turn.start}s stop={turn.end}s speaker_{speaker}")
        text_output = (
            text_output
            + f"start={turn.start}s stop={turn.end}s speaker_{speaker}"
            + "\n"
        )
    # export one wav clip per speaker turn and zip them up
    song = AudioSegment.from_wav(input_data)
    rmrf("files")
    print(Path("files").absolute())
    Path("files").mkdir(exist_ok=True, parents=True)
    for i, (turn, _, speaker) in enumerate(diarization.itertracks(yield_label=True)):
        try:
            clipped = song[turn.start * 1000 : turn.end * 1000]
            clipped.export(f"files/{i:03}.wav", format="wav", bitrate=16000)
        except Exception as e:
            print(e)
    output_path = zip_folder("files")
    return (text_output, output_path)
def update_audio_ui(audio_source: str) -> tuple[dict, dict]:
    mic = audio_source == "microphone"
    return (
        gr.update(visible=mic, value=None),  # input_audio_mic
        gr.update(visible=not mic, value=None),  # input_audio_file
    )
with gr.Blocks(css="style.css") as demo:
    gr.Markdown(DESCRIPTION)
    with gr.Group():
        with gr.Row():
            number_of_speakers = gr.Number(
                label="Number of Speakers",
                info="Keep it zero, if you want the model to automatically detect the number of speakers",
            )
        with gr.Row() as audio_box:
            audio_source = gr.Radio(
                choices=["file", "microphone"], value="file", interactive=True
            )
            input_audio_mic = gr.Audio(
                label="Input speech",
                type="filepath",
                source="microphone",
                visible=False,
            )
            input_audio_file = gr.Audio(
                label="Input speech",
                type="filepath",
                source="upload",
                visible=True,
            )
    final_audio = gr.Audio(label="Output", visible=False)
    audio_source.change(
        fn=update_audio_ui,
        inputs=audio_source,
        outputs=[input_audio_mic, input_audio_file],
        queue=False,
        api_name=False,
    )
    input_audio_mic.change(lambda x: x, input_audio_mic, final_audio)
    input_audio_file.change(lambda x: x, input_audio_file, final_audio)
    submit = gr.Button("Submit")
    text_output = gr.Textbox(
        label="Transcribed Text",
        value="",
        interactive=False,
        lines=10,
        scale=10,
        max_lines=10,
    )
    file_output = gr.File(label="output")
    submit.click(
        fn=predict,
        inputs=[
            number_of_speakers,
            audio_source,
            input_audio_mic,
            input_audio_file,
        ],
        outputs=[text_output, file_output],
        api_name="predict",
    )

demo.queue(max_size=50).launch()
The requirements.txt and Dockerfile above worked for me too. Specifically, what I did was create a requirements.txt file with the contents:
gradio==3.38.0
--extra-index-url https://download.pytorch.org/whl/cu113
torch==2.0.1
pyannote-audio==3.1.0
Then install it with pip install -r requirements.txt.
Now, I can run some simple code:
In [1]: from pyannote.audio import Pipeline
In [2]: import torch
In [3]: pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
torchvision is not available - cannot save figures
In [4]: pipeline.to(torch.device("cuda"))
Out[4]: <pyannote.audio.pipelines.speaker_diarization.SpeakerDiarization at 0x7f2ce8f143d0>
In [5]: diarization = pipeline("/tmp/tmphgpfklya.wav")
And $ nvidia-smi -l 1 shows the GPU being used.
It took me quite a while to find this solution. Should it be added to the README? Why is this version of torch required for the GPU to be properly utilized?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
What is weirder on my side is that the 3.1 model sometimes runs on GPU and sometimes on CPU, while the 3.0 model always runs on GPU. So I specifically wrote a bit of code to choose between models: I always start with 3.1 because it does the segmentation faster, but if I see within 5 seconds that it is using the CPU instead of the GPU, I cancel that run and re-run it with 3.0. Who knows...
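For what it is worth, here is a small sketch of picking and logging the device explicitly before each run (assuming pipeline is already loaded); it will not explain the intermittent CPU fallback, but it at least makes the chosen device visible without watching nvidia-smi:

import torch

# fall back to CPU explicitly instead of silently, and say so
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
pipeline.to(device)
print(f"running diarization on {device}")
diarization = pipeline("audio.wav")  # "audio.wav" is a placeholder path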