
WhisperX - Speaker Diarization

Open mjtechguy opened this issue 1 year ago • 19 comments

I think it would be great to be able to leverage WhisperX and speaker diarization. Any plans to do this?

https://github.com/m-bain/whisperX

mjtechguy avatar Jun 10 '24 17:06 mjtechguy

Hi, I've made a TODO list in the README and added it. I'll work on it later!

jhj0517 avatar Jun 12 '24 07:06 jhj0517

I'm testing whisperX and listing some issues here:

  • Incompatible torch version
    • whisperX models were trained on torch 1.10.0+cu102 and this WebUI uses torch 2.3.1+cu121
  • Slow transcription
    • This may be due to the incompatible torch version, but it was much slower than other implementations:
      16.5 s to transcribe 30 s of audio input with large-v2
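A handy way to compare numbers like this across implementations is the real-time factor (RTF); the helper below is purely illustrative, not part of the WebUI:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1.0 means faster than real time; lower is better."""
    return processing_seconds / audio_seconds

# The figure above: 16.5 s of processing for 30 s of audio with large-v2.
rtf = real_time_factor(16.5, 30.0)  # 0.55
```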

jhj0517 avatar Jun 22 '24 14:06 jhj0517

@jhj0517 Looking at the speaker diarization, it seems to use a separate model from Hugging Face, so it can be integrated without the whisperX model. @mjtechguy

moda20 avatar Jun 23 '24 13:06 moda20

Yes, it seems that whisperX post-processes diarization on top of faster-whisper's output. So I think I should modularize the diarization step and integrate it with faster-whisper.
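A minimal sketch of that modularization, assuming hypothetical segment/turn dictionaries (the WebUI's real data structures may differ): each faster-whisper segment gets the speaker label of the diarization turn it overlaps most in time.

```python
def assign_speakers(segments, turns):
    """segments: faster-whisper output as [{'start', 'end', 'text'}, ...];
    turns: diarization output as [{'start', 'end', 'speaker'}, ...]."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            # Length of the time interval shared by the segment and the turn.
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled
```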

jhj0517 avatar Jun 23 '24 18:06 jhj0517

Speaker diarization is now enabled in #181.

The speaker label is embedded into the text with a | divider. For example,

w/ diarization:

1
00:00:00,000 --> 00:00:04,879
SPEAKER_00|Now, as all books not primarily intended as picture books

2
00:00:04,879 --> 00:00:08,880
SPEAKER_00|consist principally of types composed to form letterpress,

w/o diarization:

1
00:00:00,000 --> 00:00:04,879
Now, as all books not primarily intended as picture books

2
00:00:04,879 --> 00:00:08,880
consist principally of types composed to form letterpress,
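Downstream tools can split the label back out of this format; a small illustrative parser (not part of the WebUI):

```python
def split_speaker(line):
    """Split 'SPEAKER_00|text' into (speaker, text).
    Lines without the divider come back with speaker=None."""
    if "|" in line:
        speaker, text = line.split("|", 1)  # split only on the first divider
        return speaker, text
    return None, line
```

One tradeoff of in-band delimiters like this: a literal "|" inside the transcript itself would be misread as a divider, which is why the parser splits only on the first occurrence.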

Note: To download the diarization model for the first time, you need a Hugging Face token, and you must manually go to https://huggingface.co/pyannote/speaker-diarization-3.1 and agree to their terms.

jhj0517 avatar Jun 26 '24 13:06 jhj0517

@jhj0517 Trying the latest version with diarization, but I am getting this error; it seems it downloaded the model but didn't finish the diarization.

2024-06-26T19:36:55.526316618Z Traceback (most recent call last):
2024-06-26T19:36:55.526636537Z   File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/queueing.py", line 527, in process_events
2024-06-26T19:36:55.526654992Z     response = await route_utils.call_process_api(
2024-06-26T19:36:55.526661835Z                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-26T19:36:55.526667215Z   File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/route_utils.py", line 270, in call_process_api
2024-06-26T19:36:55.526672605Z     output = await app.get_blocks().process_api(
2024-06-26T19:36:55.526677936Z              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-26T19:36:55.526685630Z   File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/blocks.py", line 1856, in process_api
2024-06-26T19:36:55.526693645Z     data = await self.postprocess_data(fn_index, result["prediction"], state)
2024-06-26T19:36:55.526700999Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-26T19:36:55.526709536Z   File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/blocks.py", line 1634, in postprocess_data
2024-06-26T19:36:55.526717781Z     self.validate_outputs(fn_index, predictions)  # type: ignore
2024-06-26T19:36:55.526725736Z     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-26T19:36:55.526734253Z   File "/Whisper-WebUI/venv/lib/python3.11/site-packages/gradio/blocks.py", line 1610, in validate_outputs
2024-06-26T19:36:55.526743881Z     raise ValueError(
2024-06-26T19:36:55.526752507Z ValueError: An event handler (transcribe_file) didn't receive enough output values (needed: 2, received: 1).
2024-06-26T19:36:55.526760563Z Wanted outputs:
2024-06-26T19:36:55.526767927Z     [<gradio.components.textbox.Textbox object at 0x78caea9c2590>, <gradio.templates.Files object at 0x78caea231350>]
2024-06-26T19:36:55.526795169Z Received outputs:
2024-06-26T19:36:55.526800238Z     [None]

moda20 avatar Jun 26 '24 19:06 moda20

@moda20 Can you show the full log before the Traceback? This could happen if the model failed to load.

To use the pyannote models, you need to go to

  1. https://huggingface.co/pyannote/speaker-diarization-3.1
  2. https://huggingface.co/pyannote/segmentation-3.0

and manually accept their terms, then enter your Hugging Face token.

It may be inconvenient, but it's their requirement for now. I hope there is a better way than this.
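For context, pyannote's documented entry point takes the token directly, e.g. `Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=token)`. A tiny sanity check that catches obviously malformed tokens before a failed gated download (heuristic only; it assumes current Hugging Face tokens start with `hf_`):

```python
def looks_like_hf_token(token):
    # Heuristic: current Hugging Face user access tokens begin with "hf_".
    # Catches common mistakes like pasting a URL or leaving the field empty.
    return token.startswith("hf_") and len(token) > 8
```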

jhj0517 avatar Jun 26 '24 20:06 jhj0517

@jhj0517 Yes, accepting the conditions of the second (segmentation) HF model did the trick. I didn't see that one in the README; that's why I missed it.

~EDIT: I am able to transcribe using small and small.en only; I run into the same error message as before for anything beyond those. Also, I don't get any logs before that error, though I am using the Docker version of the WebUI, which might be why.~ False alarm: it was a VRAM issue.

moda20 avatar Jun 26 '24 20:06 moda20

@moda20 Trying to run the diarization models on CPU may help in that case. You can change the device in the dropdown.

(screenshot: device dropdown)

jhj0517 avatar Jun 26 '24 20:06 jhj0517

I accepted both terms of service for the stated models and added a read token, but it still gives an error.

Tom-Neverwinter avatar Jun 26 '24 22:06 Tom-Neverwinter

When the file format is TXT, the first character of the output is swallowed by the speaker delimiter. This may be hard to see in Japanese, but it looks like the following.

w/ diarization:

SPEAKER_04|部科学省の数理データサイエンスAI教育プログラム認定制度に
SPEAKER_04|ータサイエンス教育プログラムの所持申請を行ったという報告がありまして、

w/o diarization:

文部科学省の数理データサイエンスAI教育プログラム認定制度に
データサイエンス教育プログラムの所持申請を行ったという報告がありまして、

(Note the leading 文 and デ are missing from the diarized lines. The text roughly translates as: "there was a report that an application was submitted under MEXT's Mathematics, Data Science and AI education program accreditation system...")
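One plausible shape of such a bug (purely illustrative, not the project's actual code) is an off-by-one when the speaker prefix is attached, so the first character of the text is dropped:

```python
def prefix_speaker_buggy(speaker, text):
    # Bug: text[1:] silently drops the first character of the line.
    return speaker + "|" + text[1:]

def prefix_speaker_fixed(speaker, text):
    # Plain concatenation keeps the text intact.
    return speaker + "|" + text
```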

cookiexND avatar Jun 27 '24 01:06 cookiexND

@cookiexND Thanks for reporting this. It's fixed in #183

@Tom-Neverwinter Can you provide more information about the error you received?

jhj0517 avatar Jun 27 '24 06:06 jhj0517

Hi there, I guess I'm in the same boat as @Tom-Neverwinter or @moda20 before... This is the output in my case:

Error transcribing file: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: auto

Traceback (most recent call last):
  File "/root/Whisper-WebUI-Mac/Whisper-WebUI/venv/lib/python3.12/site-packages/gradio/queueing.py", line 527, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Whisper-WebUI-Mac/Whisper-WebUI/venv/lib/python3.12/site-packages/gradio/route_utils.py", line 270, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Whisper-WebUI-Mac/Whisper-WebUI/venv/lib/python3.12/site-packages/gradio/blocks.py", line 1856, in process_api
    data = await self.postprocess_data(fn_index, result["prediction"], state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Whisper-WebUI-Mac/Whisper-WebUI/venv/lib/python3.12/site-packages/gradio/blocks.py", line 1634, in postprocess_data
    self.validate_outputs(fn_index, predictions)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Whisper-WebUI-Mac/Whisper-WebUI/venv/lib/python3.12/site-packages/gradio/blocks.py", line 1610, in validate_outputs
    raise ValueError(
ValueError: An event handler (transcribe_file) didn't receive enough output values (needed: 2, received: 1).
Wanted outputs:
    [<gradio.components.textbox.Textbox object at 0x7820bd082930>, <gradio.templates.Files object at 0x7820bd082960>]
Received outputs:
    [None]

I accepted the TOS under https://huggingface.co/pyannote/speaker-diarization-3.1 and https://huggingface.co/pyannote/segmentation-3.0 and created a token with read permissions. But I'm not quite sure if I have done it right. Could you perhaps explain how to manually download the models and where to put them?

Thank you!

linuxlurak avatar Aug 29 '24 20:08 linuxlurak

@linuxlurak The first error seems to be that the device to use for transcribing is not selected; make sure you have selected CPU or GPU in the UI selector.

The second error (Received outputs: [None]) is a higher-level error that can be triggered by a multitude of issues.

moda20 avatar Aug 29 '24 20:08 moda20

Thanks! I'm not sure where to select the device... In the Gradio WebUI there is only CPU to select in the Diarization section. I'm running your project in a Proxmox container, FYI. No noteworthy GPU available.

linuxlurak avatar Aug 29 '24 20:08 linuxlurak

And to add: transcription works flawlessly. Only diarization doesn't.

linuxlurak avatar Aug 29 '24 20:08 linuxlurak

Hi @linuxlurak. I tried to fix the bug in #244; can you check the latest version? You can update the WebUI with update.sh.

jhj0517 avatar Aug 30 '24 06:08 jhj0517

Hi @jhj0517 Thanks, I checked out your fix. It works!

FYI: I had to install the pytubefix Python module in Whisper-WebUI's venv after updating with update.sh.

linuxlurak avatar Aug 30 '24 16:08 linuxlurak

Another question, though kind of off topic: could you point me to the part of the code in app.py where I can set the number of threads or CPUs? By default app.py seems to use only 4 CPUs.

linuxlurak avatar Aug 30 '24 18:08 linuxlurak
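As a general pointer for the thread-count question above: PyTorch-based pipelines usually honor `torch.set_num_threads(n)`, and faster-whisper's `WhisperModel` accepts a `cpu_threads` argument, though where (or whether) Whisper-WebUI exposes these may differ. A stdlib-only sketch for picking a sensible count:

```python
import os

def pick_thread_count(requested=None):
    """Use all detected CPUs unless the caller asks for fewer."""
    available = os.cpu_count() or 1
    return min(requested, available) if requested is not None else available

# Usage (names per each library's own docs):
#   torch.set_num_threads(pick_thread_count())
#   WhisperModel(..., cpu_threads=pick_thread_count())
```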