
How to finetune clustering and embedding models in speaker diarization pipeline?

Open rkapur102 opened this issue 1 year ago • 7 comments

Hi @hbredin , how can I finetune the clustering and embedding models in the SpeakerDiarization pipeline? All tutorials only refer to finetuning the segmentation model. Any help would be appreciated.

rkapur102 avatar Dec 04 '23 04:12 rkapur102

Thank you for your issue. We found the following entry in the FAQ which you may find helpful:

Feel free to close this issue if you found an answer in the FAQ.

If your issue is a feature request, please read this first and update your request accordingly, if needed.

If your issue is a bug report, please provide a minimal reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:

  • installation
  • data preparation
  • model download
  • etc.

Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).

Companies relying on pyannote.audio in production may contact me via email regarding:

  • paid scientific consulting around speaker diarization and speech processing in general;
  • custom models and tailored features (via the local tech transfer office).

This is an automated reply, generated by FAQtory

github-actions[bot] avatar Dec 04 '23 04:12 github-actions[bot]

Fine-tuning the speaker embedding model is currently not implemented, as pyannote relies on external libraries for that part.

You can however tune the clustering threshold to your use case. This tutorial may help.
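In essence, tuning the clustering threshold is a one-dimensional search that minimizes diarization error rate (DER) on a development set. Here is a minimal, pipeline-agnostic sketch; the `der_for` callback is a placeholder for running the pipeline at a given threshold and scoring the result:

```python
from typing import Callable, Iterable, Tuple


def tune_threshold(
    candidates: Iterable[float],
    der_for: Callable[[float], float],
) -> Tuple[float, float]:
    """Return (best_threshold, best_der) over a grid of candidate thresholds."""
    best_t, best_der = None, float("inf")
    for t in candidates:
        der = der_for(t)  # run the pipeline at threshold t and score it
        if der < best_der:
            best_t, best_der = t, der
    return best_t, best_der
```

In practice, `der_for` would re-instantiate the pipeline with the candidate threshold (e.g. via `pipeline.instantiate({"clustering": {"threshold": t}})`, assuming that hyperparameter path for your pipeline version) and score the output on your development set with `pyannote.metrics.diarization.DiarizationErrorRate`.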

hbredin avatar Dec 04 '23 19:12 hbredin

@hbredin is there a way to finetune the speaker embedding model separately and then pass it into the pyannote pipeline? I saw it is the ECAPA-TDNN model; it seems it can be trained from scratch, but I'm looking to finetune it instead. Could I finetune it on my own and pass the resulting model to the "embedding" parameter of SpeakerDiarization()? Do you know of any tutorials on this?
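Something like the following is what I have in mind. This is an untested sketch: the checkpoint path, the hyperparameter values, and the exact constructor arguments are my assumptions, not a verified recipe.

```python
def build_pipeline_with_custom_embedding(embedding_checkpoint: str):
    """Untested sketch: assemble the diarization pipeline around a
    locally fine-tuned embedding checkpoint (path is a placeholder)."""
    # Imported lazily so the sketch can be read without pyannote installed.
    from pyannote.audio.pipelines import SpeakerDiarization

    pipeline = SpeakerDiarization(
        segmentation="pyannote/segmentation",
        embedding=embedding_checkpoint,  # e.g. "path/to/finetuned-ecapa"
        clustering="AgglomerativeClustering",
    )
    # Hyperparameter values below are placeholders, not tuned defaults.
    pipeline.instantiate({
        "segmentation": {"min_duration_off": 0.0},
        "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": 0.7},
    })
    return pipeline
```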

rkapur102 avatar Dec 04 '23 20:12 rkapur102

I think this is a question for speechbrain project.

hbredin avatar Dec 04 '23 21:12 hbredin

If I understand this correctly (and I may not), the diarization pipeline 3.0 seems to use the WeSpeaker embeddings, while older versions of the pipeline used the SpeechBrain ones. I am a bit confused because the Plaquet paper, which otherwise seems to be a good description of the pipeline, still uses the SpeechBrain embeddings; maybe things changed when the pipeline became available on Hugging Face.

picheny-nyu avatar Dec 13 '23 03:12 picheny-nyu

Plaquet's paper comes with a companion repository (https://github.com/FrenchKrab/IS2023-powerset-diarization) that does include a pipeline based on speechbrain ECAPA-TDNN.

hbredin avatar Dec 13 '23 11:12 hbredin

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 11 '24 04:06 stale[bot]