Speaker Diarization with Marblenet and ClusterDiarizer issue

Open ddebnath228 opened this issue 2 years ago • 3 comments

Describe the bug

I am not getting correct timestamps for the speech segments, and many speech chunks are removed. I am using the pretrained Marblenet and speakerdiarization_speakernet models. A lot of speech between the chunks is dropped, as shown below:

['0.0 0.23 speaker_1', '1.55 1.99 speaker_2', '3.18 4.25 speaker_2', '5.77 11.35 speaker_1',
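One way to quantify the problem is to list the stretches of the timeline that no segment covers. The sketch below is only an illustration: it uses the four complete entries visible in the output above, and the helper names `parse_segments` and `find_gaps` are hypothetical, not part of NeMo. A gap is not necessarily an error, since a VAD is expected to drop non-speech, but it shows where to listen for missing speech.

```python
def parse_segments(lines):
    """Parse 'start end speaker' strings into sorted (start, end, speaker) tuples."""
    segments = []
    for line in lines:
        start, end, speaker = line.split()
        segments.append((float(start), float(end), speaker))
    return sorted(segments)


def find_gaps(segments, min_gap=0.0):
    """Return (gap_start, gap_end) pairs that no segment covers."""
    gaps = []
    if not segments:
        return gaps
    covered_until = segments[0][1]
    for start, end, _ in segments[1:]:
        if start - covered_until > min_gap:
            gaps.append((covered_until, start))
        # Track the furthest end seen so overlapping segments are handled.
        covered_until = max(covered_until, end)
    return gaps


# The four complete entries from the output shown in this issue.
labels = [
    '0.0 0.23 speaker_1',
    '1.55 1.99 speaker_2',
    '3.18 4.25 speaker_2',
    '5.77 11.35 speaker_1',
]

segments = parse_segments(labels)
gaps = find_gaps(segments)
unlabeled = sum(end - start for start, end in gaps)

for start, end in gaps:
    print(f"no speaker label from {start:.2f}s to {end:.2f}s")
print(f"total unlabeled time: {unlabeled:.2f}s")
```

Comparing these gaps against the audio tells you whether the dropped regions are real speech (pointing at the VAD) or silence (expected behavior).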

Steps/Code to reproduce bug

from nemo.collections.asr.models import ClusteringDiarizer
sd_model = ClusteringDiarizer(cfg=cfg)
sd_model.diarize()

It generates an RTTM file with speaker labels and timestamps, and the file itself is wrong. I tried changing multiple cfg parameters, but the output did not change.

Expected behavior

The chunks should be continuous, one after another, so that no actual speech data is lost.

Environment overview (please complete the following information)

  • Environment location: Colab
  • Method of NeMo install: pip install from git, as per the tutorials and the git repository
  • Docker: no

Environment details

If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:

  • OS version Windows 10
  • PyTorch version 1.11
  • Python version 3.7

ddebnath228, Jul 10 '22 09:07

diarizationExample.zip: I have uploaded the WAV file as well.

ddebnath228, Jul 10 '22 09:07

The VAD and speaker embedding extractor models you used are outdated. Which NeMo version are you using? Please use vad_telephony_marblenet for VAD and titanet_large as the speaker embedding extractor.
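As a rough sketch of what that change could look like, the snippet below merges the two suggested model names into a diarizer config. The nested key layout (`diarizer.vad.model_path`, `diarizer.speaker_embeddings.model_path`) is an assumption based on NeMo's diarization inference configs and may differ between versions; `apply_overrides` and `old_cfg` are hypothetical stand-ins written with plain dicts so the example runs without NeMo installed.

```python
# Assumed key layout; check the config shipped with your NeMo version.
overrides = {
    "diarizer": {
        "vad": {"model_path": "vad_telephony_marblenet"},
        "speaker_embeddings": {"model_path": "titanet_large"},
    }
}


def apply_overrides(base, extra):
    """Recursively merge `extra` into a copy of `base`, leaving other keys intact."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Placeholder for the existing config; the old VAD name is illustrative only.
old_cfg = {
    "diarizer": {
        "out_dir": "outputs",
        "vad": {"model_path": "old_vad_model"},
        "speaker_embeddings": {"model_path": "speakerdiarization_speakernet"},
    }
}

new_cfg = apply_overrides(old_cfg, overrides)
print(new_cfg["diarizer"]["vad"]["model_path"])
print(new_cfg["diarizer"]["speaker_embeddings"]["model_path"])
```

With a real NeMo config object, the same two values would be set on the loaded config before it is passed to `ClusteringDiarizer(cfg=cfg)`.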

nithinraok, Jul 11 '22 18:07

This does not seem to be a bug or issue in the NeMo pipeline, so I have changed it to a question. In the worst case, this could be the limit of the model's performance.

tango4j, Aug 05 '22 19:08