NeMo
Speaker Diarization with Marblenet and ClusterDiarizer issue
Describe the bug
I am not getting correct timestamps for speech segments, and many speech chunks are removed. I am using the pretrained MarbleNet and speakerdiarization_speakernet models. A lot of speech between the detected chunks is dropped, as shown below:
['0.0 0.23 speaker_1', '1.55 1.99 speaker_2', '3.18 4.25 speaker_2', '5.77 11.35 speaker_1',
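To put a number on the dropped audio, here is a minimal sketch that measures the silence between consecutive segments. It uses only the four labels visible above, and `find_gaps` is a hypothetical helper written for this illustration, not a NeMo function.

```python
def find_gaps(labels, min_gap=0.0):
    """Return (gap_start, gap_end, duration) for uncovered time between segments.

    Each label is a string of the form 'start end speaker', as in the output above.
    """
    segments = []
    for label in labels:
        start, end, speaker = label.split()
        segments.append((float(start), float(end), speaker))
    segments.sort(key=lambda seg: seg[0])

    gaps = []
    covered_until = segments[0][1] if segments else 0.0
    for start, end, _ in segments[1:]:
        duration = round(start - covered_until, 2)
        if duration > min_gap:
            gaps.append((covered_until, start, duration))
        # Track the furthest end seen so overlapping segments are not counted as gaps.
        covered_until = max(covered_until, end)
    return gaps


# Only the segments shown in the report; the full list is longer.
labels = [
    '0.0 0.23 speaker_1',
    '1.55 1.99 speaker_2',
    '3.18 4.25 speaker_2',
    '5.77 11.35 speaker_1',
]

gaps = find_gaps(labels)
total_gap = round(sum(duration for _, _, duration in gaps), 2)

for gap_start, gap_end, duration in gaps:
    print(f"no speech detected from {gap_start}s to {gap_end}s ({duration}s)")
print(f"total unlabelled time: {total_gap}s")
```

Comparing these gaps against the audio shows whether the VAD is dropping real speech or the recording genuinely contains pauses of that length.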
Steps/Code to reproduce bug
from nemo.collections.asr.models import ClusteringDiarizer
sd_model = ClusteringDiarizer(cfg=cfg)
sd_model.diarize()
It generates an RTTM file whose speaker labels and timestamps are themselves wrong. I tried changing multiple cfg parameters, but the output did not change.
Expected behavior
The chunks should be continuous, one after another, so that no actual speech data is lost.
Environment overview (please complete the following information)
- Environment location: Colab
- Method of NeMo install: pip install from Git, as per the tutorials and the Git repository
- Docker: no
Environment details
If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:
- OS version: Windows 10
- PyTorch version: 1.11
- Python version: 3.7
Additional context
diarizationExample.zip: the WAV file is uploaded too.
The VAD and speaker embedding extractor models you used are outdated. Which NeMo version are you using? Please use vad_telephony_marblenet for VAD and titanet_large for the speaker embedding extractor.
This does not seem to be a bug in the NeMo pipeline (changed to a question). In the worst case, this could be the limit of model performance.
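The model swap suggested above could look roughly like the sketch below. It assumes the diarizer config keeps the model names under `diarizer.vad.model_path` and `diarizer.speaker_embeddings.model_path`, as in the NeMo diarization tutorials; please check this against the YAML for your NeMo version. `use_recommended_models` is a hypothetical helper, and a plain dict stands in for the real config here so the snippet runs without NeMo installed.

```python
def use_recommended_models(cfg):
    """Point the config at the VAD and speaker embedding models named in the reply.

    Assumption: the key layout matches the diarization config used in the
    NeMo tutorials; item assignment works on a plain dict and should also
    work on an OmegaConf config.
    """
    cfg["diarizer"]["vad"]["model_path"] = "vad_telephony_marblenet"
    cfg["diarizer"]["speaker_embeddings"]["model_path"] = "titanet_large"
    return cfg


# Stand-in for the real config: only the two keys of interest are shown,
# and the VAD value is a placeholder rather than a real model name.
cfg = {
    "diarizer": {
        "vad": {"model_path": "<older MarbleNet VAD>"},
        "speaker_embeddings": {"model_path": "speakerdiarization_speakernet"},
    }
}

cfg = use_recommended_models(cfg)
print(cfg["diarizer"]["vad"]["model_path"])
print(cfg["diarizer"]["speaker_embeddings"]["model_path"])
```

With the real config loaded, the updated `cfg` would then be passed to `ClusteringDiarizer(cfg=cfg)` as in the reproduction steps.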