NeMo ZeroDivisionError when diarizing a short file

Describe the bug

I am getting a ZeroDivisionError: float division by zero when diarizing a very short file with prior voice activity detection:

File "nemo/collections/asr/models/clustering_diarizer.py", line 467, in diarize
    return score_labels(
File "nemo/collections/asr/metrics/der.py", line 171, in score_labels
    CER = metric['confusion'] / metric['total']
ZeroDivisionError: float division by zero

I've diarized lots of files of various lengths, and this error seems to occur only on really short files that have identified speech in them. I understand that diarization of such short files is fundamentally questionable, but the type of error suggests that this might be an edge case bug worth reporting.

Steps/Code to reproduce bug

Please refer to this Google Colab notebook to reproduce the bug.

The files to reproduce the bug can be found here.

!apt-get update && apt-get install -y libsndfile1 ffmpeg
!pip install nemo_toolkit['asr']

import os
import torch
import yaml
import json
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

def diarize(workdir: str, rttm_filepath):

    manifest_path  = os.path.join(workdir, "manifest.json")
    output_dir     = os.path.join(workdir, "output")

    manifest =  {
        'audio_filepath': '/content/diarization_test_file.wav',
        'offset': 0,
        'duration': None,
        'label': 'infer',
        'text': '-',
        'num_speakers': None,
        'rttm_filepath': rttm_filepath,
        'uem_filepath': None,
    }

    with open('/content/config.yaml', "r") as config_file:
        config_dict = yaml.load(config_file, Loader=yaml.FullLoader)

    config = OmegaConf.create(config_dict['diarizer'])
    config.device = "cuda:0" if torch.cuda.is_available() else "cpu"
    config.diarizer.manifest_filepath = manifest_path
    config.diarizer.oracle_vad = True
    config.diarizer.speaker_embeddings.model_path = 'titanet_large'
    config.diarizer.out_dir = output_dir


    with open(manifest_path, "w") as manifest_file:
        json.dump(manifest, manifest_file)

    model = ClusteringDiarizer(cfg=config)
    model.diarize()

diarize('/content/', '/content/speech_timestamps.rttm')

/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/metrics/der.py in score_labels(AUDIO_RTTM_MAP, all_reference, all_hypothesis, collar, ignore_overlap, verbose)
    169 
    170         DER = abs(metric)
--> 171         CER = metric['confusion'] / metric['total']
    172         FA = metric['false alarm'] / metric['total']
    173         MISS = metric['missed detection'] / metric['total']

ZeroDivisionError: float division by zero

Expected behavior

Diarization should not fail on files of short length.

Environment overview (please complete the following information)

Environment location: Google Colab
Method of NeMo install:

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install nemo_toolkit['asr']

Environment details

NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
2.1.0+cu121
Python 3.10.12

Feb 14 '24 10:02 RafalCer

Hi @tango4j, would it be possible for you to have a look at this error? Thanks!

Mar 01 '24 08:03 RafalCer

This error seems to happen because the audio contains 0 seconds of RTTM target evaluation duration. We cannot reproduce the error unless we have access to the audio input. @RafalCer, could you check the RTTM file you provided? If it contains a 0-second duration, it will throw this error. If possible, can you copy and paste the RTTM file for this audio?
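One way to run this check is to sum the duration field over the SPEAKER lines of the file. The sketch below assumes the standard RTTM layout, in which the fourth field is the onset and the fifth is the duration in seconds; the helper name total_rttm_duration is hypothetical and not part of NeMo. It reports the raw reference duration only, before any scoring options are applied.

```python
from typing import Iterable


def total_rttm_duration(lines: Iterable[str]) -> float:
    """Sum the duration field (fifth column) of every SPEAKER line."""
    total = 0.0
    for line in lines:
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 5 or fields[0] != "SPEAKER":
            continue
        total += float(fields[4])
    return total
```

An open file object can be passed directly, for example total_rttm_duration(open(rttm_filepath)), since iterating a file yields its lines.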

Mar 04 '24 22:03 tango4j

Thanks for your response!

Please note that you can find the audio, rttm and config here: https://drive.google.com/drive/folders/1e2b14isJ8CQ0NR3mGqHcmkLlo4D2T38E?usp=drive_link.

Please let me know should the link not work.

The content of rttm file is as follows:

SPEAKER 0 1 0.29 0.48 <NA> <NA> <NA> <NA> <NA>

It does have speech, albeit a very short segment, which is too short for diarization either way. However, the error is somewhat inconvenient when diarization is part of a pipeline used for files of any duration. Is it possible to resolve it somehow without modifying the source code?
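A possible stopgap that leaves the NeMo source untouched is to catch the exception at the call site. This is only a sketch: run_tolerating_empty_score is a hypothetical helper, and whether the predicted RTTM files have already been written to the output directory when the exception is raised is an assumption that should be checked before relying on it.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_tolerating_empty_score(diarize_fn: Callable[[], T]) -> Optional[T]:
    """Call diarize_fn and return its result, or None if scoring divides by zero."""
    try:
        return diarize_fn()
    except ZeroDivisionError:
        # Assumption: the failure comes from the scoring step, not from
        # the diarization itself, so the pipeline can carry on.
        return None
```

In the reproduction script this would replace the bare model.diarize() call with run_tolerating_empty_score(model.diarize).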

Mar 05 '24 08:03 RafalCer

Thanks for sharing the samples. We have plenty of issue traffic, so it will take some time to try the samples and fix this, but we will definitely try this sample to see what the issue is. It seems the diarization pipeline currently cannot handle extremely short samples (0.5 seconds is shorter than the shortest segment length). Also, could you please clarify what you mean by "the error is somewhat inconvenient"? Do you mean that the way the error is calculated is inconvenient?

Mar 05 '24 19:03 tango4j

Thank you so much, I really appreciate your time!

I see, the short segment length might indeed be causing this, as then metric['total'] is probably 0, thus the ZeroDivisionError: float division by zero.

Could you please suggest what would be a good solution for this - should the evaluation not be called when the sample is too short, or should some value other than 0 be used for the calculation? I will try to look into this and post a PR if I manage to find the time.

By saying the error is somewhat inconvenient, I meant that the error in this case seems to arise during evaluation of the diarization rather than the diarization itself (I might be wrong on this one). Perhaps, instead of failing on short segments, the diarizer could assign the short segment to one speaker by default?
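To illustrate the second option, here is a minimal sketch of a guarded rate computation. It is not NeMo's actual code or fix: safe_rates is a hypothetical function, and a plain dict stands in for the metric object, using the same keys that appear in the traceback. Returning NaN when nothing was scored is one possible convention; skipping evaluation entirely would be another.

```python
import math
from typing import Dict


def safe_rates(metric: Dict[str, float]) -> Dict[str, float]:
    """Return confusion, false-alarm and miss rates, or NaN when total is 0."""
    total = metric["total"]
    if total == 0:
        # No reference speech was scored, so the rates are undefined.
        return {"CER": math.nan, "FA": math.nan, "MISS": math.nan}
    return {
        "CER": metric["confusion"] / total,
        "FA": metric["false alarm"] / total,
        "MISS": metric["missed detection"] / total,
    }
```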

Mar 07 '24 09:03 RafalCer

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Apr 07 '24 01:04 github-actions[bot]

Hi, have you managed to solve this bug? I have the same problem here. It is caused by a wav file like this:

reference RTTM:
SPEAKER e1d3d1e5-fd82-a0da-20de-51a0a558a8ee 1 0.000 1.800 <NA> <NA> 1 <NA> <NA>
SPEAKER e1d3d1e5-fd82-a0da-20de-51a0a558a8ee 1 0.400 1.100 <NA> <NA> 0 <NA> <NA>

predict RTTM:
SPEAKER e1d3d1e5-fd82-a0da-20de-51a0a558a8ee 1 0.060 0.270 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER e1d3d1e5-fd82-a0da-20de-51a0a558a8ee 1 0.780 0.430 <NA> <NA> speaker_1 <NA> <NA>

When I set config.diarizer.collar = 0, the problem disappears. So I believe that the "collar" of pyannote.metrics.DiarizationErrorRate is larger than the duration of a predicted segment, which causes this bug. Do you have any idea how to solve this? Simply setting collar=0 is an acceptable solution, but the DER will always be larger than the DER computed with the default collar=0.25*2.
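The arithmetic behind the collar effect can be sketched for a single isolated reference segment. The sketch assumes the collar excludes 0.25 s on each side of every reference boundary (the collar=0.25*2 mentioned above), so such a segment loses 0.5 s in total; scored_duration is a hypothetical helper, not a pyannote or NeMo function.

```python
def scored_duration(duration: float, collar: float = 0.25) -> float:
    """Duration left to score after trimming `collar` seconds from each end."""
    return max(0.0, duration - 2 * collar)


# The 0.48 s segment from the original report is shorter than the 0.5 s
# removed by the collar, so nothing is left to score.
print(scored_duration(0.48))              # prints 0.0
# With the collar disabled, the whole segment is scored.
print(scored_duration(0.48, collar=0.0))  # prints 0.48
```

Under that assumption, any isolated reference segment of 0.5 s or less contributes nothing to the scored total, which matches the observation that the error goes away with a collar of 0.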

Apr 09 '24 03:04 hhd52859

Hi, sorry for such a late response.

Sorry, I did not have the time to look into this issue myself. But thank you so much for the tip regarding the collar value. It has indeed solved the issue on short files.

Apr 24 '24 14:04 RafalCer
