NeMo
Diarization | IndexError: shape mismatch: indexing tensors could not be broadcast together
Describe the bug
When running diarization on a specific file I'm getting:
IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [7557001], [7559750]
The same pipeline, config, and setup have worked for a couple of thousand other files, but on this file (and a few others) I'm now getting this error.
Steps/Code to reproduce bug
A shared Google Colab notebook (ipynb) shows how to reproduce the error; the steps are also listed below.
Steps:
Pre-steps:
Upload issue_file.wav, config.yaml and speech_timestamps.rttm to runtime (e.g. Colab)
Above-mentioned files can be obtained:
- issue_file.wav: Audio file
- config.yaml: diarization config
- speech_timestamps.rttm: speech timestamps rttm
Let me know if anything's missing or unclear.
!apt-get update && apt-get install -y libsndfile1 ffmpeg
!pip install nemo_toolkit['asr']
import os
import torch
import yaml
import json
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer


def diarize(workdir: str, rttm_filepath):
    manifest_path = os.path.join(workdir, "manifest.json")
    output_dir = os.path.join(workdir, "output")

    # Single-entry manifest describing the file to diarize.
    manifest = {
        'audio_filepath': '/content/issue_file.wav',
        'offset': 0,
        'duration': None,
        'label': 'infer',
        'text': '-',
        'num_speakers': None,
        'rttm_filepath': rttm_filepath,
        'uem_filepath': None,
    }

    # Load the diarization config and override the run-specific fields.
    with open('/content/config.yaml', "r") as config_file:
        config_dict = yaml.load(config_file, Loader=yaml.FullLoader)
    config = OmegaConf.create(config_dict['diarizer'])
    config.device = "cuda:0" if torch.cuda.is_available() else "cpu"
    config.diarizer.manifest_filepath = manifest_path
    # Use the speech segments from the RTTM instead of running a VAD model.
    config.diarizer.oracle_vad = True
    config.diarizer.speaker_embeddings.model_path = 'titanet_large'
    config.diarizer.out_dir = output_dir

    with open(manifest_path, "w") as manifest_file:
        json.dump(manifest, manifest_file)

    model = ClusteringDiarizer(cfg=config)
    model.diarize()


diarize('/content/', '/content/speech_timestamps.rttm')
This fails:
<ipython-input-7-4afd3b7bf15c> in diarize(workdir, rttm_filepath)
30
31 model = ClusteringDiarizer(cfg=config)
---> 32 model.diarize()
/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/clustering_diarizer.py in diarize(self, paths2audio_files, batch_size)
454
455 # Clustering
--> 456 all_reference, all_hypothesis = perform_clustering(
457 embs_and_timestamps=embs_and_timestamps,
458 AUDIO_RTTM_MAP=self.AUDIO_RTTM_MAP,
/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/speaker_utils.py in perform_clustering(embs_and_timestamps, AUDIO_RTTM_MAP, out_rttm_dir, clustering_params, device, verbose)
483 base_scale_idx = uniq_embs_and_timestamps['multiscale_segment_counts'].shape[0] - 1
484
--> 485 cluster_labels = speaker_clustering.forward_infer(
486 embeddings_in_scales=uniq_embs_and_timestamps['embeddings'],
487 timestamps_in_scales=uniq_embs_and_timestamps['timestamps'],
/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/longform_clustering.py in forward_infer(self, embeddings_in_scales, timestamps_in_scales, multiscale_segment_counts, multiscale_weights, oracle_num_speakers, max_rp_threshold, max_num_speakers, enhanced_count_thres, sparse_search_volume, fixed_thres, chunk_cluster_count, embeddings_per_chunk)
407 )
408 else:
--> 409 cluster_labels = self.speaker_clustering.forward_infer(
410 embeddings_in_scales=embeddings_in_scales,
411 timestamps_in_scales=timestamps_in_scales,
/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/offline_clustering.py in forward_infer(self, embeddings_in_scales, timestamps_in_scales, multiscale_segment_counts, multiscale_weights, oracle_num_speakers, max_num_speakers, max_rp_threshold, enhanced_count_thres, sparse_search_volume, fixed_thres, kmeans_random_trials)
1376 )
1377
-> 1378 return self.forward_unit_infer(
1379 mat=mat,
1380 oracle_num_speakers=oracle_num_speakers,
/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/offline_clustering.py in forward_unit_infer(self, mat, oracle_num_speakers, max_num_speakers, max_rp_threshold, sparse_search_volume, est_num_of_spk_enhanced, fixed_thres, kmeans_random_trials)
1225 if mat.shape[0] > self.min_samples_for_nmesc:
1226 est_num_of_spk, p_hat_value = nmesc.forward()
-> 1227 affinity_mat = getAffinityGraphMat(mat, p_hat_value)
1228 else:
1229 nmesc.fixed_thres = max_rp_threshold
/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/offline_clustering.py in getAffinityGraphMat(affinity_mat_raw, p_value)
349 symmetrize the binarized graph matrix.
350 """
--> 351 X = affinity_mat_raw if p_value <= 0 else getKneighborsConnections(affinity_mat_raw, p_value)
352 symm_affinity_mat = 0.5 * (X + X.T)
353 return symm_affinity_mat
/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/offline_clustering.py in getKneighborsConnections(affinity_mat, p_value, mask_method)
332 indices_col = torch.arange(dim[1]).repeat(p_value, 1).T.flatten()
333 if mask_method == 'binary' or mask_method is None:
--> 334 binarized_affinity_mat[indices_row, indices_col] = (
335 torch.ones(indices_row.shape[0]).to(affinity_mat.device).half()
336 )
IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [7557001], [7559750]
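For what it's worth, the two sizes in the message factor in a suggestive way. Line 332 of the trace builds indices_col with dim[1] * p_value entries, so 7559750 should be a multiple of the matrix width. The sketch below assumes (this part is not visible in the trace) that indices_row holds n * n entries because the per-row neighbour list cannot be longer than the n columns available. The helper name decompose is made up for illustration.

```python
import math

# Sizes reported in the IndexError, in the order the two index tensors are used.
ROW_LEN, COL_LEN = 7557001, 7559750


def decompose(row_len: int, col_len: int) -> tuple[int, int, int]:
    """Recover (n, p_value, remainder) under the assumption row_len == n * n."""
    n = math.isqrt(row_len)
    if n * n != row_len:
        raise ValueError("row_len is not a perfect square; the assumption does not hold")
    # indices_col has dim[1] * p_value entries (line 332 of the trace).
    p_value, remainder = divmod(col_len, n)
    return n, p_value, remainder


n, p_value, remainder = decompose(ROW_LEN, COL_LEN)
print(n, p_value, remainder)  # 2749 2750 0
```

Under that assumption the affinity matrix is 2749 x 2749 and p_value is 2750, one more than the number of segments. This is only one reading of the numbers, not a confirmed diagnosis.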
Expected behavior
Diarization shouldn't fail on a seemingly non-corrupt audio-file. This config has been tested multiple times before on other files without any issues.
Environment overview (please complete the following information)
- Environment location: Google Colab ( Freshly created )
- Method of NeMo install:
!apt-get update && apt-get install -y libsndfile1 ffmpeg
!pip install nemo_toolkit['asr']
Environment details
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
=============
2.1.0+cu121
=============
Python 3.10.12
Hi.
Let us test on the wav file you provided.
This is a new type of error we have never encountered.
It appears the p_value value in this line is causing this error.
I will follow your settings and check what is causing this error.
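To illustrate how p_value could produce this kind of mismatch, here is a dependency-free sketch that mimics the index construction with plain lists instead of tensors. It is not NeMo's code: the column side follows line 332 of the trace (each column index repeated p_value times), while the row side is an assumption (top-p_value neighbours per row, where slicing silently caps at the row length). The names knn_index_lengths and clamp_p_value are hypothetical.

```python
def knn_index_lengths(n: int, p_value: int) -> tuple[int, int]:
    """Return (len(indices_row), len(indices_col)) for a toy n x n affinity matrix."""
    # Toy symmetric affinity: closer indices are more similar.
    affinity = [[1.0 / (1 + abs(i - j)) for j in range(n)] for i in range(n)]

    # Row side (assumed): top-p_value neighbours of every row.
    # Slicing with [:p_value] never yields more than n entries per row.
    indices_row = [
        j
        for row in affinity
        for j in sorted(range(n), key=lambda k: row[k], reverse=True)[:p_value]
    ]

    # Column side: every column index repeated p_value times,
    # the list equivalent of torch.arange(n).repeat(p_value, 1).T.flatten().
    indices_col = [c for c in range(n) for _ in range(p_value)]

    return len(indices_row), len(indices_col)


def clamp_p_value(n: int, p_value: int) -> int:
    """Hypothetical guard: never ask for more neighbours than there are columns."""
    return min(p_value, n)


print(knn_index_lengths(3, 2))                    # (6, 6): lengths agree
print(knn_index_lengths(3, 4))                    # (9, 12): lengths differ
print(knn_index_lengths(3, clamp_p_value(3, 4)))  # (9, 9): lengths agree again
```

With p_value larger than the matrix width, the two index lists no longer have the same length, which is the situation the IndexError describes. Whether clamping is the right fix, or whether the estimate of p_value itself should be bounded upstream, is a separate question.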
@tango4j any updates on this issue?
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.