Speaker Verification

Open iddqd2d opened this issue 3 years ago • 0 comments

Hi!

I trained the model using: https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb

import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf
import torch
import pytorch_lightning as pl
import os

speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.load_from_checkpoint('/home/denis/ttt/result/TitaNet/2022-08-05_08-47-00/checkpoints/TitaNet--val_loss=2.3548-epoch=9-last.ckpt')
#decision = speaker_model.verify_speakers('/home/denis/ttt/data/an4/wav/an4_clstk/mjes/cen7-mjes-b.wav','/home/denis/ttt/data/an4/wav/an4_clstk/mjes/cen2-mjes-b.wav')
decision = speaker_model.verify_speakers('/home/denis/ttt/data/an4/wav/an4_clstk/mjes/cen7-mjes-b.wav','data/an4/wav/an4test_clstk/fcaw/an406-fcaw-b.wav')

print(decision)

It works!

I can compare two files. Is there a method to compare all my speakers (many files) with another single file? Do I need to use a loop or another method?

iddqd2d avatar Aug 09 '22 08:08 iddqd2d

You can use this function to get batch embeddings https://github.com/NVIDIA/NeMo/blob/4cd9b3449cbfedc671348fbabbe8e3a55fbd659d/nemo/collections/asr/models/label_models.py#L420

Once you have the embeddings, you can compare them using a cosine similarity score. For example, see this script for how it's done: https://github.com/NVIDIA/NeMo/blob/4cd9b3449cbfedc671348fbabbe8e3a55fbd659d/examples/speaker_tasks/recognition/voxceleb_eval.py#L73
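As a rough illustration of the comparison step (not the linked script itself), the sketch below scores one query embedding against several enrolled embeddings and ranks them. The names `cosine_similarity`, `score_against_all`, `enrolled`, and `query` are hypothetical, and the 3-dimensional toy vectors stand in for real 192-dimensional embeddings. It uses only the standard library; NumPy arrays can be passed in the same way.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def score_against_all(query, enrolled):
    """Return (name, score) pairs, highest similarity first."""
    scores = [(name, cosine_similarity(query, emb)) for name, emb in enrolled.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy data (assumption): in practice these would be the embeddings
# extracted by the model, one entry per enrolled file or speaker.
enrolled = {
    "spk_a": [1.0, 0.0, 0.0],
    "spk_b": [0.0, 1.0, 0.0],
}
query = [0.9, 0.1, 0.0]

ranking = score_against_all(query, enrolled)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

A single loop over the enrolled embeddings is enough here; the decision threshold applied to the scores depends on the model and data and is not shown.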

nithinraok avatar Aug 12 '22 00:08 nithinraok

Thanks. It works! I have two audio files from the same speaker, and I get an embedding for each audio file in a single embedding file (embeddings.pkl). Should I merge the embeddings if they belong to the same speaker? Sorry for the stupid question.

How long (in seconds) should the audio file be?

iddqd2d avatar Aug 12 '22 09:08 iddqd2d

It depends on your use case. You could try either averaging the cosine scores or averaging the embeddings of all utterances for each speaker (if you have many samples per speaker).

There is no constraint on the duration of the file. It can fall in the range of (1 sec, 20 sec] or be longer than that.
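The two options can be sketched as follows, assuming each speaker has a list of per-utterance embeddings. The function names and the 2-dimensional toy vectors are hypothetical and only illustrate the difference between the approaches.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_embedding(embeddings):
    """Element-wise mean of several equal-length embeddings."""
    count = len(embeddings)
    return [sum(values) / count for values in zip(*embeddings)]

def score_with_averaged_embedding(query, speaker_embeddings):
    """Option 1: average the speaker's embeddings, then score once."""
    return cosine(query, mean_embedding(speaker_embeddings))

def score_with_averaged_scores(query, speaker_embeddings):
    """Option 2: score against every utterance, then average the scores."""
    scores = [cosine(query, emb) for emb in speaker_embeddings]
    return sum(scores) / len(scores)

# Toy data (assumption): two utterances of one speaker and one query.
speaker_embeddings = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 1.0]

print(score_with_averaged_embedding(query, speaker_embeddings))
print(score_with_averaged_scores(query, speaker_embeddings))
```

The two options generally give different numbers, so a decision threshold tuned for one should not be reused for the other without checking.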

nithinraok avatar Aug 12 '22 17:08 nithinraok

Another stupid question. Here are two embeddings of the same speaker. How do I average or combine them? Could you give me an example, please?

an4_clstk@[email protected]
[-7.7393e-02  7.1899e-02 -1.5686e-02  2.3895e-02  2.4094e-02  2.8503e-02
  1.6495e-02  6.8893e-03  4.8065e-02 -1.1658e-02  7.3853e-02  7.6904e-02
 -3.3722e-02 -2.4994e-02  8.6426e-02  8.9493e-03 -3.0289e-02 -1.7603e-01
 -2.8412e-02 -1.1163e-01 -1.4580e-02  5.7373e-02 -6.9519e-02 -1.2688e-02
  6.5857e-02 -5.6091e-02  2.1057e-02 -8.9600e-02 -2.0309e-02 -1.7685e-02
 -1.5759e-01  5.5298e-02  5.1880e-02  1.0577e-01 -4.3427e-02 -1.8661e-02
  3.4790e-02 -2.6215e-02  5.2917e-02 -8.8562e-02 -7.4341e-02 -4.7485e-02
 -3.2043e-02 -3.3203e-02 -8.2153e-02  5.3162e-02 -9.3628e-02  4.1733e-03
 -4.2725e-02 -5.4565e-02  6.8420e-02 -5.7190e-02  1.5507e-03 -1.0358e-01
  1.9092e-01  1.3824e-02 -1.2527e-02  2.7069e-02  7.2693e-02  5.3375e-02
 -4.8767e-02  1.0223e-01 -8.2626e-03  9.0759e-02  2.4155e-02  1.8036e-02
  1.8860e-02  4.6936e-02  1.0376e-02  5.4077e-02 -9.9060e-02 -7.7534e-04
  1.1310e-01  1.7290e-03  5.2185e-03 -5.7159e-02  4.2603e-02 -3.7689e-02
  7.6538e-02  2.5711e-02 -8.7524e-02  1.5388e-02  5.1758e-02  8.6365e-02
 -1.3733e-01 -1.2161e-02  6.2622e-02 -1.2561e-01  1.0175e-01 -1.5732e-02
  4.3030e-03  7.7637e-02  8.4991e-03  1.4913e-04  6.3721e-02 -3.8788e-02
 -5.3062e-03 -6.8237e-02  1.9775e-01  5.1941e-02 -8.9111e-02  6.6284e-02
 -2.0782e-02 -3.0121e-02 -1.4313e-02  5.2185e-03  1.0602e-01 -1.1987e-01
  4.7302e-02  4.9011e-02  1.7944e-01 -6.1951e-02 -2.2217e-02  5.2734e-02
  9.0637e-02 -4.0100e-02 -1.0185e-02 -1.3420e-02 -1.5211e-03 -4.0344e-02
  4.9255e-02 -1.2733e-02 -1.5100e-01 -1.9763e-01  6.9763e-02 -9.1309e-02
 -7.5722e-03 -1.3904e-01  1.6602e-02 -9.9976e-02 -6.8726e-02 -6.7749e-03
  1.9882e-02  4.7241e-02  2.9587e-02 -1.3049e-01 -6.9702e-02 -2.0386e-01
 -6.1188e-02 -1.0712e-02 -1.6006e-02 -8.2397e-02 -9.3384e-02 -1.1299e-02
 -1.5540e-01 -2.9129e-02  1.4252e-02  6.0425e-02 -6.0791e-02 -3.9062e-02
  4.9561e-02  9.6436e-03  9.6130e-03  4.5654e-02  4.0558e-02 -5.9937e-02
 -5.4291e-02  4.0894e-02  1.3390e-02  4.1580e-03 -8.9172e-02 -5.7465e-02
 -1.1377e-01 -1.2283e-02  3.7518e-03  9.0088e-02 -4.4189e-02  1.0181e-01
  1.4465e-01  7.9407e-02  1.6272e-01 -4.6051e-02 -4.8065e-02  5.6702e-02
 -2.6337e-02 -4.7485e-02  1.4514e-01  1.3359e-02 -5.5008e-03  3.1921e-02
 -1.6406e-01 -3.9597e-03 -3.4424e-02  6.3049e-02 -5.2002e-02  8.5083e-02
 -8.6212e-03 -1.0583e-01 -4.4136e-03  7.3730e-02 -1.3281e-01  7.2327e-03]
an4_clstk@[email protected]
[-0.09344    0.08215    0.003399  -0.0253     0.06885    0.03345
 -0.01657   -0.01843   -0.007008  -0.0709     0.0504     0.127
 -0.01033   -0.04016    0.04947    0.02902   -0.01639   -0.11926
 -0.01955   -0.0529    -0.04865    0.06335   -0.03406   -0.09686
  0.1472    -0.03247   -0.01927    0.0164    -0.009026   0.011894
 -0.1614    -0.0192     0.03717    0.11725   -0.06158   -0.04156
  0.13      -0.01598    0.03552   -0.07825   -0.0834    -0.06055
 -0.0801    -0.000677  -0.04745    0.0804    -0.0946    -0.009125
 -0.066     -0.05225    0.01304   -0.06027    0.0992    -0.1227
  0.1426    -0.02565   -0.0541    -0.001242   0.0856    -0.0356
 -0.03918    0.06076   -0.05447    0.03375   -0.00906    0.02576
 -0.02682    0.1121     0.04538    0.1519    -0.08435   -0.1095
  0.1168     0.00888    0.02394    0.04117    0.012436   0.01723
  0.1125    -0.01991   -0.0914    -0.01188    0.03168    0.03732
 -0.1384    -0.044      0.0551    -0.093      0.05374   -0.02217
 -0.003479   0.001745   0.02647    0.03424    0.08636   -0.02934
  0.03766   -0.11365    0.1236     0.0417    -0.0258     0.06604
 -0.0696    -0.0324     0.01909    0.001274   0.1032    -0.1181
 -0.05035    0.09766    0.1595    -0.1442    -0.0521     0.004784
  0.11255    0.011505  -0.05356    0.0358     0.00988   -0.002363
 -0.06055    0.02724   -0.1447    -0.2079     0.1046    -0.1378
 -0.0439    -0.0968     0.063     -0.0155    -0.1099    -0.00885
  0.0004249  0.0672     0.0638    -0.1141    -0.0401    -0.10675
 -0.002323   0.01955   -0.0448    -0.0671    -0.0749     0.03134
 -0.0753    -0.07947    0.05814    0.02565   -0.004845  -0.01746
  0.04117   -0.05093   -0.0349    -0.01585    0.03647   -0.067
 -0.03096    0.05692    0.011734  -0.0432    -0.06354   -0.0192
 -0.05814   -0.05106    0.07306    0.08093    0.001845   0.04974
  0.1781     0.08527    0.1061    -0.09827   -0.01003    0.1543
  0.04852   -0.05978    0.089      0.0758     0.01471   -0.0127
 -0.1364    -0.0579     0.00239    0.02454   -0.07983    0.05618
 -0.08746   -0.1178    -0.0962     0.0387    -0.122      0.02303  ]

For speaker verification, is it better to use embeddings from short or long audio files?

iddqd2d avatar Aug 15 '22 07:08 iddqd2d

Hi! Should I take the average of the first element of the first array and the first element of the second array?

For speaker verification, is it better to use embeddings from short or long audio files?

iddqd2d avatar Aug 17 '22 11:08 iddqd2d

In the above example, both are 192-dimensional vectors, so you can average them element-wise across the two utterances. You would get a single 192-dimensional embedding.

There is no constraint on the duration of the file. It can fall in the range of (1 sec, 20 sec] or be longer than that. As a rule of thumb, you can use about 5 sec.

nithinraok avatar Aug 17 '22 15:08 nithinraok

First element from an4_clstk@[email protected]: -7.7393e-02
First element from an4_clstk@[email protected]: -0.09344
Should I compute (-7.7393e-02 + -0.09344) / 2 and put the result into another array?

iddqd2d avatar Aug 18 '22 05:08 iddqd2d

Help with averaging please

iddqd2d avatar Aug 19 '22 07:08 iddqd2d

Yes, add both arrays element-wise (and divide by 2 if you want the mean); the result will be a 192-dimensional embedding.
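A minimal worked example of that arithmetic, using only the first three values of each embedding quoted earlier in this thread (the full vectors are combined the same way over all 192 positions). The helper names and the `probe` vector are hypothetical. The sketch also checks that the sum and the mean of the arrays give the same cosine score, since they differ only by a constant factor.

```python
import math

# First three of the 192 values from each embedding posted above.
emb_a = [-7.7393e-02, 7.1899e-02, -1.5686e-02]
emb_b = [-0.09344, 0.08215, 0.003399]

def elementwise_sum(vectors):
    """Add vectors position by position."""
    return [sum(values) for values in zip(*vectors)]

def elementwise_mean(vectors):
    """Average vectors position by position."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

speaker_sum = elementwise_sum([emb_a, emb_b])
speaker_mean = elementwise_mean([emb_a, emb_b])

# Hypothetical embedding of a new file to verify against the speaker.
probe = [-0.08, 0.07, 0.0]

print(speaker_mean)
print(cosine(probe, speaker_sum), cosine(probe, speaker_mean))
```

With real data, `emb_a` and `emb_b` would be the full arrays loaded from the embeddings file, and the combined vector keeps the same length as each input.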

nithinraok avatar Aug 19 '22 14:08 nithinraok