Trying to extract embeddings
Hi,
I am trying to run a pipeline to extract embeddings. The pipeline I am running is the one in the README:
import rx.operators as ops
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding
segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
mic = MicrophoneAudioSource()
stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift
    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))
mic.read()
However, SegmentationModel has no attribute sample_rate:
dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
Traceback (most recent call last):
File "T:\Projects\endospeech_RD\IdentifySpeechToText\obtain_embeddings.py", line 11, in <module>
dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
AttributeError: 'SegmentationModel' object has no attribute 'sample_rate'
So I tried replacing it with
dops.rearrange_audio_stream(sample_rate=44100),
and all I get as output is:
# (batch_size, num_speakers, embedding_dim)
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
I'm not sure why it is detecting 3 speakers when I am the only one talking. The entire output confuses me.
Any help is appreciated.
Edit: I did come across https://github.com/juanmc2005/diart/issues/214, but I am still not sure how to actually perform the embedding extraction.
Edit 2: Replacing the last line with
).subscribe(on_next=lambda emb: print(emb))
(i.e. taking out .shape) does print out values:
tensor([[[-0.0517, -0.0178, -0.0477, ..., -0.0572, -0.0540, -0.0226],
[-0.0517, -0.0178, -0.0477, ..., -0.0572, -0.0540, -0.0226],
[-0.0517, -0.0178, -0.0477, ..., -0.0572, -0.0540, -0.0226]]])
tensor([[[-0.0507, -0.0086, -0.0534, ..., -0.0544, -0.0962, 0.0316],
[-0.0571, -0.0187, -0.0451, ..., -0.0532, -0.0596, -0.0159],
[-0.0571, -0.0187, -0.0451, ..., -0.0532, -0.0596, -0.0159]]])
tensor([[[-0.0604, -0.0138, -0.0483, ..., -0.0615, -0.0730, -0.0237],
[-0.0603, -0.0138, -0.0479, ..., -0.0614, -0.0728, -0.0243],
[-0.0603, -0.0138, -0.0479, ..., -0.0614, -0.0728, -0.0243]]])
Hi @hmehdi515,
First of all, your sample rate should be 16000. The example in the README must be outdated; I removed the sample_rate attribute from the model to make it easier to integrate custom models. Would you mind creating a PR to fix the example? It would be greatly appreciated!
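For reference, here is a minimal sketch of what the stream setup would look like with the rate hard-coded, reusing the objects defined in your snippet above:

stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift;
    # the pretrained models expect 16kHz audio
    dops.rearrange_audio_stream(sample_rate=16000),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))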
Concerning the 3-speaker output: this is normal and depends on the maximum number of speakers the segmentation model can predict. In this case, the segmentation output is a matrix of shape (num_speakers=3, num_frames). To get the embeddings corresponding to "active" speakers, you should filter based on the segmentation activations. For example, in the diarization pipeline we use the tau_active threshold, which applies the following rule: if a predicted speaker S has at least one frame where the speech probability satisfies p(S) >= tau_active, then S is considered "active" and we keep its embedding.
Bear in mind that this is not necessarily the best rule for every use case, so I encourage you to try different alternatives.
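As a rough illustration of that rule (not diart's actual implementation; the shapes follow the description above, and the threshold value and function name are only illustrative):

import torch

def keep_active_embeddings(seg, emb, tau_active=0.5):
    # seg: (num_speakers, num_frames) speech probabilities from segmentation
    # emb: (num_speakers, embedding_dim) one embedding per predicted speaker
    # A speaker is "active" if at least one frame satisfies p(S) >= tau_active
    active = (seg >= tau_active).any(dim=1)
    return emb[active]

Note that the outputs in your pipeline carry a batch dimension, so you would index it out first (e.g. emb[0] for the (1, 3, 512) tensors printed above).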
Thanks for your help. I submitted a PR with some changes.
Do you know how to change num_speakers on SpeakerSegmentation? I know we can create a config for SpeakerDiarization, but I'm not sure if we can do something similar for SpeakerSegmentation.
Changing the number of speakers would require re-training the segmentation model, or fine-tuning it to produce a matrix of a different size (adding or removing speaker rows).