
Trying to extract embeddings

Open hmehdi515 opened this issue 1 year ago • 3 comments

Hi,

I am trying to run a pipeline to extract speaker embeddings.

The pipeline I am running is the one in the README:

import rx.operators as ops
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding

segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
mic = MicrophoneAudioSource()

stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift
    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))

mic.read()

However, SegmentationModel has no sample_rate attribute:

    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),

    Traceback (most recent call last):
      File "T:\Projects\endospeech_RD\IdentifySpeechToText\obtain_embeddings.py", line 11, in <module>
        dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
    AttributeError: 'SegmentationModel' object has no attribute 'sample_rate'

So I tried replacing it with

    dops.rearrange_audio_stream(sample_rate=44100),

and all I get from the output is:

# (batch_size, num_speakers, embedding_dim)
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
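
A side note on the hardcoded rate: as far as I know, pyannote's pretrained segmentation and embedding models are trained on 16 kHz audio, so `sample_rate=44100` probably feeds them mis-scaled chunks even though the code runs. If the microphone really captures at 44.1 kHz, one option is to resample each chunk down to 16 kHz first. A minimal numpy sketch using linear interpolation (the 440 Hz tone is just a stand-in for mic audio; a proper polyphase resampler such as `scipy.signal.resample_poly` or torchaudio's would be preferable in practice):

```python
import numpy as np

src_rate, dst_rate = 44_100, 16_000
duration = 5.0  # one 5 s chunk, matching the pipeline's window

# Synthetic stand-in for a 5 s microphone chunk captured at 44.1 kHz.
t_src = np.arange(int(src_rate * duration)) / src_rate
chunk = np.sin(2 * np.pi * 440.0 * t_src)

# Naive linear-interpolation resampling down to the model's expected rate.
t_dst = np.arange(int(dst_rate * duration)) / dst_rate
resampled = np.interp(t_dst, t_src, chunk)

print(len(chunk), len(resampled))  # 220500 80000
```

The same 5 seconds of audio is 220500 samples at 44.1 kHz but 80000 samples at 16 kHz, which is why hardcoding the wrong rate silently changes how much real time each "5 s" window covers.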

I'm not sure why it is detecting 3 speakers when I am the only one talking. The entire output confuses me.

Any help is appreciated.

Edit: I did come across https://github.com/juanmc2005/diart/issues/214, but I'm still not sure how to actually perform the embedding extraction.

Edit 2: Taking out .shape

    ).subscribe(on_next=lambda emb: print(emb))

does print out values:

tensor([[[-0.0517, -0.0178, -0.0477,  ..., -0.0572, -0.0540, -0.0226],
         [-0.0517, -0.0178, -0.0477,  ..., -0.0572, -0.0540, -0.0226],
         [-0.0517, -0.0178, -0.0477,  ..., -0.0572, -0.0540, -0.0226]]])
tensor([[[-0.0507, -0.0086, -0.0534,  ..., -0.0544, -0.0962,  0.0316],
         [-0.0571, -0.0187, -0.0451,  ..., -0.0532, -0.0596, -0.0159],
         [-0.0571, -0.0187, -0.0451,  ..., -0.0532, -0.0596, -0.0159]]])
tensor([[[-0.0604, -0.0138, -0.0483,  ..., -0.0615, -0.0730, -0.0237],
         [-0.0603, -0.0138, -0.0479,  ..., -0.0614, -0.0728, -0.0243],
         [-0.0603, -0.0138, -0.0479,  ..., -0.0614, -0.0728, -0.0243]]])

hmehdi515, May 28 '24 20:05