Trying to extract embeddings
Hi,

I am trying to run a pipeline to extract embeddings. The pipeline I am running is the one in the README:
```python
import rx.operators as ops
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding

segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
mic = MicrophoneAudioSource()

stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift
    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))

mic.read()
```
However, `SegmentationModel` has no `sample_rate` attribute, so this line fails:

```
Traceback (most recent call last):
  File "T:\Projects\endospeech_RD\IdentifySpeechToText\obtain_embeddings.py", line 11, in <module>
    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
AttributeError: 'SegmentationModel' object has no attribute 'sample_rate'
```
So I tried replacing it with a hardcoded sample rate:

```python
dops.rearrange_audio_stream(sample_rate=44100),
```
and all I get from the output is:

```
# (batch_size, num_speakers, embedding_dim)
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
...
```

(the same shape repeated for every chunk)
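(Side note: I picked 44100 arbitrarily. I believe pyannote models are trained on 16 kHz audio, so 16000 may actually be the right value here, though I have not confirmed this. In the meantime, a `getattr` fallback at least avoids the crash. A minimal sketch with a stand-in class, since I can't reproduce the real model object here:)

```python
class _FakeSegModel:
    """Stand-in for segmentation.model, which lacks `sample_rate`."""
    pass

# Fall back to a default when the attribute is missing.
# 16000 is my guess at the model's expected rate, not a confirmed value.
sample_rate = getattr(_FakeSegModel(), "sample_rate", 16000)
print(sample_rate)  # 16000
```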
Not sure why it is detecting 3 speakers when I am the only one talking. The entire output confuses me.

Any help is appreciated.
Edit: I did come across https://github.com/juanmc2005/diart/issues/214, but I am still not sure how to actually perform the embedding extraction.
Edit 2: Taking out `.shape` does print out values:

```python
).subscribe(on_next=lambda emb: print(emb))
```

```
tensor([[[-0.0517, -0.0178, -0.0477, ..., -0.0572, -0.0540, -0.0226],
         [-0.0517, -0.0178, -0.0477, ..., -0.0572, -0.0540, -0.0226],
         [-0.0517, -0.0178, -0.0477, ..., -0.0572, -0.0540, -0.0226]]])
tensor([[[-0.0507, -0.0086, -0.0534, ..., -0.0544, -0.0962,  0.0316],
         [-0.0571, -0.0187, -0.0451, ..., -0.0532, -0.0596, -0.0159],
         [-0.0571, -0.0187, -0.0451, ..., -0.0532, -0.0596, -0.0159]]])
tensor([[[-0.0604, -0.0138, -0.0483, ..., -0.0615, -0.0730, -0.0237],
         [-0.0603, -0.0138, -0.0479, ..., -0.0614, -0.0728, -0.0243],
         [-0.0603, -0.0138, -0.0479, ..., -0.0614, -0.0728, -0.0243]]])
```
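My working theory is that the segmentation model always allocates a fixed number of local speaker slots (three here), and inactive slots just get filler rows, which would explain the duplicated embeddings above. If that's right, something like the following should let me keep only the active speakers. This is an untested sketch with made-up inputs; `filter_active_speakers` is my own helper, not part of diart:

```python
import numpy as np

def filter_active_speakers(seg, emb, threshold=0.5):
    """Keep embeddings only for speaker slots that actually spoke.

    seg: (frames, speakers) activation probabilities from segmentation.
    emb: (speakers, dim) embeddings for the same chunk.
    A slot counts as active if its peak activation exceeds `threshold`.
    """
    active = seg.max(axis=0) > threshold  # (speakers,) boolean mask
    return emb[active]

# Toy example: 3 local speaker slots, only slot 0 actually speaks.
seg = np.array([[0.9, 0.1, 0.0],
                [0.8, 0.2, 0.1]])              # (frames=2, speakers=3)
emb = np.arange(6, dtype=float).reshape(3, 2)  # (speakers=3, dim=2)
print(filter_active_speakers(seg, emb).shape)  # (1, 2)
```

In the real pipeline I would presumably feed it the segmentation output for each chunk together with `emb[0]` (dropping the batch dimension). Is this the right way to think about it?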