Trying to extract embeddings
Hi,
I am trying to run a pipeline to extract embeddings. The pipeline I am running is the one in the README:
import rx.operators as ops
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding
segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
mic = MicrophoneAudioSource()
stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift
    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))
mic.read()
However, SegmentationModel has no attribute sample_rate:
dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
Traceback (most recent call last):
File "T:\Projects\endospeech_RD\IdentifySpeechToText\obtain_embeddings.py", line 11, in <module>
dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
AttributeError: 'SegmentationModel' object has no attribute 'sample_rate'
So I tried replacing it with
dops.rearrange_audio_stream(sample_rate=44100),
and all I get as output is:
# (batch_size, num_speakers, embedding_dim)
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
I'm not sure why it is detecting 3 speakers when I am the only one talking. The entire output confuses me.
Any help is appreciated.
Edit: I did come across https://github.com/juanmc2005/diart/issues/214, but I am still not sure how to actually perform the embedding extraction.
Edit 2: Replacing the last line with
).subscribe(on_next=lambda emb: print(emb))
(i.e. taking out .shape) does print out values:
tensor([[[-0.0517, -0.0178, -0.0477, ..., -0.0572, -0.0540, -0.0226],
[-0.0517, -0.0178, -0.0477, ..., -0.0572, -0.0540, -0.0226],
[-0.0517, -0.0178, -0.0477, ..., -0.0572, -0.0540, -0.0226]]])
tensor([[[-0.0507, -0.0086, -0.0534, ..., -0.0544, -0.0962, 0.0316],
[-0.0571, -0.0187, -0.0451, ..., -0.0532, -0.0596, -0.0159],
[-0.0571, -0.0187, -0.0451, ..., -0.0532, -0.0596, -0.0159]]])
tensor([[[-0.0604, -0.0138, -0.0483, ..., -0.0615, -0.0730, -0.0237],
[-0.0603, -0.0138, -0.0479, ..., -0.0614, -0.0728, -0.0243],
[-0.0603, -0.0138, -0.0479, ..., -0.0614, -0.0728, -0.0243]]])
Hi @hmehdi515,
First of all, your sample rate should be 16000. The example in the README must be outdated; I removed the sample_rate attribute from the model to make it easier to integrate custom models. Would you mind creating a PR to fix the example? It would be greatly appreciated!
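For reference, here is a minimal sketch of what the stream setup would look like with the rate hard-coded, reusing the objects defined in your snippet above:

stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift;
    # the pretrained models expect 16kHz audio
    dops.rearrange_audio_stream(sample_rate=16000),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))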
Concerning the 3-speaker output: this is normal and depends on the maximum number of speakers the segmentation model can predict. In this case, the segmentation output is a matrix of shape (num_speakers=3, num_frames). To get the embeddings corresponding to "active" speakers, you should filter based on the segmentation activations. For example, in the diarization pipeline we use the tau_active threshold, which applies the following rule: if a predicted speaker S has at least one frame where the speech probability satisfies p(S) >= tau_active, then S is considered "active" and we keep its embedding.
Bear in mind that this is not necessarily the best rule for every use case, so I encourage you to try different alternatives.
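As a rough illustration of that rule (not diart's actual implementation; the shapes follow the description above, and the threshold value and function name are only illustrative):

import torch

def keep_active_embeddings(seg, emb, tau_active=0.5):
    # seg: (num_speakers, num_frames) speech probabilities from segmentation
    # emb: (num_speakers, embedding_dim) one embedding per predicted speaker
    # A speaker is "active" if at least one frame satisfies p(S) >= tau_active
    active = (seg >= tau_active).any(dim=1)
    return emb[active]

Note that the outputs in your pipeline carry a batch dimension, so you would index it out first (e.g. emb[0] for the (1, 3, 512) tensors printed above).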
Thanks for your help. I submitted a PR with some changes.
Do you know how to change num_speakers on SpeakerSegmentation? I know we can create a config for SpeakerDiarization, but I'm not sure if we can do something similar for SpeakerSegmentation.
Changing the number of speakers would require re-training the segmentation model, or fine-tuning it to produce a matrix of a different size (adding or removing speaker rows).