Speaker diarization with pyannote.audio?
I am the creator of the pyannote.audio speaker diarization toolkit.
I understand that you went with @josepatino's PyBK because of its speed, but I'd love to see the pyannote.audio pretrained pipeline integrated into audapolis.
Would that be of interest to you? I'd love to help in some way!
Hey, thanks for reaching out. We would be interested in integrating this 100%. When I started looking at speaker diarization I also noticed pyannote.audio, but as there was no pretrained pipeline at the time, we decided against it.
Do you think it would be possible to extend the `SpeakerDiarization` pipeline to not only report the individual steps of the pipeline via the hook, but also the progress within certain steps? This would be a huge benefit for us.
I have been meaning to add this kind of progress hook for the online demo but it never really reached the top of my priority list.
These are the two steps that account for most of the processing time:
- speaker segmentation, which relies on the `Inference` class that already has some support for a progress hook
- speaker embedding, which is basically a `for` loop that should be easy to adapt to support a progress hook
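
For what it's worth, a custom hook could be wired up roughly like this. This is only a sketch, assuming the `hook` keyword accepted by pyannote.audio 2.x pipelines; the `completed`/`total` arguments would only be passed by a version that reports intra-step progress, and the model name and audio path are placeholders:

```python
# Sketch of a custom progress hook for a pyannote.audio 2.x pipeline.
# Versions without intra-step progress call the hook with just
# (step_name, step_artifact); completed/total stay None in that case.
from pyannote.audio import Pipeline

def on_progress(step_name, step_artifact, file=None, completed=None, total=None):
    if completed is not None and total is not None:
        print(f"{step_name}: {completed}/{total}")
    else:
        print(f"{step_name}: done")

# gated models may require a Hugging Face token via use_auth_token=...
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("audio.wav", hook=on_progress)
```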
FYI, I just released a much faster and more accurate version of the pyannote.audio speaker diarization pipeline. It still does not expose the progress of the individual steps, but this is now on my TODO list (though with no ETA).
Wow, I just tried it (and opened https://github.com/pyannote/pyannote-audio/pull/1185/files for the progress). The results are really impressive 😍
I started integrating it and stumbled upon a problem which I'm currently not sure how to solve, so if you have any idea @hbredin, I would be very interested: audapolis currently works on the assumption that there is only one "speaker" at any time. pyannote-audio, on the other hand, supports multiple speakers at the same time and therefore produces overlaps between speakers.
Since changing audapolis to support multiple speakers is too much for now, I'm trying to "flatten" the output of pyannote-audio to one speaker at a time. Do you have a suggestion on how to do that properly?
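
For illustration, this is the kind of overlapping `Annotation` I mean (a toy example with made-up times and labels):

```python
from pyannote.core import Annotation, Segment

ann = Annotation()
ann[Segment(0.0, 5.0)] = "SPEAKER_00"
ann[Segment(4.0, 8.0)] = "SPEAKER_01"  # overlaps SPEAKER_00 between 4.0 and 5.0
```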
Nothing built into pyannote comes to mind. You'd have to postprocess the `pyannote.core.Annotation` instance returned by the pipeline:
- remove any segment fully contained by a larger segment:

```
[------A-------]       ==>  [------A-------]
    [--B--]
```

- split partially overlapping segments in two halves:

```
[----A----]       ==>  [----A--]
      [----B----]              [--B----]
```
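
If it helps, the two rules above could be combined along these lines. This is only a sketch (nothing like it ships with pyannote): it sorts segments by start time, drops fully contained ones, and splits partial overlaps at the midpoint of the overlapping region.

```python
from pyannote.core import Annotation, Segment

def flatten(annotation: Annotation) -> Annotation:
    """Post-process a diarization so at most one speaker is active at a time."""
    tracks = sorted(
        ((segment, label) for segment, _, label in annotation.itertracks(yield_label=True)),
        key=lambda sl: (sl[0].start, -sl[0].duration),
    )
    kept = []  # [start, end, label]; mutable so an entry can be shrunk on split
    for segment, label in tracks:
        if kept and segment.end <= kept[-1][1]:
            # rule 1: fully contained in an already kept segment -> drop
            continue
        if kept and segment.start < kept[-1][1]:
            # rule 2: partial overlap -> split at the midpoint of the overlap
            midpoint = (segment.start + kept[-1][1]) / 2
            kept[-1][1] = midpoint
            kept.append([midpoint, segment.end, label])
        else:
            kept.append([segment.start, segment.end, label])
    flat = Annotation(uri=annotation.uri)
    for start, end, label in kept:
        flat[Segment(start, end)] = label
    return flat
```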
Or you could clip the output of the speaker counting step so that it never exceeds 1: `count.data = np.clip(count.data, 0, 1)` should do the trick...
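
To make the clipping idea concrete, here is a toy illustration. `count` stands in for the per-frame speaker counts computed inside the pipeline (the variable name and its provenance are assumptions), and the clip would have to happen before the count is used to build the final diarization:

```python
# Toy speaker-count track: four 0.5s frames with 0, 1, 2, 1 active speakers.
import numpy as np
from pyannote.core import SlidingWindow, SlidingWindowFeature

frames = SlidingWindow(duration=0.5, step=0.5, start=0.0)
count = SlidingWindowFeature(np.array([[0], [1], [2], [1]]), frames)

# cap the number of simultaneously active speakers at one
count.data = np.clip(count.data, 0, 1)
print(count.data.ravel())  # [0 1 1 1]
```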