transformers.js
Absolute speaker diarization?
Question
I've just managed to integrate the new speaker diarization feature into my project. Very cool stuff. My goal is to let people record meetings, summarize them, and then also list per-speaker tasks. This seems to be a popular feature.
One thing I'm running into is that I don't feed Whisper a single long audio file. Instead, I use VAD to feed it small chunks of live audio whenever someone speaks.
However, as far as I can tell, the speaker diarization only works "relatively": it detects and separates speakers within a single audio file.
Is there a way to let it detect and consistently label the same speaker across multiple audio files? Perhaps it could remember the "audio fingerprints" (speaker embeddings) of the speakers somehow?
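To make the idea concrete, here's roughly the bookkeeping I had in mind. This is just a sketch: `getSpeakerEmbedding()` is a hypothetical placeholder for whatever model would produce a per-chunk speaker embedding (I don't know if transformers.js exposes one), and the `0.75` similarity threshold is a made-up number that would need tuning.

```js
// Hypothetical helper: turns an audio chunk (Float32Array of PCM samples)
// into a fixed-size speaker embedding. Not a real transformers.js API —
// this is where a speaker-embedding model would plug in.
async function getSpeakerEmbedding(audioChunk) {
  throw new Error('placeholder: plug in a speaker-embedding model here');
}

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; ++i) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Registry of speakers seen so far; each entry keeps a running mean embedding
// (the "fingerprint" that persists across chunks).
const speakers = []; // { id, centroid: number[], count }
const SIMILARITY_THRESHOLD = 0.75; // assumption: tune per embedding model

// Assign a chunk's embedding to the closest known speaker, or register a new one.
function assignSpeaker(embedding) {
  let best = null, bestScore = -Infinity;
  for (const s of speakers) {
    const score = cosineSimilarity(embedding, s.centroid);
    if (score > bestScore) { bestScore = score; best = s; }
  }
  if (best && bestScore >= SIMILARITY_THRESHOLD) {
    // Update the running centroid so the fingerprint adapts over the meeting.
    for (let i = 0; i < best.centroid.length; ++i) {
      best.centroid[i] = (best.centroid[i] * best.count + embedding[i]) / (best.count + 1);
    }
    best.count += 1;
    return best.id;
  }
  const id = `SPEAKER_${speakers.length}`;
  speakers.push({ id, centroid: Array.from(embedding), count: 1 });
  return id;
}

// Per VAD chunk: embed the audio, then match it against the registry.
async function labelChunk(audioChunk) {
  const embedding = await getSpeakerEmbedding(audioChunk);
  return assignSpeaker(embedding);
}
```

Is something along these lines possible with the current diarization support, or would I need a separate speaker-embedding model for the matching step?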