Diarization endpoint
Is your feature request related to a problem? Please describe.
Currently, I don't believe there are any open-source API endpoints for a diarization pipeline. I think LocalAI adding one would fill a hole in the current offerings. The lack of such endpoints requires running AI models locally, disrupting an otherwise entirely remote, API-driven workflow.
Describe the solution you'd like
Projects like NeMo, pyannote, and Diart offer diarization workflows that could be linked to an API endpoint to provide remote diarization.
Describe alternatives you've considered
Additional context
Really liking the idea, adding it to the roadmap :+1:
I will try to implement this in the next few months. Pyannote now has an SDK for its cloud service, but I still have not found a locally running API service.
My aim for the moment will be a `/diarization` endpoint to generate a diarization response (a list of `{speaker, start, stop}` objects) and optionally a list of speaker embeddings. I may also try adding a `/voiceprint` endpoint to generate an embedding of an uploaded audio file.
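To make the shape concrete, here is a rough sketch of what such a response body could look like; the field names follow the proposal above, but nothing here is an existing LocalAI API:

```python
# Illustrative shape of a /diarization response body (not an existing API).
example_response = {
    "segments": [
        {"speaker": "SPEAKER_00", "start": 0.00, "stop": 4.25},
        {"speaker": "SPEAKER_01", "start": 4.25, "stop": 9.80},
        {"speaker": "SPEAKER_00", "start": 9.80, "stop": 12.10},
    ],
    # Optional speaker embeddings, keyed by speaker label.
    "embeddings": {
        "SPEAKER_00": [0.12, -0.03, 0.57],   # truncated; real vectors are longer
        "SPEAKER_01": [-0.41, 0.22, 0.08],
    },
}
```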
https://github.com/QuentinFuxa/WhisperLiveKit
Thanks for taking interest in this! I've looked into WhisperLiveKit, but it just uses NeMo and pyannote under the hood :)
I wanted to get to this earlier, but I never got around to figuring out how LocalAI plumbs the backends to the API. The Python portion is relatively simple for pyannote; it just needs the additional API routes.
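For reference, a minimal sketch of that Python portion, assuming pyannote.audio 3.x, the gated `pyannote/speaker-diarization-3.1` model, and a placeholder Hugging Face token:

```python
# Run the pretrained pyannote diarization pipeline and convert its turns
# into the proposed {speaker, start, stop} objects.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # placeholder; needs access to the gated model
)

diarization = pipeline("meeting.wav")  # any local audio file

segments = [
    {"speaker": speaker, "start": turn.start, "stop": turn.end}
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]
```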
I was rethinking the proposed endpoints, and I think this feature could be improved by the following design:
- refactor the `/vad` endpoints to enable using the pyannote segmentation model
- add a `/vad/embedding` endpoint to generate embeddings of audio files using pyannote, NeMo, etc. models
- add a `/diarization` endpoint that calls the two pipelines above and clusters the outputs (k-means, VBx, etc.; see pyannote for reference), returning a list of `{idx, speaker, start, stop}` objects (see the clustering sketch after this list)

These could all be implemented separately, as well.
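As a rough illustration of the clustering step in the `/diarization` bullet, here is a sketch that assumes per-segment embeddings already exist and uses plain k-means (scikit-learn); pyannote itself uses more robust methods such as agglomerative clustering or VBx, and the function name is hypothetical:

```python
from sklearn.cluster import KMeans

def cluster_segments(segments, embeddings, num_speakers):
    """segments: list of (start, stop) pairs from the segmentation step;
    embeddings: (n_segments, dim) array of per-segment speaker embeddings."""
    labels = KMeans(n_clusters=num_speakers, n_init="auto").fit_predict(embeddings)
    return [
        {"idx": i, "speaker": f"SPEAKER_{label:02d}", "start": start, "stop": stop}
        for i, ((start, stop), label) in enumerate(zip(segments, labels))
    ]
```

Keeping the segmentation, embedding, and clustering stages separate like this is what makes the three endpoints composable.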
That all looks good to me.
Do you know if the embeddings would be able to identify speakers between audio clips? This would help when processing voice commands.
You would have to keep a database yourself to compare embeddings, just like you would with text.
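For example, a minimal sketch of such a comparison using cosine similarity; the `identify` helper and the 0.7 threshold are illustrative assumptions, not anything LocalAI provides:

```python
import numpy as np

# Known speakers, persisted however you like (file, DB, etc.):
# name -> reference embedding from the embedding/voiceprint step.
known_speakers = {}

def identify(embedding, threshold=0.7):
    """Return the best-matching known speaker, or None if nothing clears
    the cosine-similarity threshold."""
    best_name, best_score = None, -1.0
    for name, ref in known_speakers.items():
        score = float(np.dot(embedding, ref) /
                      (np.linalg.norm(embedding) * np.linalg.norm(ref)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```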