Diarization endpoint
Is your feature request related to a problem? Please describe.
Currently, I don't believe there are any open-source API endpoints for a diarization pipeline. I think LocalAI adding one would fill a hole in the current offerings. The lack of such endpoints requires running AI models locally, disrupting an otherwise entirely remote, API-driven workflow.
Describe the solution you'd like
Projects like NeMo, pyannote, and Diart offer diarization workflows that could be linked to an API endpoint to provide remote diarization.
Describe alternatives you've considered
Additional context
Really liking the idea, adding it to the roadmap :+1:
I will try to implement this in the next few months. Pyannote now has an SDK for its cloud service, but I still have not found a locally running API service.
My aim for the moment will be a `/diarization` endpoint to generate a diarization response (a list of `{speaker, start, stop}` objects) and optionally a list of speaker embeddings. I may also try adding a `/voiceprint` endpoint to generate an embedding of an uploaded audio file.
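To make the shape concrete, here is a rough sketch of what such a response body could look like; the field names follow the proposal above, but nothing here is an existing LocalAI API:

```python
# Illustrative shape of a /diarization response body (not an existing API).
example_response = {
    "segments": [
        {"speaker": "SPEAKER_00", "start": 0.00, "stop": 4.25},
        {"speaker": "SPEAKER_01", "start": 4.25, "stop": 9.80},
        {"speaker": "SPEAKER_00", "start": 9.80, "stop": 12.10},
    ],
    # Optional speaker embeddings, keyed by speaker label.
    "embeddings": {
        "SPEAKER_00": [0.12, -0.03, 0.57],   # truncated; real vectors are longer
        "SPEAKER_01": [-0.41, 0.22, 0.08],
    },
}
```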
https://github.com/QuentinFuxa/WhisperLiveKit
Thanks for taking interest in this! I've looked into WhisperLiveKit, but it just uses NeMo and pyannote under the hood :)
I wanted to get to this earlier, but I never got around to figuring out how LocalAI plumbs the backends to the API. The Python portion is relatively simple for pyannote; it just needs the additional API routes.
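For reference, a minimal sketch of that Python portion, assuming pyannote.audio 3.x, the gated `pyannote/speaker-diarization-3.1` model, and a placeholder Hugging Face token:

```python
# Run the pretrained pyannote diarization pipeline and convert its turns
# into the proposed {speaker, start, stop} objects.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # placeholder; needs access to the gated model
)

diarization = pipeline("meeting.wav")  # any local audio file

segments = [
    {"speaker": speaker, "start": turn.start, "stop": turn.end}
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]
```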
I was rethinking the proposed endpoints, and I think this feature could be improved by the following design:
- refactor the `/vad` endpoints to enable using the pyannote segmentation model
- add a `/vad/embedding` endpoint to generate embeddings of audio files using pyannote, NeMo, etc. models
- add a `/diarization` endpoint that calls the two pipelines above and clusters the outputs (k-means, VBx, etc.; see pyannote for reference), returning a list of `{idx, speaker, start, stop}` objects (see the clustering sketch after this list)

These could all be implemented separately, as well.
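As a rough illustration of the clustering step in the `/diarization` bullet, here is a sketch that assumes per-segment embeddings already exist and uses plain k-means (scikit-learn); pyannote itself uses more robust methods such as agglomerative clustering or VBx, and the function name is hypothetical:

```python
from sklearn.cluster import KMeans

def cluster_segments(segments, embeddings, num_speakers):
    """segments: list of (start, stop) pairs from the segmentation step;
    embeddings: (n_segments, dim) array of per-segment speaker embeddings."""
    labels = KMeans(n_clusters=num_speakers, n_init="auto").fit_predict(embeddings)
    return [
        {"idx": i, "speaker": f"SPEAKER_{label:02d}", "start": start, "stop": stop}
        for i, ((start, stop), label) in enumerate(zip(segments, labels))
    ]
```

Keeping the segmentation, embedding, and clustering stages separate like this is what makes the three endpoints composable.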
That all looks good to me.
Do you know if the embeddings would be able to identify speakers between audio clips? This would help when processing voice commands.
You would have to keep a database yourself to compare embeddings, just like you would with text.
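For example, a minimal sketch of such a comparison using cosine similarity; the `identify` helper and the 0.7 threshold are illustrative assumptions, not anything LocalAI provides:

```python
import numpy as np

# Known speakers, persisted however you like (file, DB, etc.):
# name -> reference embedding from the embedding/voiceprint step.
known_speakers = {}

def identify(embedding, threshold=0.7):
    """Return the best-matching known speaker, or None if nothing clears
    the cosine-similarity threshold."""
    best_name, best_score = None, -1.0
    for name, ref in known_speakers.items():
        score = float(np.dot(embedding, ref) /
                      (np.linalg.norm(embedding) * np.linalg.norm(ref)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```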