Diarization endpoint

Open benniekiss opened this issue 1 year ago • 7 comments

Is your feature request related to a problem? Please describe.

Currently, I don't believe there are any open-source API endpoints for a diarization pipeline. I think LocalAI adding one would fill a gap in the current offerings. Without such an endpoint, diarization requires running AI models locally, which disrupts an otherwise entirely remote, API-driven workflow.

Describe the solution you'd like

Projects like NeMo, pyannote, and Diart offer diarization workflows that could be linked to an API endpoint to provide remote diarization.

Describe alternatives you've considered

Additional context

benniekiss avatar Jan 26 '24 12:01 benniekiss

Really liking the idea, adding it to the roadmap :+1:

mudler avatar Jan 26 '24 15:01 mudler

I will try to implement this in the next few months. Pyannote now has an SDK for its cloud service, but I still have not found a locally running API service.

My aim for the moment will be a /diarization endpoint to generate a diarization response (list of {speaker, start, stop} objects) and optionally a list of speaker embeddings. I may also try adding a /voiceprint endpoint to generate an embedding of an uploaded audio file.
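As a rough illustration of the response shape described above, here is a minimal sketch in Python. The class names, the `segments` and `embeddings` keys, and the `SPEAKER_00` label format are assumptions for illustration only; nothing here is an existing LocalAI schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One diarized span: who spoke, and when (seconds)."""
    speaker: str
    start: float
    stop: float


@dataclass
class DiarizationResponse:
    """Hypothetical /diarization payload; embeddings are optional."""
    segments: List[Segment]
    embeddings: Optional[List[List[float]]] = None


def to_json(resp: DiarizationResponse) -> str:
    # Only include the embeddings key when the caller asked for them.
    payload = {"segments": [asdict(s) for s in resp.segments]}
    if resp.embeddings is not None:
        payload["embeddings"] = resp.embeddings
    return json.dumps(payload)


response = DiarizationResponse(
    segments=[
        Segment("SPEAKER_00", 0.0, 2.5),
        Segment("SPEAKER_01", 2.5, 4.0),
    ]
)
print(to_json(response))
```

Keeping the embeddings optional means a client that only wants speaker turns does not pay for the larger payload.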

benniekiss avatar Mar 20 '25 19:03 benniekiss

https://github.com/QuentinFuxa/WhisperLiveKit

richiejp avatar Sep 11 '25 05:09 richiejp

Thanks for taking an interest in this! I've looked into WhisperLiveKit, but it just uses NeMo and Pyannote under the hood :)

I wanted to get to this earlier, but I never got around to figuring out how LocalAI plumbs the backends to the API. The Python portion is relatively simple for pyannote; it just needs the additional API routes.

benniekiss avatar Sep 11 '25 11:09 benniekiss

I was rethinking the proposed endpoints, and I think this feature could be improved by the following design:

  • refactor the /vad endpoints to enable using the pyannote segmentation model

  • add a /vad/embedding endpoint to generate embeddings of audio files using pyannote, NeMo, or similar models

  • add a /diarization endpoint that calls the two pipelines above and clusters the outputs (k-means, VBx, etc.; see pyannote for reference). It returns a list of {idx, speaker, start, stop} objects

These could all be implemented separately as well.
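To make the third step concrete, here is a minimal sketch of how segments and per-segment embeddings could be combined into {idx, speaker, start, stop} objects. It uses a simple greedy cosine-threshold clustering in place of k-means or VBx, purely to keep the example short; the function names, the `threshold` value, and the `SPEAKER_NN` labels are assumptions, and in practice the inputs would come from the segmentation and embedding models.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def diarize(
    segments: List[Tuple[float, float]],
    embeddings: List[List[float]],
    threshold: float = 0.7,  # assumed value; would need tuning per model
) -> List[Dict]:
    centroids: List[List[float]] = []  # running mean embedding per speaker
    counts: List[int] = []             # segments assigned to each speaker
    result = []
    for idx, ((start, stop), emb) in enumerate(zip(segments, embeddings)):
        scores = [cosine(emb, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            # No existing speaker is close enough: start a new cluster.
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Fold this segment into the matched speaker's centroid.
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1) for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
        result.append(
            {"idx": idx, "speaker": f"SPEAKER_{best:02d}", "start": start, "stop": stop}
        )
    return result


# Toy inputs standing in for the /vad and /vad/embedding outputs.
segments = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.2)]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
result = diarize(segments, embeddings)
for row in result:
    print(row)
```

Because the clustering only sees segments and embeddings, the three endpoints stay independent: the clustering step could be swapped for k-means or VBx without touching the other two.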

benniekiss avatar Sep 12 '25 14:09 benniekiss

That all looks good to me.

Do you know if the embeddings would be able to identify speakers between audio clips? This would help when processing voice commands.

richiejp avatar Sep 13 '25 10:09 richiejp

You would have to keep a database yourself to compare embeddings, just like you would with text.
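A minimal sketch of what that could look like on the client side, assuming the embeddings are plain lists of floats and using cosine similarity with a fixed threshold. `VoiceprintStore`, its method names, and the threshold value are all hypothetical; a real setup would likely use a vector database and a threshold tuned for the embedding model.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VoiceprintStore:
    """In-memory stand-in for a database of known speaker embeddings."""

    def __init__(self, threshold: float = 0.75) -> None:
        self.threshold = threshold  # assumed value; tune per model
        self._prints: Dict[str, List[float]] = {}

    def enroll(self, name: str, embedding: List[float]) -> None:
        self._prints[name] = list(embedding)

    def identify(self, embedding: List[float]) -> Optional[str]:
        # Return the closest enrolled speaker, or None if nobody
        # reaches the similarity threshold.
        best_name, best_score = None, self.threshold
        for name, known in self._prints.items():
            score = cosine_similarity(embedding, known)
            if score >= best_score:
                best_name, best_score = name, score
        return best_name


store = VoiceprintStore()
store.enroll("alice", [1.0, 0.0, 0.0])
store.enroll("bob", [0.0, 1.0, 0.0])
print(store.identify([0.9, 0.1, 0.0]))  # closest to alice's voiceprint
print(store.identify([0.0, 0.0, 1.0]))  # matches nobody
```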

benniekiss avatar Sep 13 '25 11:09 benniekiss