
Speaker attribution and voice classification

div5yesh opened this issue Jun 12 '19 · 12 comments

Enhancement: Speaker attribution and voice classification

  • Identify and label speakers against voices in the training set.
  • Cluster similar voices (a rough sketch of one possible approach follows below).
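For illustration, here is a minimal sketch of what both requests could look like, assuming some pretrained speaker-embedding model behind a hypothetical `embed_utterance` function (not part of DeepSpeech) and scikit-learn ≥ 1.2 for clustering (older releases name the `metric` parameter `affinity`):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def embed_utterance(wav: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: map a waveform to a fixed-size speaker
    embedding (d-vector/x-vector) from any pretrained model."""
    raise NotImplementedError("plug in a speaker-embedding model here")

def identify_speaker(wav, enrolled, threshold=0.75):
    """Label an utterance against enrolled voices via cosine similarity."""
    emb = embed_utterance(wav)
    emb /= np.linalg.norm(emb)
    scores = {name: float(np.dot(emb, ref / np.linalg.norm(ref)))
              for name, ref in enrolled.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"

def cluster_voices(wavs, distance_threshold=0.3):
    """Group utterances by voice similarity, without knowing the
    number of speakers in advance."""
    embs = np.stack([embed_utterance(w) for w in wavs])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    clusterer = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=distance_threshold)
    return clusterer.fit_predict(embs)
```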

div5yesh avatar Jun 12 '19 20:06 div5yesh

We can consider this, but generally we're narrowly focused on STT.

kdavis-mozilla avatar Jun 13 '19 07:06 kdavis-mozilla

Transcription that also labels the speaker in the audio for each word or dialogue segment could have useful applications, and the speaker label would fit naturally as metadata alongside the STT output.

So maybe we could have a contrib repo that acts as an extension to STT and hosts features related to speech in general. That would make DeepSpeech a powerful library for all kinds of speech-related problems.

What do you think?

div5yesh avatar Jun 13 '19 15:06 div5yesh

@div5yesh A contrib repo is a reasonable idea. But we still have to consider the bandwidth required to review and test contrib code, and how the contrib code gets updated across non-backward-compatible releases.

What's your take @reuben and @lissyx?

kdavis-mozilla avatar Jun 14 '19 08:06 kdavis-mozilla

I have a hard time figuring out exactly how those pieces would fit together. My experience with this kind of contrib repo is mixed: they are often broken and badly maintained, which makes for a poor user/dev experience and generates frustration. I understand the need for the feature, but it requires extending the API. How would a contrib repo, in the end, be integrated to provide that?

lissyx avatar Jun 14 '19 08:06 lissyx

@lissyx Your concerns are valid. I think extending the API would be reasonable. Currently, STT on its own covers a fairly limited set of use cases; adding more data to the output would certainly address a few more. I believe an API that does deeper analysis of the speech itself to extract information (not just text) could be even more useful.

div5yesh avatar Jun 18 '19 14:06 div5yesh

I believe an API that does deeper analysis of the speech itself to extract information (not just text) could be even more useful.

We're not saying it's not useful :)

I think extending the API would be reasonable.

Well, we expose a Metadata struct that you might be able to extend and experiment with if you are interested.
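For context, this is roughly how the Metadata struct surfaces through the Python bindings in the v0.6-era API (constructor arguments and field names changed across releases; later versions wrap this in CandidateTranscript/TokenMetadata). A per-word speaker label, attached to these timestamps, is exactly the kind of extension being discussed; the sketch below shows the word timings such a label would hang off of:

```python
import wave
import numpy as np
from deepspeech import Model  # pip install deepspeech; v0.6-era API assumed

model = Model("output_graph.pbmm")  # constructor args varied across releases

with wave.open("audio.wav", "rb") as w:  # 16 kHz, 16-bit mono expected
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

meta = model.sttWithMetadata(audio)

# Group the character-level MetadataItems into words with start times.
words, current, start = [], "", None
for item in meta.items:
    if item.character == " ":
        if current:
            words.append((current, start))
        current, start = "", None
    else:
        if start is None:
            start = item.start_time
        current += item.character
if current:
    words.append((current, start))

# No speaker field exists today; a contrib extension could attach one here.
for word, t in words:
    print(f"{t:6.2f}s  {word}")
```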

lissyx avatar Jun 18 '19 15:06 lissyx

I'd also be interested in this

rhamnett avatar Jun 23 '19 23:06 rhamnett

It is an interesting enhancement, and there is quite a lot of existing work related to it. One notable project (which I am using as a separate fork in my org) is Resemblyzer: https://github.com/resemble-ai/Resemblyzer
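For instance, a minimal speaker-identification sketch with Resemblyzer's pretrained encoder (the enrollment file names here are hypothetical; `embed_utterance` returns L2-normalized embeddings, so a dot product gives the cosine similarity):

```python
from pathlib import Path
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # pretrained GE2E d-vector model

# Enroll one reference utterance per known speaker (hypothetical files).
enrolled = {name: encoder.embed_utterance(preprocess_wav(Path(f"{name}.wav")))
            for name in ("alice", "bob")}

# Identify an unknown utterance by cosine similarity to the enrollments.
test = encoder.embed_utterance(preprocess_wav(Path("unknown.wav")))
scores = {name: float(np.dot(test, emb)) for name, emb in enrolled.items()}
print(max(scores, key=scores.get), scores)
```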

I would be happy to contribute to the speaker identification and classification feature after I familiarize myself with DeepSpeech's codebase.

shashankpr avatar Sep 24 '19 11:09 shashankpr

FTR, I've come across this work, https://github.com/ina-foss/inaSpeechSegmenter, which might be useful in this context.
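For reference, inaSpeechSegmenter's API is quite small; a sketch (the input file name is a placeholder):

```python
from inaSpeechSegmenter import Segmenter

# Pretrained CNN segmenter: labels regions as male/female speech,
# music, noise, or no-energy.
seg = Segmenter()
for label, start, stop in seg("audio.wav"):
    print(f"{start:7.2f}s - {stop:7.2f}s  {label}")
```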

lissyx avatar Oct 09 '19 09:10 lissyx

I think that speaker diarization would be the most useful here, as it does not require any training data for the target speakers. The x-vector model is one of the most powerful implementations; see the Kaldi model.
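Kaldi's x-vector diarization recipes are driven by shell scripts, but the core idea is easy to sketch: embed sliding windows of the audio and cluster the embeddings. A rough illustration, assuming a hypothetical pretrained `extract_xvector` function:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def extract_xvector(segment: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a pretrained x-vector extractor."""
    raise NotImplementedError

def diarize(wav, sample_rate, win_s=1.5, hop_s=0.75, n_speakers=2):
    """Assign a speaker label to each sliding window of the audio."""
    win, hop = int(win_s * sample_rate), int(hop_s * sample_rate)
    starts = list(range(0, max(len(wav) - win, 1), hop))
    embs = np.stack([extract_xvector(wav[s:s + win]) for s in starts])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embs)
    return [(s / sample_rate, (s + win) / sample_rate, int(l))
            for s, l in zip(starts, labels)]
```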

diego-fustes avatar Nov 22 '19 11:11 diego-fustes

How would one use it? Could you give an example, with a WAV file as input?

Tortoise17 avatar Nov 22 '19 11:11 Tortoise17

This may or may not help - https://github.com/tyiannak/pyAudioAnalysis
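A sketch of its diarization entry point (the function name and return value have changed across pyAudioAnalysis releases, so treat this as approximate):

```python
from pyAudioAnalysis import audioSegmentation as aS

# Unsupervised diarization: one cluster label per short analysis window.
labels = aS.speaker_diarization("audio.wav", n_speakers=2)
print(labels)  # e.g. [0 0 0 1 1 0 ...]
```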

shravanshetty1 avatar Nov 16 '20 10:11 shravanshetty1