DeepSpeech
Enhancement: Speaker attribution and voice classification
- Identify and label speakers against voices in the training set.
- Cluster similar voices (see the clustering sketch after this list).
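A minimal sketch of the clustering item, assuming fixed-size speaker embeddings (e.g. one vector per utterance from any speaker encoder) have already been computed; the 0.3 cosine-distance threshold is an illustrative value, not a tuned one:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_voices(embeddings: np.ndarray, distance_threshold: float = 0.3) -> np.ndarray:
    """Group utterances by speaker via agglomerative clustering.

    embeddings: (n_utterances, embedding_dim) array of speaker embeddings.
    Returns an array of cluster labels, one per utterance.
    """
    # Pairwise cosine distances between all utterance embeddings.
    distances = pdist(embeddings, metric="cosine")
    # Average-linkage hierarchical clustering on those distances.
    tree = linkage(distances, method="average")
    # Cut the tree at the chosen threshold to form speaker clusters.
    return fcluster(tree, t=distance_threshold, criterion="distance")
```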
We can consider this, but generally we're narrowly focused on STT.
Transcription that also labels the speaker in the audio for each word or dialogue segment can have useful applications, and would relate to metadata for STT.
So maybe we could have a contrib repo that is an extension to STT and hosts features related to speech in general. That would result in a powerful DeepSpeech library catering to all kinds of speech-related problems.
What do you think?
@div5yesh A contrib repo is a reasonable idea. But we still have to consider the bandwidth required for review and tests of contrib code, and how contrib is updated across non-backward-compatible releases.
What's your take @reuben and @lissyx?
I have a hard time figuring out exactly how those pieces would have to stick together. IMHO, my experience with this kind of contrib repo leaves really mixed feelings: often broken, badly maintained, it provides a poor user / dev experience and generates frustration. I understand the need for the feature, but it requires extending the API. How would a contrib repo, in the end, be integrated to provide that?
@lissyx Your concerns are valid. I think extending the API would be reasonable. Currently, STT has a pretty limited use case. Sure, adding more data to the output would definitely help address a few more use cases. I believe having an API to do deep analysis of the speech itself to extract information (not just text) could be more useful.
> I believe having an API to do deep analysis of the speech itself to extract information (not just text) could be more useful.
We're not saying it's not useful :)
> I think extending the API would be reasonable.
Well, we have an exposed Metadata struct that you might be able to extend and experiment with if you are interested.
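For reference, a minimal sketch of reading the existing Metadata output from the Python API; the model filename and the way words are rebuilt from the character-level tokens are illustrative assumptions, but per-word timing like this is what a speaker label could be attached to:

```python
import wave
import numpy as np
from deepspeech import Model

# Placeholder paths; use your own acoustic model and a 16 kHz mono WAV file.
model = Model("deepspeech-0.9.3-models.pbmm")
with wave.open("audio.wav", "rb") as f:
    audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

# Metadata exposes character-level tokens, each with a start time in seconds.
metadata = model.sttWithMetadata(audio, 1)
tokens = metadata.transcripts[0].tokens

# Rebuild words, keeping the start time of each word's first character.
words, current, start = [], "", None
for token in tokens:
    if token.text == " ":
        if current:
            words.append((current, start))
        current, start = "", None
    else:
        if not current:
            start = token.start_time
        current += token.text
if current:
    words.append((current, start))

for word, t in words:
    print(f"{t:6.2f}s  {word}")
```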
I'd also be interested in this
It is an interesting enhancement and there is quite a lot of existing work related to this as well. One notable project (which I am using as a separate fork in my org) is Resemblyzer: https://github.com/resemble-ai/Resemblyzer
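A minimal sketch of speaker identification with Resemblyzer, following its documented VoiceEncoder / preprocess_wav usage; the file paths and the 0.75 similarity cutoff are illustrative assumptions:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Reference utterances with known speakers (paths are placeholders).
reference = {
    "alice": encoder.embed_utterance(preprocess_wav("alice_sample.wav")),
    "bob": encoder.embed_utterance(preprocess_wav("bob_sample.wav")),
}

# Embed the unknown utterance; embeddings are L2-normalized, so the dot
# product is the cosine similarity.
query = encoder.embed_utterance(preprocess_wav("unknown.wav"))
scores = {name: float(np.dot(query, emb)) for name, emb in reference.items()}
best, score = max(scores.items(), key=lambda kv: kv[1])
print(best if score > 0.75 else "unknown speaker", scores)
```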
I would be happy to contribute to the feature of speaker identification and classification after I familiarize myself with the DeepSpeech codebase.
FTR, I've come across this work https://github.com/ina-foss/inaSpeechSegmenter that might be useful in this context.
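For context, inaSpeechSegmenter's basic documented usage looks roughly like the following (the file name is a placeholder); it returns labelled time segments (speech by gender, music, no-energy regions) rather than speaker identities:

```python
from inaSpeechSegmenter import Segmenter

# Split the recording into labelled (label, start, stop) segments.
seg = Segmenter()
for label, start, stop in seg("recording.wav"):
    print(f"{start:7.2f}s - {stop:7.2f}s  {label}")
```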
I think that Speaker Diarization would be the most useful here, as it does not require any training data. The X-Vector model is one of the most powerful implementations; see the Kaldi model.
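To illustrate how diarization output could be attached to STT output, here is a minimal sketch, assuming you already have (speaker, start, end) segments from any diarization tool and (word, start_time) pairs from DeepSpeech's Metadata; each word is assigned the speaker whose segment contains its start time:

```python
def attribute_words(words, segments):
    """Attach a speaker label to each transcribed word.

    words:    list of (word, start_time_in_seconds) from STT metadata.
    segments: list of (speaker, segment_start, segment_end) from diarization.
    """
    labelled = []
    for word, t in words:
        speaker = next(
            (spk for spk, start, end in segments if start <= t < end),
            "unknown",
        )
        labelled.append((speaker, word, t))
    return labelled

# Toy data for illustration only.
words = [("hello", 0.4), ("there", 0.9), ("hi", 2.1)]
segments = [("speaker_0", 0.0, 1.5), ("speaker_1", 1.5, 3.0)]
print(attribute_words(words, segments))
# [('speaker_0', 'hello', 0.4), ('speaker_0', 'there', 0.9), ('speaker_1', 'hi', 2.1)]
```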
How do you use it? Could you give an example, with a wav file as input?
This may or may not help - https://github.com/tyiannak/pyAudioAnalysis