How to use Vosk Punctuation Model in C#

Open securigy opened this issue 1 year ago • 6 comments

I've been googling and browsing all day long but cannot find how to use Vosk Punctuation models, especially in C#. Is it supported at all? If yes, any example?

I also have a question about speaker models: can they be used without training, that is, based on differences in voice pitch and other audio characteristics?

securigy, Mar 20 '23 22:03

Not yet. We are working on universal punctuation that can be used from other languages, but it takes time.

For speaker models, yes, you can use a pretrained model. They detect pitch differences and map them to an x-vector.

nshmyrev, Mar 21 '23 07:03

Punctuation - got it.

Speaker models - that's a shame, because I do not have a pretrained model. I was hoping there was a generic model that can detect differences in voice pitch... Making my own is beyond my knowledge and capability at this time...

securigy, Mar 21 '23 16:03

"Making my own is beyond my knowledge and capability at this time..."

It is in downloads, see

https://alphacephei.com/vosk/models/vosk-model-spk-0.4.zip

nshmyrev, Mar 21 '23 16:03

For usage see https://github.com/alphacep/vosk-api/issues/405

nshmyrev, Mar 21 '23 20:03

Well, the model is there, but it is not at all clear how to tell one speaker from another... There is some Python code, but I still have no idea what all the numbers mean or which comparisons need to be made to achieve that. So I have to drop it for now...

BTW, is there any way to delegate work to the GPU? Do I need to detect in code whether I have an adequate GPU first, and if so, how?

securigy, Mar 22 '23 01:03

2 days were wasted. Vosk is really good at transcribing voice to text. But I think speaker recognition is not ready yet. There is neither a proper source nor an example. Everyone has written something from every angle, but it is all empty. I think it is necessary to prepare a detailed document for speaker recognition.

rehberim360, May 02 '24 08:05
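
On the "numbers and comparisons" question raised earlier in the thread, here is a minimal sketch of the comparison step only, in Python because that is what the existing speaker examples use. It assumes the recognizer was created with the speaker model, so that its final result JSON carries an "spk" field holding the x-vector. The usual approach is to compare that vector against a reference vector per known speaker using cosine distance. The reference vectors, the sample JSON, and the 0.5 threshold below are made-up placeholders (real x-vectors are much longer), and `closest_speaker` is a hypothetical helper, not part of Vosk.

```python
import json
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: near 0 for similar voices, larger otherwise."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def closest_speaker(xvector, enrolled):
    """Hypothetical helper: return (name, distance) of the nearest enrolled speaker."""
    return min(
        ((name, cosine_distance(xvector, ref)) for name, ref in enrolled.items()),
        key=lambda pair: pair[1],
    )


# Placeholder reference x-vectors, shortened to 4 values for readability.
# In practice you would record each known speaker once and store the "spk"
# vector the recognizer returns for that recording.
ENROLLED = {
    "alice": [0.9, 0.1, -0.3, 0.2],
    "bob": [-0.4, 0.8, 0.5, -0.1],
}

# Placeholder for a final recognizer result; only the "spk" field matters here.
result_json = '{"text": "hello world", "spk": [0.85, 0.15, -0.25, 0.2], "spk_frames": 120}'

# Assumed cut-off: distances above it are treated as "unknown speaker".
# This value is a guess and has to be tuned on your own recordings.
THRESHOLD = 0.5

result = json.loads(result_json)
if "spk" in result:  # very short utterances may not produce an x-vector
    name, dist = closest_speaker(result["spk"], ENROLLED)
    label = name if dist < THRESHOLD else "unknown"
    print(f"{label} (distance {dist:.3f})")
```

The same arithmetic ports directly to C#: it is a dot product and two vector norms, so nothing here depends on Python. If more than one recording per speaker is available, averaging their x-vectors into a single reference is a common refinement, though I have not verified how much it helps with this particular model.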