Where is the documentation?

Open dyc3 opened this issue 4 years ago • 11 comments

All I've found is the examples in each folder. Is there no proper documentation with all the functions/classes/modules listed out?

dyc3 avatar May 10 '20 23:05 dyc3

Man, the API is just a one-pager, https://github.com/alphacep/vosk-api/blob/master/src/vosk_api.h, but I'll write you something in the coming days.

nshmyrev avatar May 12 '20 19:05 nshmyrev

Sure, that's great for C/C++, but it's not obvious how the bindings for other languages work just from that header file, especially if you aren't used to reading or writing C. Even then, some of the function names do not necessarily clearly convey what they do, or side effects that they may have.

For example, it's unclear what the difference is between vosk_recognizer_result, vosk_recognizer_partial_result, and vosk_recognizer_final_result, and when these functions are used in Python, they return dictionaries instead of strings (as that header would imply), the schema of which is not documented.

dyc3 avatar May 13 '20 16:05 dyc3

Hey, any update on this? I'm working with a Spring Boot application and would like to integrate VOSK.

zhangsjacob avatar Jul 02 '20 15:07 zhangsjacob

> Hey, any update on this? I'm working with a Spring Boot application and would like to integrate VOSK.

It's getting improved; there is a C sample, for example. Let me know what else is missing. It would be nice to see Spring Boot integration in place.

nshmyrev avatar Jul 02 '20 15:07 nshmyrev

Related #405

nshmyrev avatar Feb 16 '21 17:02 nshmyrev

I echo the comment above that it's unclear what the difference is between vosk_recognizer_result, vosk_recognizer_partial_result, and vosk_recognizer_final_result. If I'm only interested in a final result, that doesn't seem possible. You have to iterate through and get all the intermediate results and put them together yourself.

And what is the significance of the readFrames(4000) call? Is it just reading 4000 frames at a time? What if that occurs on a word boundary? Is there going to be an error? I would have expected it to be able to read a stream and detect the pauses to break up the text itself. Is there any significance to the 4000? Can any number be used? Is it constrained only by memory? Are there advantages or disadvantages to other numbers?

Obviously, you've done a great job on this. We're all just trying to figure out how to use it :-)

peterkronenberg avatar Feb 16 '21 20:02 peterkronenberg

@peterkronenberg From my experience, it does just that without stumbling on those buffer borders. You can skip partial_result and use result to get JSON on the fly without partials. final_result is used to flush the buffer right now (to manually state that the current phrase is over and start the next buffer from a new line, without waiting for silence).

There is no significance to 4000. I'm using 3200 (bytes), for example, and it works fine. You can even feed it 2 bytes (the sample size) at a time, I think, but that would cause a lot of overhead in memory operations and in computing the result, so choose a reasonable buffer size so the recognizer works in portions (and that buffer would be stream-fed).
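To illustrate the point about chunk size, here is a minimal sketch using only Python's standard `wave` module. The helper names `iter_chunks` and `make_silent_wav` are made up for this example and are not part of Vosk. It shows only that the same audio bytes reach the consumer whatever chunk size is chosen; it does not call the recognizer. Note that `readframes` counts frames, not bytes: for 16-bit mono audio one frame is 2 bytes, so 4000 frames is 8000 bytes.

```python
import io
import wave


def iter_chunks(wf, frames_per_chunk=4000):
    """Yield raw PCM byte chunks from an open wave reader until it is exhausted."""
    while True:
        data = wf.readframes(frames_per_chunk)
        if not data:  # readframes returns b"" at end of file
            break
        yield data


def make_silent_wav(n_frames, sample_rate=16000):
    """Build an in-memory 16-bit mono WAV of silence (demo input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        out.setframerate(sample_rate)
        out.writeframes(b"\x00\x00" * n_frames)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    wf = wave.open(make_silent_wav(10000), "rb")
    small = list(iter_chunks(wf, 1600))
    wf.rewind()
    large = list(iter_chunks(wf, 4000))
    # Different chunk sizes, identical byte stream overall.
    print(len(small), len(large), b"".join(small) == b"".join(large))
```

Each yielded chunk is what would be passed to the recognizer; a smaller chunk size means more calls for the same audio, which is the overhead mentioned above.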

LuggerMan avatar Feb 18 '21 14:02 LuggerMan

What exactly is the boolean return code from the acceptWaveform supposed to represent? If true, we get a result, if false, we get a partial result. What exactly does the partial result represent?

If acceptWaveform returns false and we just get partial results, when do we get the result for that iteration? Is it lost?

And it's just not clear to me why, if acceptWaveform is true and a result is available, it doesn't just give me the complete result instead of requiring another call to getFinalResult. I guess I'm just trying to understand a little better what it is doing internally.

peterkronenberg avatar Feb 18 '21 15:02 peterkronenberg

@peterkronenberg I think it has something to do with the phrase level (look up Kaldi recognition internals: three layers, phonemes, words, phrases), so a return value of 1 means Vosk has finished accepting a phrase and can return a recognized phrase with reliable confidence. The partial result is more about the word layer: it is not context-dependent and it is incremental (the first words in a phrase are more representative of it, as you can see the partial result changing its values while accepting parts of a word).

If it doesn't (which is rare), you can feed it an additional zero buffer or flush with final_result, but I think just always getting the full result without partials works fine too (please test this).
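The flow described in this thread can be sketched as a small loop. This is an illustration under assumptions, not official documentation: it assumes a recognizer object with the method names used in the project's Python examples (`AcceptWaveform`, `Result`, `PartialResult`, `FinalResult`), and it assumes those methods return JSON strings with a `"text"` key for full results and a `"partial"` key for partial ones. The function name `transcribe` and the `on_partial` callback are hypothetical.

```python
import json


def transcribe(recognizer, chunks, on_partial=None):
    """Feed audio chunks to a recognizer and collect completed phrases.

    Assumes `recognizer` exposes AcceptWaveform/Result/PartialResult/FinalResult
    and that results are JSON strings (an assumption, see the text above).
    """
    utterances = []
    for data in chunks:
        if recognizer.AcceptWaveform(data):
            # True: a phrase was completed, so Result() holds its text.
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                utterances.append(text)
        elif on_partial is not None:
            # False: the phrase is still in progress; the partial text is not
            # lost, it is refined and returned by a later Result/FinalResult.
            on_partial(json.loads(recognizer.PartialResult()).get("partial", ""))
    # Flush whatever is still buffered once the audio ends.
    tail = json.loads(recognizer.FinalResult()).get("text", "")
    if tail:
        utterances.append(tail)
    return utterances
```

If only the final transcript matters, leave `on_partial` unset and join the returned list, for example `" ".join(transcribe(rec, chunks))`, where `rec` would be something like `KaldiRecognizer(model, sample_rate)` from the `vosk` package.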

LuggerMan avatar Feb 19 '21 09:02 LuggerMan

Any update on this? It would really help to see what each function does.

Vrajs16 avatar May 24 '22 00:05 Vrajs16

It looks like someone has documented the vosk_api.h file since this issue was originally opened. I think this documentation is adequate. Probably just linking to this from the main README.md should be enough to close this issue.

As a side note, this Rust wrapper created some really good documentation as well.

gregtzar avatar Feb 17 '24 02:02 gregtzar