Performance tips
Hi, this isn't really an issue, but I'm running Vosk on an Anki Vector robot - it has a Qualcomm APQ8009 CPU, which has 4 Cortex-A7 cores. It runs embedded Linux built with Yocto.
After some tuning (mainly just limiting the grammar), it actually runs at an acceptable speed, however I'd like to squeeze as much performance as I can. I am wondering if you have any tips.
The only real difference between it and a Pi 2 is a strictly-softfp environment, in case that affects anything. That isn't due to any limitations of the CPU, just has to do with proprietary Qualcomm blobs needing to run in the same environment. From what I can tell, this doesn't prevent it from actually using an FPU, it just uses integer registers rather than fp registers.
I am essentially using your Android build script, except modified to use my own (GCC 10) toolchain built with crosstool-ng.
Here are my current findings:
- `neon` and `neon-vfpv4` don't seem to cause a difference in performance
- `USE_THREAD=1 NUM_THREADS=4` actually compiles and runs, but makes the performance worse
- The en-US Zamia model runs considerably faster than the regular small en-US model - probably not as accurate, but it's accurate enough for my application
- `SetEndpointerDelays`, `SetWords`, and `SetPartialWords` don't really seem to do anything
- I am using my own webrtcvad-based VAD implementation: I just feed audio to Vosk, then get a `FinalResult` once my VAD has detected the end of speech. That's probably why `SetEndpointerDelays` does nothing.
This is my current Go code, in case you see something that could be tuned:
```go
func InitVosk() {
	loadIntents()
	var err error
	model, err = vosk.NewModel("/anki/data/assets/cozmo_resources/cloudless/en-US/model")
	if err != nil {
		log.Fatal("model not found: ", err)
	}
	rec, err = vosk.NewRecognizerGrm(model, 16000, GetGrammerList("en-US"))
	if err != nil {
		log.Fatal("error making rec: ", err)
	}
	// does this actually do anything?
	rec.SetMaxAlternatives(0)
	rec.SetEndpointerDelays(3, 0, 0)
}
```
```go
func Process(chunk []byte) string {
	if len(chunk) == 0 {
		fmt.Println("empty chunk")
		return ""
	}
	// todo: experiment with giving AcceptWaveform smaller or bigger chunks
	stop, _ := DetectEndOfSpeech(chunk)
	rec.AcceptWaveform(chunk)
	if stop {
		var jres map[string]interface{}
		if err := json.Unmarshal([]byte(rec.FinalResult()), &jres); err != nil {
			fmt.Println("error parsing result:", err)
			return ""
		}
		// two-value assertion so a missing "text" key can't panic
		transcribedText, _ := jres["text"].(string)
		fmt.Println("transcribed text: " + transcribedText)
		return transcribedText
	}
	return ""
}
```
> How complex is the grammar you need to handle? How many words/target phrases?
I have a big JSON of intent:utterances pairs. I go through all of them, splitting each utterance on " ", checking whether the word exists in the model, then adding it to a grammar list.
The robot has commands like "how are you," "good robot," "set a timer for <number> minutes," and "what's the weather."
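The grammar-building step described above can be sketched roughly like this. The `buildGrammar` function and the sample intents map are illustrative, not the project's actual code; in the real project each word would also be checked against the model (e.g. with the Vosk binding's word lookup) before being kept:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// buildGrammar collects the unique lowercase words from all utterances,
// preserving first-seen order, and returns them as the JSON array string
// that Vosk's grammar-based recognizer expects.
// (Hypothetical helper; the real code also filters out words the model
// doesn't know.)
func buildGrammar(intents map[string][]string) string {
	seen := map[string]bool{}
	var words []string
	for _, utterances := range intents {
		for _, u := range utterances {
			for _, w := range strings.Fields(strings.ToLower(u)) {
				if !seen[w] {
					seen[w] = true
					words = append(words, w)
				}
			}
		}
	}
	b, _ := json.Marshal(words)
	return string(b)
}

func main() {
	intents := map[string][]string{
		"greeting": {"how are you", "good robot"},
	}
	fmt.Println(buildGrammar(intents))
	// prints ["how","are","you","good","robot"]
}
```

Deduplicating here matters: the grammar size directly affects decode speed on a CPU this small, so every repeated word across utterances should only appear once in the list.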
This list comes from one of my other projects that uses Vosk as its STT engine. Since I didn't want to train a new model, I just added as many common misrecognitions as I could, which is why there are so many strange words in it.
For further context, I created wire-pod, which is an external voice server for the robot. I am pretty much just trying to see if I can do all voice processing on the robot instead.
The final list for en-US is:
"name", "is", "native", "names", "name's", "my", "weather", "whether", "the", "other", "water", "no", "forecast", "tomorrow", "whats", "who", "am", "i", "eye", "color", "colo", "call", "her", "foller", "ichor", "agricola", "change", "oracular", "oracle", "set", "your", "to", "older", "how", "old", "are", "you", "or", "yo", "there", "start", "owing", "tailoring", "exploring", "charge", "home", "go", "church", "find", "ch", "charger", "flee", "sleep", "sheep", "morning", "mourning", "mooning", "it", "bore", "afternoon", "after", "noon", "whom", "good", "night", "might", "goodnight", "bye", "by", "buy", "goodbye", "fireworks", "new", "year", "happy", "have", "been", "now", "never", "knew", "bennie", "he", "holds", "christmas", "behold", "holiday", "in", "intellect", "fine", "alex", "ing", "an", "elect", "angelica", "up", "alexa", "sign", "outlet", "of", "out", "ale", "forward", "for", "ward", "word", "move", "forwards", "around", "one", "eighty", "ate", "turn", "left", "e", "ed", "ernest", "right", "ernie", "credit", "roll", "cu", "all", "human", "yorke", "cube", "pop", "a", "w", "wieland", "do", "wheel", "doorstone", "powell", "willie", "really", "o'", "billy", "wheelie", "stand", "this", "pomp", "pump", "bump", "book", "with", "first", "fifth", "were", "fifteen", "if", "wisdom", "bu", "fist", "bomb", "ball", "system", "black", "cards", "game", "play", "blackjack", "yes", "correct", "sure", "please", "dont", "thanks", "photo", "selby", "capture", "picture", "take", "me", "awesome", "also", "as", "some", "them", "battle", "t", "rob", "ro", "amazing", "robot", "bad", "that", "ad", "root", "hate", "horrible", "sorry", "apologize", "apologise", "tory", "nevermind", "mind", "im", "back", "backwards", "beck", "down", "volume", "quieter", "louder", "stare", "at", "loudness", "shut", "hello", "our", "follow", "far", "about", "low", "loo", "come", "here", "love", "dove", "question", "weston", "conversation", "lets", "talk", "let's", "check", "timer", "time", "checked", 
"stop", "cancel", "clock", "be", "stopped", "what", "quiet", "dance", "dancing", "thence", "beat", "boogie", "music", "pickup", "pick", "fetch", "bring", "trick", "something", "cool", "thing", "record", "message", "method", "hit", "stan", "keep", "away", "day", "tonight", "purple", "blue", "sapphire", "yellow", "teal", "tell", "green", "orange", "self", "medium", "normal", "regular", "high", "loud", "mute", "nothing", "silent", "off", "zero", "'s", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty", "fourty", "fifty", "sixty", "seventy", "ninety", "hundred", "hour", "minute", "second", "forty", "seconds", "minutes", "hours"
Is this large for a grammar list? It could realistically be cut down by quite a bit.
I also found out that I can tune model.conf by reducing max-beam and beam, which improves performance a bit.
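For reference, the beam-related entries in a typical small model's model.conf look roughly like this (the exact keys and values vary by model; these are illustrative, not taken from my model). Lowering the beam values makes the decoder prune more aggressively, trading some accuracy for speed:

```
--min-active=200
--max-active=3000
--beam=10.0
--lattice-beam=2.0
```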