STT Bug: Scorer gets stuck on bogus words

My primary use case is running speech recognition on live audio sources like microphones or system audio (see https://github.com/petewarden/spchcat). I've noticed that the first 30 seconds or so of the transcript are usually very accurate, but at some point after that the transcript often seems to get 'stuck', showing a word that doesn't seem accurate and never moving on.

To reproduce this I've captured myself speaking for about a minute, and then fed it through the latest binaries released by Coqui (1.2.0) to demonstrate the expected and actual output. While doing this, I've noticed that disabling the scorer makes the problem go away.

To Reproduce

I have a Colab at https://colab.research.google.com/drive/1xlCkN5AjcFq9sCTTre7lo1M121u9KzT2 that demonstrates the full repro steps, but the summary is:

!./stt --model ./model.tflite --scorer ./large_vocabulary.scorer --audio repro_capture.wav

This command produces the following output:

hello world this is pete warden and i am trying speech recognition on off on off one of noonoon

Expected behavior

The expected output is:

hello world this is pete warden and i am trying speech recognition on off on off on off on off on off on of on off on

I see this when running the same command as above without the scorer enabled:

!./stt --model ./model.tflite --audio repro_capture.wav
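
To pin down where the scorer output starts to go wrong, the two transcripts quoted in this report can be compared word by word. The following is a minimal Python sketch and not part of the original repro; the function name `first_divergence` is hypothetical, and the two strings are copied from the outputs above.

```python
def first_divergence(a, b):
    """Return the index of the first word where the transcripts differ, or None."""
    a_words, b_words = a.split(), b.split()
    for index, (x, y) in enumerate(zip(a_words, b_words)):
        if x != y:
            return index
    # One transcript may simply be a prefix of the other.
    if len(a_words) != len(b_words):
        return min(len(a_words), len(b_words))
    return None


with_scorer = (
    "hello world this is pete warden and i am trying speech recognition "
    "on off on off one of noonoon"
)
without_scorer = (
    "hello world this is pete warden and i am trying speech recognition "
    "on off on off on off on off on off on of on off on"
)

index = first_divergence(with_scorer, without_scorer)
print("first differing word index:", index)
print("with scorer:   ", with_scorer.split()[index:])
print("without scorer:", without_scorer.split()[index:])
```

The word index gives a rough idea of where in the audio to look when inspecting decoder state.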

Environment (please complete the following information):

Ubuntu 18.04.5
TensorFlow installed from Coqui binaries.

Feb 10 '22 22:02 petewarden

Any idea what might be going on here with the scorer, @Aya-AlJafari?

Feb 10 '22 23:02 JRMeyer

From @Aya-AlJafari on the Gitter chatroom:

I have worked on a similar issue where OOV (out-of-vocabulary) words caused STT to get stuck after encountering such a word. This was the result of a fixed, huge penalty for OOV words, which made the expanded beam less probable. As a result, if the scorer had a similar word, it would estimate that word for the OOV, but in small-vocab scenarios things get weird.

Encountering such behavior on in-vocab words is strange, and to figure out what's happening I would need to take a deeper look. I'd start by printing the score of each prefix at the timestep where the issue arises, next to the LM score for each n-gram, to see where the inconsistency occurs, and work from that point. https://github.com/coqui-ai/STT/blob/main/native_client/ctcdecode/ctc_beam_search_decoder.cpp
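
As a rough illustration of the effect described above, the sketch below combines an acoustic log-probability with a language-model score that applies a fixed OOV penalty, then prints the per-candidate scores side by side, in the spirit of the suggested debugging approach. It is a toy and not the Coqui decoder: the weights `ALPHA` and `BETA`, the penalty value, the vocabulary, and every probability are made-up assumptions.

```python
import math

# All numbers below are invented for illustration; they are not taken from
# the Coqui STT decoder or any released scorer.
ALPHA = 0.9            # assumed weight on the language-model log-probability
BETA = 1.2             # assumed per-word bonus
OOV_PENALTY = -1000.0  # assumed fixed log-score for words the toy LM lacks

# Toy unigram "language model": word -> log-probability.
TOY_LM = {
    "run": math.log(0.5),
    "speech": math.log(0.3),
    "cat": math.log(0.2),
}


def lm_log_prob(words):
    """Sum unigram log-probabilities, using the fixed penalty for OOV words."""
    return sum(TOY_LM.get(word, OOV_PENALTY) for word in words)


def score_table(candidates):
    """Return (text, acoustic, lm, combined) rows, best combined score first."""
    rows = []
    for words, acoustic in candidates:
        lm = lm_log_prob(words)
        combined = acoustic + ALPHA * lm + BETA * len(words)
        rows.append((" ".join(words), acoustic, lm, combined))
    return sorted(rows, key=lambda row: row[3], reverse=True)


candidates = [
    (["run", "spchcat"], math.log(0.60)),  # acoustically best, but "spchcat" is OOV
    (["run", "speech"], math.log(0.05)),   # acoustically worse, fully in-vocabulary
]

table = score_table(candidates)
for text, acoustic, lm, combined in table:
    print(f"{text:<12} acoustic={acoustic:8.2f} lm={lm:9.2f} combined={combined:9.2f}")
```

In this toy setup the in-vocabulary candidate ranks first despite its much lower acoustic probability, because the fixed penalty dominates the combined score.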

Feb 14 '22 18:02 petewarden

Is this possibly a duplicate of #1949?

May 29 '22 19:05 DanielSWolf

I seem to be running into the same kind of problem. I'm running the Coqui STT 1.3.0 CLI with the default English models on a 1-minute recording. The error rate is very low overall, but at certain words or phrases ("door nail", "coffin nail", ...) it seems to get mixed up, swallows three or four words, then continues again:

Actual:     old mary was dead as a door nail mind i don't mean to say that i know
Recognized: old mary was dead as a door minion            mean to say that i know

Actual:     what there is particularly dead about a door nail i might have been inclined myself to regard
Recognized: what there is particularly dead about domitian                      inclined myself to regard

Here's my command:

stt --model model.tflite --scorer large_vocabulary.scorer --audio doornail.wav

If I run the same command without specifying the scorer, the result contains some errors, but no such omissions.
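
The swallowed words in the Actual/Recognized pairs above can also be located mechanically. Here is a small Python sketch using the standard-library `difflib.SequenceMatcher` on word lists; the helper name `non_matching_spans` is hypothetical, and the strings are the first pair from this comment.

```python
import difflib


def non_matching_spans(actual, recognized):
    """List (tag, actual_words, recognized_words) for every non-equal span."""
    a_words, r_words = actual.split(), recognized.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=r_words, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((tag, a_words[i1:i2], r_words[j1:j2]))
    return spans


actual = "old mary was dead as a door nail mind i don't mean to say that i know"
recognized = "old mary was dead as a door minion mean to say that i know"

spans = non_matching_spans(actual, recognized)
for tag, actual_words, recognized_words in spans:
    print(tag, actual_words, "->", recognized_words)
```

Running the same comparison against the output produced without the scorer would show whether the multi-word gaps disappear.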

In case it helps, here's the audio file: doornail.zip

May 29 '22 19:05 DanielSWolf
