wav2letter
Inference : Intermediate predictions and timestamps of each individual word in inference
Question
Question1: I'm using SimpleStreamingASRExample.cpp as an example for implementing online inference. Since we are getting a continuous stream of audio bytes, is there a way to return intermediate predictions from the decoder, together with their confidence levels, before the decoder finalizes the prediction? Once the decoder finalizes, I would like to return the final prediction and its confidence levels.
Also, is it possible to return alternative predictions and their confidence levels?
Question2: Currently SimpleStreamingASRExample.cpp returns a transcription broken into 1000 msec buckets. Is it possible to compute and return the individual word-level timing of each word as it appears since the start of the audio?
It would be great if you could provide a way to achieve this, or a rough example of how to get it.
Thank you
@tumusudheer: Regarding your 2nd question, the transcription is printed by the printChunckTranscription function (link). Word-level timestamps can be found in the WordUnit class (link).
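For context, the WordUnit fields referenced in this thread look roughly like the following. This is only a sketch based on how the fields are used in the messages below, not a copy of the actual header:

#include <string>

// Sketch only: the word text plus the begin/end frame indices the decoder
// assigns to it (frame indices, not milliseconds).
struct WordUnit {
  std::string word;
  int beginTimeFrame;
  int endTimeFrame;
};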
Hi @abhinavkulkarni ,
Thank you very much. I modified the printChunckTranscription function as follows. I assumed beginTimeFrame and endTimeFrame are in milliseconds; if not, please let me know.
void printChunckTranscription(
    std::ostream& output,
    const std::vector<WordUnit>& wordUnits,
    int chunckStartTime,
    int chunckEndTime) {
  output << chunckStartTime << "," << chunckEndTime << ",";
  for (const auto& wordUnit : wordUnits) {
    output << wordUnit.word << " (" << wordUnit.beginTimeFrame << "ms-"
           << wordUnit.endTimeFrame << "ms) ";
  }
  output << std::endl;
}
Then I ran the attached audio file
timing_test.zip
with this command:
cat /tmp/timing_test.wav | inference/inference/examples/simple_streaming_asr_example --input_files_base_path /tmp/inference_examples/inference/model
and got the output below:
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,hallo (8ms-10ms)
2000,3000,how (7ms-7ms) are (9ms-9ms) you (11ms-11ms)
3000,3405,doing (14ms-14ms) good (19ms-19ms)
Completed converting audio input from stdin to text... elapsed time=451 milliseconds
But in the audio, the first word "Hello" (though transcribed as "hallo") ends at about 500 milliseconds, yet its timings in the transcription landed in the 1000-2000 millisecond bucket. Similarly, the word "how" spans roughly 900-1100 milliseconds in the original audio, but the transcription timings are completely different.
Maybe beginTimeFrame and endTimeFrame are not milliseconds but something else, and I need to convert them to get the word timings in the original audio?
Hi @tlikhomanenko , do you have any insights on how to get word-level timestamps (as the words appear in the original audio) during inference? If there are modifications we can make to the inference decoder or any other class to get these, please suggest them and I'll do the code changes. It seems #400 has some suggestions, but I could not understand how to proceed.
Thank you
cc @avidov @xuqiantong @vineelpratap, who are more familiar with the inference pipeline. Could you help navigate here?
Question1: I'm using SimpleStreamingASRExample.cpp as an example for implementing online inference. Since we are getting a continuous stream of audio bytes, is there a way to return intermediate predictions from the decoder, together with their confidence levels, before the decoder finalizes the prediction?
@vineelpratap may correct me, but I think that intermediate predictions are what you get, since we currently do not consider history but simply process one chunk at a time.
Hi @avidov ,
Thank you very much. But if we process one chunk at a time, won't the words that fall at the border between chunks be transcribed inaccurately, since their utterances may be split across both chunks? How do we handle these kinds of utterances? E.g., if half of the utterance of W1 falls in chunk 1 and the other half falls in chunk 2.
Also, since you mentioned that the current inference code emits intermediate results, how do we get the final results with their confidence values?
For Question # 2, what code modifications can we make to get word-level timestamps for the transcription (timestamps as the words appear in the original audio)?
Hi @vineelpratap @avidov @tlikhomanenko
If I disable this line about the pruning step and then run the audio I've attached through the SimpleStreamingASR inference code, the WordUnit frame numbers (beginTimeFrame and endTimeFrame) give some kind of information about the word timings.
This is the output I got:
hello (7 - 7) how (17 - 17) are (20 - 20) you (22 - 22) doing (36 - 36) good (41 - 41)
So the WordUnit class's beginTimeFrame and endTimeFrame contain some relevant information about the timings (at least the starting time of each word).
How can I convert this to the actual time of the word in the audio, perhaps via some kind of multiplier involving the total length of the audio (3405.0 milliseconds for the attached timing_test.zip: https://github.com/facebookresearch/wav2letter/files/5406149/timing_test.zip) and the maximum number of frames / maximum number of tokens?
Hi,
It would not be correct to rely on the timestamps given by the framework, as we did not try to build the system for that. The best estimate you can make for frames (a - b) is a * 80 - 40 ms to b * 80 + 40 ms.
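In code, that estimate could look like the following minimal sketch, assuming (per the comment above) roughly 80 ms per decoder output frame and a 40 ms margin on each side. estimateWordTimeMs is a hypothetical helper, not part of the wav2letter API:

#include <algorithm>
#include <utility>

// Rough estimate only: the framework was not designed to produce word
// timings. Maps a WordUnit's (beginTimeFrame, endTimeFrame) pair to an
// approximate [startMs, endMs] interval in the original audio.
std::pair<int, int> estimateWordTimeMs(int beginTimeFrame, int endTimeFrame) {
  int startMs = std::max(0, beginTimeFrame * 80 - 40);
  int endMs = endTimeFrame * 80 + 40;
  return {startMs, endMs};
}

For the "hello (7 - 7)" line in the pruning-disabled output above, this gives roughly 520-600 ms, which is only a coarse estimate of where the word actually sits in the audio.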
Hi @vineelpratap ,
Thank you very much. You are correct, I'm not getting a good estimate of the timings with the current framework, even when using the frames.
However, the online decoder is very fast and impressive. Great job, and thank you very much for this. I have a couple of questions:
- How can I run the online decoder in the inference code with the lexicon-free option (where I can set use_lexicon=false)?
- It seems the inference code is not using the GPU; how can I make it run on a GPU? It may run much faster on a GPU.
Thank you
Hey @tumusudheer,
How can I run the online decoder in the inference code with the lexicon-free option (where I can set use_lexicon=false)?
Discussion about this is currently ongoing here. If you have any additional insight, feel free to share.
Thanks!
Hey @tumusudheer, were you able to find a solution for finding word timings in the original audio?
I was able to do this with OpenSeq2Seq, but the transcription quality is really bad, and am hoping to use a better model running on wav2letter/flashlight.
Hi @micahjon , I have a question: were you able to run OpenSeq2Seq on wav2letter? I see that speech2text uses a DeepSpeech framework.
Hi @xuqiantong,
I referred to your comments in ticket #400 (https://github.com/flashlight/wav2letter/issues/400#issuecomment-529723436). Could you please help me by explaining how to get the begin and end time of each word from the raw token and word sequences returned by the decoder? Please provide an example if possible.
- I think my model has the frame stride set to 10ms.
- The MFSC feature is set to true.
Thanks, Vamsi Chagari
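For anyone attempting this, below is a hedged sketch of one way to derive per-word begin/end times from a per-frame raw token sequence like the one discussed in #400. All names and parameters are illustrative assumptions, not the actual wav2letter API: it assumes the decoder exposes one token per output frame, that a word-separator token (e.g. "|") marks word boundaries, that a blank token pads frames, and that each output frame covers frameStrideMs milliseconds (the 10 ms feature stride multiplied by the model's striding factor).

#include <string>
#include <vector>

struct WordTiming {
  std::string word;
  int beginMs;
  int endMs;
};

// Illustrative only: groups per-frame tokens into words and assigns each
// word the time span of the frames that produced its tokens.
std::vector<WordTiming> wordTimingsFromTokens(
    const std::vector<std::string>& rawTokens, // one token per output frame
    const std::string& wordSepToken,           // e.g. "|"
    const std::string& blankToken,             // CTC-style blank, if any
    int frameStrideMs) {                       // ms covered by one frame
  std::vector<WordTiming> result;
  std::string curWord;
  std::string lastToken;
  int beginFrame = -1;
  int endFrame = -1;

  auto flush = [&]() {
    if (!curWord.empty()) {
      result.push_back({curWord,
                        beginFrame * frameStrideMs,
                        (endFrame + 1) * frameStrideMs});
    }
    curWord.clear();
    lastToken.clear();
    beginFrame = endFrame = -1;
  };

  for (int frame = 0; frame < static_cast<int>(rawTokens.size()); ++frame) {
    const std::string& tok = rawTokens[frame];
    if (tok == wordSepToken) {
      flush();           // word boundary: emit the accumulated word
    } else if (tok == blankToken) {
      lastToken.clear(); // blank frame: allows a repeated letter next
    } else {
      if (beginFrame < 0) {
        beginFrame = frame;
      }
      endFrame = frame;
      if (tok != lastToken) { // collapse frame-level repeats of a token
        curWord += tok;
        lastToken = tok;
      }
    }
  }
  flush();               // emit a trailing word, if any
  return result;
}

The resulting intervals inherit the same frame-level uncertainty mentioned earlier in this thread, so they should be treated as rough estimates rather than exact word boundaries.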