No duration or confidence info for alignments
I've tried generating alignments for a pruned_transducer_stateless7 model using https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7/compute_ali.py. Looking at the output cuts, I can only find the start times for tokens/words. Is there a way to get the duration and confidence information too?
Sorry, we can only get the start time of a token for transducer models.
Is it safe to assume that a token's end time is the start time of the next token? That wouldn't be accurate if there is silence between words, unless the aligner can predict blank tokens. Can this be achieved?
As for the confidences, I've thought about estimating them from the log_probs, as in #1092.
I think the time in the alignment corresponds to the position at which the non-blank symbol was emitted by the transducer model. The transducer posteriors are all blanks, with 1-frame spikes for the non-blank symbols, so token durations cannot be recovered from the posteriors themselves. Some heuristic is needed, possibly also relying on endpointing...
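A minimal sketch of one such heuristic, assuming the per-token emission frames from compute_ali.py are available (the frame shift and the duration cap below are hypothetical values, not anything icefall prescribes): take each token's end to be the next token's start, and cap the duration so a long silence is not swallowed by the preceding token.

```python
def spikes_to_intervals(starts, num_frames, frame_shift_s=0.04, max_dur_s=0.5):
    """Turn 1-frame emission spikes into (start, duration) pairs in seconds.

    starts:        per-token emission frames (ascending).
    num_frames:    utterance length in frames.
    frame_shift_s: seconds per frame (e.g. 0.04 for 10 ms features with
                   4x subsampling) -- an assumption, adjust to your model.
    max_dur_s:     cap so inter-word silence is not attributed to a token.
    """
    intervals = []
    for i, start in enumerate(starts):
        # A token "ends" where the next one starts; the last one at the
        # utterance end.
        end = starts[i + 1] if i + 1 < len(starts) else num_frames
        dur = min((end - start) * frame_shift_s, max_dur_s)
        intervals.append((start * frame_shift_s, dur))
    return intervals

# Example: spikes at frames 3, 10, 11 in a 30-frame utterance.
print(spikes_to_intervals([3, 10, 11], num_frames=30))
```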
For that project I was experimenting with the transducer confidences; this was integrated into sherpa-onnx. However, the best normalized cross-entropy (NCE) achieved was only 0.169, which is quite low...
The way to compute confidence was (sketched below):
- not considering the blank posteriors
- temperature scaling (T = 2.0) of the joiner output
- the lowest token posterior serving as a proxy for the word score
- a 2-parameter logistic regression to calibrate the word score
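A minimal sketch of that recipe, assuming per-frame joiner logits and the emission frames/ids of a word's tokens are available (the calibration parameters `a` and `b` are placeholders one would fit on held-out data, not values from the sherpa-onnx integration):

```python
import torch

def word_confidence(joiner_logits, token_frames, token_ids,
                    blank_id=0, temperature=2.0, a=1.0, b=0.0):
    """Calibrated confidence for one word.

    joiner_logits: (num_frames, vocab_size) raw joiner outputs.
    token_frames:  frame index at which each of the word's tokens was emitted.
    token_ids:     the emitted (non-blank) token ids.
    a, b:          2-parameter logistic-regression calibration
                   (hypothetical values; fit them on held-out data).
    """
    # Temperature-scale the joiner output before the softmax.
    log_probs = torch.log_softmax(joiner_logits / temperature, dim=-1)

    # Ignore the blank posteriors: renormalize over non-blank tokens only.
    keep = torch.ones(log_probs.size(-1), dtype=torch.bool)
    keep[blank_id] = False
    nonblank = torch.log_softmax(log_probs[:, keep], dim=-1)

    # Posterior of each emitted token at its emission frame.
    # Ids above blank_id shift down by one once the blank column is dropped.
    rows = torch.tensor(token_frames)
    cols = torch.tensor([t - 1 if t > blank_id else t for t in token_ids])
    token_post = nonblank[rows, cols].exp()

    # The lowest token posterior is the (proxy) word score ...
    word_score = token_post.min()

    # ... calibrated with a 2-parameter logistic regression.
    return torch.sigmoid(a * word_score + b)
```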
I'm not sure whether better NCE results could be achieved by taking the acoustic score "under" the CTC alignment. It is possible, but I did not try that... maybe later...
Perhaps this could be integrated with icefall, even without the FSTs: https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html
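A minimal sketch of that torchaudio API, with random placeholder emissions just to show the shapes (a real integration would feed CTC log-probs from an icefall model instead):

```python
import torch
import torchaudio.functional as F

# log_probs: (batch=1, num_frames, vocab) CTC emissions; targets: the
# transcript's token ids. These are random placeholders, not a real model.
torch.manual_seed(0)
log_probs = torch.randn(1, 50, 30).log_softmax(dim=-1)
targets = torch.tensor([[7, 3, 12, 5]])

# Frame-level forced alignment and per-frame log-prob scores.
aligned_tokens, scores = F.forced_align(log_probs, targets, blank=0)

# Collapse repeats/blanks into per-token spans with start/end frames.
spans = F.merge_tokens(aligned_tokens[0], scores[0], blank=0)
for span in spans:
    print(span.token, span.start, span.end, f"{span.score:.3f}")
```

merge_tokens yields per-token spans, which directly give start/end frames plus an averaged score that could serve as a confidence proxy, without building any FSTs.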