pyctcdecode
confidence scores output from the LM
Is there a way to get confidence scores (word/sub-word level) as part of the output as well? With decode_beams, it is possible to get the time information for alignment purposes and the KenLM state, in addition to the segment-level probabilities. It would be a nice addition if word-level confidence scores were also exposed. Since these are calculated from the AM and LM (and optionally hotwords), we could do fine-grained analysis at the word level to remove or emphasize some words, as desired.
Hi, thanks for the question. As for the AM, we decided not to include confidences out of the box, since there is no unique way to calculate them. Using the frame-level annotations and averaging the probabilities (or similar) is probably the best bet here. Respecting the LM and hotwords gets a bit more complicated, since neither is really normalized in a good way, and the right choice would probably depend heavily on the downstream task. Open to suggestions, though, if you have a strong use case.
Hi @gkucsko, thank you very much for your reply. I can get the confidence from the e2e AM by averaging the frame-level probabilities, as you mentioned. But with the LM, knowing the confidence with which a word is predicted could shed light on the contribution of the LM (not just perplexity) and help us decide whether a particular word is suitable for further processing in SLU tasks. If the contributions of the individual modules can be segregated at the word level, there should be a way to track the individual word confidences back from the top beam.
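For reference, a minimal sketch of the frame-averaging idea mentioned above. It assumes `logits` is a (time, vocab) matrix of frame-level log-probabilities from the acoustic model, `labels` is the matching CTC vocabulary, and the top beam from decode_beams unpacks as (text, lm_state, word_frames, logit_score, lm_score); the averaging scheme itself is just one possible choice, not something pyctcdecode prescribes:

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

def word_confidences(logits, labels, kenlm_model_path=None):
    """Average frame-level probabilities over each word's frame span.

    Sketch only: `logits` is assumed to be a (time, vocab) matrix of
    frame-level log-probabilities; `labels` is the matching vocabulary.
    """
    decoder = build_ctcdecoder(labels, kenlm_model_path=kenlm_model_path)
    # top beam is assumed to unpack as (text, lm_state, word_frames, logit_score, lm_score),
    # where word_frames is a list of (word, (start_frame, end_frame))
    text, _, word_frames, logit_score, lm_score = decoder.decode_beams(logits)[0]
    scores = []
    for word, (start, end) in word_frames:
        # take the most likely token per frame as a proxy for the emitted token,
        # then average the probabilities over the word's frame span
        frame_probs = np.exp(logits[start:end]).max(axis=-1)
        scores.append((word, float(frame_probs.mean())))
    return text, scores
```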
I'd also be very interested in this addition!
I think it should be relatively easy to additionally return the lm_score + am_score that pyctcdecode gives each word, no?
Not sure if I understand the code 100%, but this line here:
https://github.com/kensho-technologies/pyctcdecode/blob/9071d5091387579b4722cfcbe0c8597ad0b16227/pyctcdecode/decoder.py#L326
defines the lm_score + am_score probability that is given by pyctcdecode, no?
The am_score corresponds to logit_score, and if I understand correctly this is just \sum_{i=word_start}^{word_end} \log(logit[i]), and lm_score is the language model score returned by KenLM, weighted by alpha and beta, no?
So if we could just save those scores in some kind of list that would be very helpful IMO
What do you think @gkucsko ?
Also cc @lopez86 :-)
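To make the "save those scores in some kind of list" idea concrete, here is a purely hypothetical sketch of what such a per-word output could look like; `word_scores`, the frame spans, and the numbers are illustrative and not an existing pyctcdecode return value:

```python
# hypothetical shape of the proposed per-word output -- not a current pyctcdecode API.
# Each entry pairs a word with its frame span, the summed acoustic log-probability
# for that span, and the alpha/beta-weighted KenLM contribution for that word.
word_scores = [
    # (word, (start_frame, end_frame), am_score, lm_score)
    ("hello", (0, 12), -1.3, -2.7),
    ("world", (13, 30), -0.8, -1.9),
]

# example downstream use (cf. the SLU motivation above): keep only words whose
# combined score clears a threshold
threshold = -4.0
kept = [word for word, _, am, lm in word_scores if am + lm >= threshold]
```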
The main problem with using lm_score (which is already returned here: https://github.com/kensho-technologies/pyctcdecode/blob/9071d5091387579b4722cfcbe0c8597ad0b16227/pyctcdecode/decoder.py#L498) for confidence scoring is that the score is not at all normalized for length, e.g. a longer transcription will necessarily have a lower lm_score. One could normalize the score by the number of words, but I wonder whether it's better to take the minimum over the words, as described here.
Also related: https://discuss.huggingface.co/t/confidence-scores-self-training-for-wav2vec2-ctc-models-with-lm-pyctcdecode/17052
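For illustration, a small sketch of the two normalization options mentioned above (length-normalizing the total lm_score vs. letting the weakest word dominate), assuming per-word log-scores are available, e.g. from a list like the hypothetical one sketched earlier:

```python
import math

def length_normalized(lm_score: float, num_words: int) -> float:
    # divide the total LM log-score by the word count so that short and
    # long transcriptions become comparable
    return lm_score / max(num_words, 1)

def min_word_confidence(word_log_scores: list) -> float:
    # pessimistic alternative: the weakest word determines the
    # sentence-level confidence
    return math.exp(min(word_log_scores))
```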