CTranslate2
Whisper - Correct way to get prediction probability of each token and timestamp alignment
I have spent time looking at the documentation but could not find a proper way to get the prediction probabilities of all tokens. Also, how can I get the token-to-time alignment output from the model with only the audio input features as input? I did see the align function, but it requires both the input features and the input text tokens, which does not meet my need.
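To make the alignment half of the question concrete, here is a minimal sketch of the post-processing step, assuming a two-pass approach is acceptable: the text tokens come from a prior generate call and are then passed to align together with the features. It also assumes the alignment comes back as (text token index, time index) pairs and that each time index covers 0.02 s (1500 encoder positions per 30 s window). The helper name `token_time_spans` and the sample pairs are hypothetical, and no CTranslate2 call is made here.

```python
from collections import defaultdict

# Assumption: one time index spans 0.02 s (1500 encoder positions per 30 s).
SECONDS_PER_INDEX = 0.02


def token_time_spans(alignment_pairs, seconds_per_index=SECONDS_PER_INDEX):
    """Map (token_index, time_index) pairs to {token_index: (start_s, end_s)}.

    Hypothetical helper: the start is the first time index aligned to the
    token, the end is one past the last time index aligned to it.
    """
    indices = defaultdict(list)
    for token_index, time_index in alignment_pairs:
        indices[token_index].append(time_index)
    return {
        token: (min(times) * seconds_per_index,
                (max(times) + 1) * seconds_per_index)
        for token, times in sorted(indices.items())
    }


# Made-up alignment pairs for three tokens, for illustration only.
pairs = [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4)]
spans = token_time_spans(pairs)
for token, (start, end) in spans.items():
    print(f"token {token}: {start:.2f}s to {end:.2f}s")
```

This only covers turning alignment pairs into time spans. It does not address getting per-token probabilities in a single pass from the audio features alone, which is the open part of the question.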