[Audio.transcribe] Logprobs for each token in verbose_json
Describe the feature or improvement you're requesting
Currently, Whisper exposes avg_logprob for an entire segment. The request is to expose logprobs for each token.
{
  "duration": 4.01,
  "language": "english",
  "segments": [
    {
      "avg_logprob": -0.40153955010806813,
      "compression_ratio": 1.0526315789473684,
      "end": 4.0,
      "id": 0,
      "no_speech_prob": 0.1633709967136383,
      "seek": 0,
      "start": 0.0,
      "temperature": 0.0,
      "text": " Testing, testing, this is going to be a new audio recording.",
      "tokens": [
        50364, 45517, 11, 4997, 11, 341, 307, 516,
        281, 312, 257, 777, 6278, 6613, 13, 50564
      ],
      "transient": false
    }
  ],
  "task": "transcribe",
  "text": "Testing, testing, this is going to be a new audio recording."
}
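To illustrate why the segment-level value is limiting, here is a minimal sketch that turns avg_logprob into a rough confidence score. It assumes avg_logprob is a length-normalized mean of token log probabilities, so exp(avg_logprob) is only approximately a geometric-mean token probability. The helper name segment_confidence and the trimmed segment dict are hypothetical and not part of any SDK.

```python
import math


def segment_confidence(segment: dict) -> float:
    """Approximate mean token probability for a segment (hypothetical helper).

    Assumes avg_logprob is a length-normalized mean of token log probabilities.
    """
    return math.exp(segment["avg_logprob"])


# Trimmed copy of the segment from the response above (only the fields used here).
segment = {
    "id": 0,
    "avg_logprob": -0.40153955010806813,
    "text": " Testing, testing, this is going to be a new audio recording.",
}

print(f"segment {segment['id']}: ~{segment_confidence(segment):.2f} mean token probability")
```

A single number like this cannot say which word in the segment the model was unsure about, which is what per-token logprobs would add.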
This is due to a limitation in Whisper. I skimmed through https://github.com/openai/whisper/blob/7858aa9c08d98f75575035ecd6481f462d66ca27/whisper/decoding.py#L110 and the good news is that it doesn't seem like it would be that hard to change. The key line of code is this:
logprobs = F.log_softmax(logits.float(), dim=-1)
The main modification you'd need to make would be adding per-token log probabilities to https://github.com/openai/whisper/blob/7858aa9c08d98f75575035ecd6481f462d66ca27/whisper/transcribe.py#L23, where currently only avg_logprob is included.
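As a rough illustration of the idea (not Whisper's actual code), the sketch below shows how a per-token log probability falls out of a log-softmax over each decoding step's logits, and how a segment average is derived from those values. It uses a standard-library stand-in for F.log_softmax so it runs without PyTorch. The function names, the 4-token vocabulary, and the logits are all made up for the example, and Whisper's exact length normalization for avg_logprob may differ from the plain mean used here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one decoding step's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def token_logprobs(step_logits, chosen_tokens):
    """Return the log probability of the chosen token at each decoding step."""
    return [
        log_softmax(logits)[token]
        for logits, token in zip(step_logits, chosen_tokens)
    ]


# Toy data: three decoding steps over a 4-token vocabulary (made-up numbers).
step_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 3.0, 0.2, 0.1],
    [1.0, 1.0, 1.0, 1.0],
]
chosen_tokens = [0, 1, 2]

per_token = token_logprobs(step_logits, chosen_tokens)
# Plain mean for illustration; the real normalization is an assumption here.
avg_logprob = sum(per_token) / len(per_token)

for token, logprob in zip(chosen_tokens, per_token):
    print(f"token {token}: logprob={logprob:.3f}")
print(f"avg_logprob={avg_logprob:.3f}")
```

The per-step values already exist at the point where the average is computed, so exposing them would mostly mean keeping the list around instead of collapsing it into one number.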
Hey! I flagged this to the team. I am going to close this for now since this repo is for the Python SDK, not API feedback, but I will follow up if we end up adding this.