Whisper - get probability of detected language
System Info
- transformers version: 4.38.0.dev0
- Platform: Linux-4.15.0-142-generic-x86_64-with-glibc2.23
- Python version: 3.10.11
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.2
- Accelerate version: 0.27.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
Who can help?
@sanchit-gandhi, I guess, since he's the one who provided the answer in the previous GitHub issue.
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Following #25138, @sanchit-gandhi provided an answer showing how to retrieve the language using the Whisper model and processor (since Whisper's conditional tokens include the language token). He later provided a small adaptation to also get the probability of that language, which is a nice possibility. However, with the latest version of transformers this no longer seems possible (which is why I'm filing it as a bug, though it could also be a feature request).
Quick example to check:
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

language_identification = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda:0")
lid_processor = WhisperProcessor.from_pretrained("openai/whisper-small")

audio, _ = librosa.load(<my_file>, sr=16000)
lid = lid_processor(audio, sampling_rate=16000, return_tensors="pt", truncation=True)
input_features = lid.input_features.to("cuda:0", torch.float32)

outputs = language_identification.generate(input_features,
                                           output_scores=True,
                                           return_dict_in_generate=True,
                                           max_new_tokens=1)
pred_text = lid_processor.batch_decode(outputs.sequences, skip_special_tokens=False)
pred_text

pred_text is:
['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> 80']
Here we see the conditional tokens as well as my single transcription token 80 (because of max_new_tokens=1).
The issue is that the outputs.scores object (which holds the scores used for the probabilities of each token; its size is (N_TOKENS, 1, 51865), where 51865 is Whisper's vocabulary size) only contains the scores for the tokens generated after the conditional tokens. I.e., outputs.scores has a length of only 1 because I asked for only 1 generated token (if I had asked for 5, I would have gotten a length of 5).
This means that the transition scores, computed as follows:
transition_scores = language_identification.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
will only contain the scores for the tokens generated after the special tokens SoT, lang, task and notimestamps (when timestamps are not requested).
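As a quick illustration of the mismatch, here is a minimal check reusing outputs and transition_scores from the snippets above (the shapes in the comments are assumptions based on the decoded sequence shown earlier):

print(outputs.sequences.shape)    # (1, 5): SoT, lang, task, notimestamps + 1 generated token
print(len(outputs.scores))        # 1: only the token generated after the conditional prompt
print(transition_scores.shape)    # (1, 1): no entry for the language token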
I also tried without asking for timestamps, because my guess was that, since the notimestamps token comes after the lang and task tokens, manually injecting notimestamps might make the code fall into a special if branch where the scores of the previous tokens (lang and task) would somehow be ignored.
Expected behavior
I would have expected outputs.scores to contain the score for the language token (provided the language isn't forced, obviously), as was probably intended according to the answer in #25138.
With that, we could easily read off the score of the detected language, and maybe build a ranking (like EN with a score of 0.8, FR with a score of 0.1, and so on); see the sketch below.
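For what it's worth, a minimal sketch of that ranking, assuming outputs.scores[0] covered the language position (the behaviour observed with transformers 4.28.1); the candidate token list is only an illustrative subset:

# Hypothetical sketch: rank candidate languages from the scores at the language
# position, assuming outputs.scores[0] covers that position (old behaviour).
candidate_tokens = ["<|en|>", "<|fr|>", "<|de|>", "<|es|>"]  # illustrative subset of Whisper's language tokens
candidate_ids = lid_processor.tokenizer.convert_tokens_to_ids(candidate_tokens)

lang_scores = outputs.scores[0][0]                      # (vocab_size,)
lang_probs = lang_scores[candidate_ids].softmax(dim=-1)
ranking = sorted(zip(candidate_tokens, lang_probs.tolist()), key=lambda x: -x[1])
print(ranking)   # e.g. [('<|en|>', 0.8), ('<|fr|>', 0.1), ...]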
I checked with a previous version of transformers (4.28.1) and it seems to work there: max_new_tokens=1 generates only the language token, so the associated score is indeed that of the language token.
It therefore looks like a regression / bug in the new version of transformers.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey @antoinethl. Sorry for the delay. When you tried with the older version of transformers, are you sure that the decoder_input_ids were not just 2 tokens?
This could just mean that the generation config was changed (there have been lots of updates), and that by default it now adds notimestamps and the predicted language token.
I am not the best person to talk about this, as I missed a few issues, but it sounds like it's not a regression, maybe a feature request!
Hey @antoinethl, sorry for the delay here. Previously, we computed the log-probs for the language and task tokens, even if these were implicitly specified by the user. Since https://github.com/huggingface/transformers/pull/28687, we now pass the language and task tokens as decoder input ids to the model. This saves two forward passes of the decoder per decoding loop, since we no longer have to run a forward pass for these tokens, but we lose the log-prob computation.
If you're happy doing an extra forward pass of the encoder and decoder, you can compute the language probability scores as follows:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset, Audio
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny", low_cpu_mem_usage=True)
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model.to(device)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(16_000))
sample = next(iter(dataset))
# pre-process the audio inputs for sequential long form generation
inputs = processor([sample["audio"]["array"], sample["audio"]["array"]], padding=True, truncation=False, return_attention_mask=True, return_tensors="pt", sampling_rate=16_000).to(device)
input_stride = model.model.encoder.conv1.stride[0] * model.model.encoder.conv2.stride[0]
num_segment_frames = input_stride * model.config.max_source_positions
batch_size = inputs.input_features.shape[0]
# predict the language from the first 30-second chunk
decoder_input_ids = (torch.ones((batch_size, 1), device=device, dtype=torch.long) * model.generation_config.decoder_start_token_id)
input_features = inputs.input_features[:, :, :num_segment_frames]
with torch.no_grad():
    logits = model(input_features, decoder_input_ids=decoder_input_ids).logits[:, -1]
# auto-regressively generate
pred_ids = model.generate(**inputs)
pred_text = processor.batch_decode(pred_ids)
language_probs = torch.gather(logits, 1, pred_ids[:, 1:2]).squeeze(1)
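If you want an actual probability ranking over languages rather than the raw logit of the predicted token, one possible follow-up (illustrative, not part of the snippet above; the candidate list is only a subset) would be to softmax the logits over the language token ids:

# Possible follow-up: restrict the logits to a set of Whisper language tokens
# and softmax to get per-language probabilities for each batch item.
candidate_tokens = ["<|en|>", "<|fr|>", "<|de|>", "<|es|>"]  # extend to all languages as needed
candidate_ids = processor.tokenizer.convert_tokens_to_ids(candidate_tokens)

lang_probs = logits[:, candidate_ids].softmax(dim=-1)   # (batch_size, num_candidates)
for row in lang_probs:
    print(sorted(zip(candidate_tokens, row.tolist()), key=lambda x: -x[1]))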
We use a similar logic in the generation code in Whisper: https://github.com/huggingface/transformers/blob/b7d002bdff3646cfd55f120b2b9e1b065d54fae5/src/transformers/models/whisper/generation_whisper.py#L1210
If you feel strongly that the language prob should also be part of the generation output, this is definitely something we can discuss. It's the first time I've seen this requested since we did the refactoring of Whisper generate, so to me it looks like solving it with an extra few lines of code and doing an extra forward pass might be the easiest solution here.
cc @kamilakesbi
Gentle ping @kamilakesbi
For now I haven't seen any further requests to integrate the language probability as part of the Whisper output. In the interest of keeping the outputs from generate consistent with other models, I suggest we leave the generation code as is, and encourage users to run an extra encoder + decoder forward pass should they need the language probs.
Note that the return_language argument is available using the pipeline API. You can use it as follows, @antoinethl:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30.0,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample, return_language=True)
print(result)
Which gives the predicted language for each chunk:
{'text': ' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.',
'chunks': [{'language': 'english',
'timestamp': (0.0, 5.86),
'text': ' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'}]}
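The detected language can then be read directly from the returned dict, e.g.:

# Read the per-chunk language out of the pipeline result
for chunk in result["chunks"]:
    print(chunk["language"], chunk["timestamp"], chunk["text"])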
return_language doesn't seem to work with word-level timestamps.
Hi @hanif-rt, this should be solved with PR #31572 :)
@sanchit-gandhi Hi, I wonder how I can directly get the logits from the generate method. Thanks!
You should use output_scores=True, return_dict_in_generate=True when calling generate.
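For example, a minimal sketch reusing the model and input_features names from the earlier snippets:

# Ask generate to return the per-step scores alongside the sequences
outputs = model.generate(
    input_features,
    output_scores=True,
    return_dict_in_generate=True,
)
print(len(outputs.scores))       # one entry per generated token
print(outputs.scores[0].shape)   # (batch_size, vocab_size)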
@ArthurZucker Thanks!!! I am also wondering how I can check the original generate function code. In generation_whisper.py I can see super().generate(), but I don't know where I can find the parent's implementation.
https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1588 🤗
@ArthurZucker Thanks! 🤗
@sanchit-gandhi: Setting the "return_language" flag to true does not help for the multilingual use case. The model returns only one language even though there are multiple languages in the given audio, and for English it returns the language id as None.
Test case and results: the audio contains "Hello, how are you? Hola, como estas? Bonjour, como se va?". The model gave the following result: {'text': 'Hello, how are you? Hola, como estas? Bonjour, como se va?', 'chunks': [{'language': None, 'text': 'Hello, how are you? Hola, como estas? Bonjour, como se va?'}]}
Ask: is there a way to get all the language ids and their probabilities via the pipeline interface?
That is why this is a feature request! As for getting all the predicted language ids, I am not entirely sure; I haven't dug into the generate code in a while, but the model itself cannot switch languages within a 30-second segment, only between consecutive 30-second segments.
+1 to register the request for integrating the language probability as part of the Whisper output.
Suggestion: when return_language_prob is set in pipe(), return the language_prob of the language with the highest probability.
@sanchit-gandhi
I want to join the list of people here who were negatively affected by this change and wish it did not affect users that do not explicitly specify a language in .generate().
Our lab has used langdetect with Whisper and .generate() extensively to prepare and filter datasets for training speech-to-text models. We have also used the functionality in large-scale batched inference, where we first detect the language for each chunk and then perform batched inference where different chunks may be in different languages (roughly as sketched below).
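A rough sketch of that per-chunk workflow (hypothetical code, reusing the model, processor and device names from the earlier snippets; segments are naive 30-second slices):

import torch

# Hypothetical per-chunk language detection: slice the waveform into 30-second
# segments and read the most likely token at the language position of each one.
def detect_language_per_chunk(audio, sampling_rate=16_000, chunk_s=30):
    chunk_len = chunk_s * sampling_rate
    languages = []
    for start in range(0, len(audio), chunk_len):
        segment = audio[start:start + chunk_len]
        features = processor(segment, sampling_rate=sampling_rate, return_tensors="pt").input_features.to(device)
        decoder_input_ids = torch.tensor([[model.generation_config.decoder_start_token_id]], device=device)
        with torch.no_grad():
            logits = model(features, decoder_input_ids=decoder_input_ids).logits[:, -1]
        # assumes the highest-scoring token at this position is a language token
        languages.append(processor.tokenizer.convert_ids_to_tokens(logits.argmax(dim=-1).item()))
    return languages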
This is an annoyance from our viewpoint, so we pin our transformers version to before this change was implemented.
I understand that it may not be relevant to return these probabilities when a user explicitly specifies the language, but the fact that these outputs are automatically removed even for users who do not specify a language is not ideal.
Very strong +1 for some kind of option in .generate() for returning all logits.
for reference @ylacombe