Whisper - get probability of detected language
System Info
- transformers version: 4.38.0.dev0
- Platform: Linux-4.15.0-142-generic-x86_64-with-glibc2.23
- Python version: 3.10.11
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.2
- Accelerate version: 0.27.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
Who can help?
@sanchit-gandhi, I guess, since he's the one who provided the answer in the previous GitHub issue.
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Following #25138, @sanchit-gandhi provided an answer showing how to retrieve the language using the Whisper model and processor (since Whisper's conditional tokens include the language token). He later provided a small adaptation to also get the probability of that language, which is a nice possibility. However, with the latest version of transformers this no longer seems possible (which is why I'm filing it as a bug, though it could also be a feature request).
Quick example to check:
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

language_identification = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda:0")
lid_processor = WhisperProcessor.from_pretrained("openai/whisper-small")

audio, _ = librosa.load(<my_file>, sr=16000)
lid = lid_processor(audio, sampling_rate=16000, return_tensors="pt", truncation=True)
input_features = lid.input_features.to("cuda:0", torch.float32)

outputs = language_identification.generate(input_features,
                                           output_scores=True,
                                           return_dict_in_generate=True,
                                           max_new_tokens=1)
pred_text = lid_processor.batch_decode(outputs.sequences, skip_special_tokens=False)
pred_text

pred_text is:
['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> 80']
Here we see the conditional tokens as well as my single transcription token 80 (because of max_new_tokens=1).
The issue is that the outputs.scores object (which holds the scores used for the probabilities of each token; its size is (N_TOKENS, 1, 51865), where 51865 is Whisper's vocabulary size) only contains the scores for the tokens generated after the conditional tokens. I.e., outputs.scores has a length of only 1 because I asked for only 1 generated token (if I had asked for 5, I would have gotten a length of 5).
This means that the transition scores, computed as follows:
transition_scores = language_identification.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
will only contain the scores for the tokens generated after the special tokens SoT, lang, task and notimestamps (when timestamps are not requested).
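As a quick illustration of the mismatch, here is a minimal check reusing outputs and transition_scores from the snippets above (the shapes in the comments are assumptions based on the decoded sequence shown earlier):

print(outputs.sequences.shape)    # (1, 5): SoT, lang, task, notimestamps + 1 generated token
print(len(outputs.scores))        # 1: only the token generated after the conditional prompt
print(transition_scores.shape)    # (1, 1): no entry for the language token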
I also tried without asking for timestamps, because my guess was that, since the notimestamps token comes after the lang and task tokens, manually injecting notimestamps might make the code fall into a special if branch where the scores of the previous tokens (lang and task) would somehow be ignored.
Expected behavior
I would have expected outputs.scores to contain the score for the language token (provided the language isn't forced, obviously), as was probably intended according to the answer in #25138.
With that, we could easily read off the score of the detected language, and maybe build a ranking (like EN with a score of 0.8, FR with a score of 0.1, and so on); see the sketch below.
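For what it's worth, a minimal sketch of that ranking, assuming outputs.scores[0] covered the language position (the behaviour observed with transformers 4.28.1); the candidate token list is only an illustrative subset:

# Hypothetical sketch: rank candidate languages from the scores at the language
# position, assuming outputs.scores[0] covers that position (old behaviour).
candidate_tokens = ["<|en|>", "<|fr|>", "<|de|>", "<|es|>"]  # illustrative subset of Whisper's language tokens
candidate_ids = lid_processor.tokenizer.convert_tokens_to_ids(candidate_tokens)

lang_scores = outputs.scores[0][0]                      # (vocab_size,)
lang_probs = lang_scores[candidate_ids].softmax(dim=-1)
ranking = sorted(zip(candidate_tokens, lang_probs.tolist()), key=lambda x: -x[1])
print(ranking)   # e.g. [('<|en|>', 0.8), ('<|fr|>', 0.1), ...]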
I checked with a previous version of transformers (4.28.1) and it seems to work there: max_new_tokens=1 generates only the language token, so the associated score is indeed that of the language token.
It therefore looks like a regression / bug in the new version of transformers.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey @antoinethl. Sorry for the delay. When you tried with the older version of transformers, are you sure that the decoder_input_ids were not just 2 tokens?
This could just mean that the generation config was changed (there have been lots of updates), and that by default it now adds notimestamps and the predicted language token.
I am not the best person to talk about this, as I missed a few issues, but it sounds like it's not a regression, maybe a feature request!
Hey @antoinethl, sorry for the delay here. Previously, we computed the log-probs for the language and task tokens, even if these were implicitly specified by the user. Since https://github.com/huggingface/transformers/pull/28687, we now pass the language and task tokens as decoder input ids to the model. This saves two forward passes of the decoder per decoding loop, since we no longer have to run a forward pass for these tokens, but we lose the log-prob computation.
If you're happy doing an extra forward pass of the encoder and decoder, you can compute the language probability scores as follows:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset, Audio
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny", low_cpu_mem_usage=True)
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model.to(device)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(16_000))
sample = next(iter(dataset))
# pre-process the audio inputs for sequential long form generation
inputs = processor([sample["audio"]["array"], sample["audio"]["array"]], padding=True, truncation=False, return_attention_mask=True, return_tensors="pt", sampling_rate=16_000).to(device)
input_stride = model.model.encoder.conv1.stride[0] * model.model.encoder.conv2.stride[0]
num_segment_frames = input_stride * model.config.max_source_positions
batch_size = inputs.input_features.shape[0]
# predict the language from the first 30-second chunk
decoder_input_ids = (torch.ones((batch_size, 1), device=device, dtype=torch.long) * model.generation_config.decoder_start_token_id)
input_features = inputs.input_features[:, :, :num_segment_frames]
with torch.no_grad():
    logits = model(input_features, decoder_input_ids=decoder_input_ids).logits[:, -1]
# auto-regressively generate
pred_ids = model.generate(**inputs)
pred_text = processor.batch_decode(pred_ids)
language_probs = torch.gather(logits, 1, pred_ids[:, 1:2]).squeeze(1)
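If you want an actual probability ranking over languages rather than the raw logit of the predicted token, one possible follow-up (illustrative, not part of the snippet above; the candidate list is only a subset) would be to softmax the logits over the language token ids:

# Possible follow-up: restrict the logits to a set of Whisper language tokens
# and softmax to get per-language probabilities for each batch item.
candidate_tokens = ["<|en|>", "<|fr|>", "<|de|>", "<|es|>"]  # extend to all languages as needed
candidate_ids = processor.tokenizer.convert_tokens_to_ids(candidate_tokens)

lang_probs = logits[:, candidate_ids].softmax(dim=-1)   # (batch_size, num_candidates)
for row in lang_probs:
    print(sorted(zip(candidate_tokens, row.tolist()), key=lambda x: -x[1]))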
We use a similar logic in the generation code in Whisper: https://github.com/huggingface/transformers/blob/b7d002bdff3646cfd55f120b2b9e1b065d54fae5/src/transformers/models/whisper/generation_whisper.py#L1210
If you feel strongly that the language prob should also be part of the generation output, this is definitely something we can discuss. It's the first time I've seen this requested since we did the refactoring of Whisper generate, so to me it looks like solving it with an extra few lines of code and doing an extra forward pass might be the easiest solution here.
cc @kamilakesbi
Gentle ping @kamilakesbi
For now I haven't seen any further requests to integrate the language probability as part of the Whisper output. In the interest of keeping the outputs from generate consistent with other models, I suggest we leave the generation code as is, and encourage users to run an extra encoder + decoder forward pass should they need the language probs.
Note that the return_language argument is available using the pipeline API. You can use it as follows, @antoinethl:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30.0,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample, return_language=True)
print(result)
Which gives the predicted language for each chunk:
{'text': ' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.',
'chunks': [{'language': 'english',
'timestamp': (0.0, 5.86),
'text': ' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'}]}
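The detected language can then be read directly from the returned dict, e.g.:

# Read the per-chunk language out of the pipeline result
for chunk in result["chunks"]:
    print(chunk["language"], chunk["timestamp"], chunk["text"])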
return_language doesn't seem to work with word-level timestamps.
Hi @hanif-rt, this should be solved with PR #31572 :)
@sanchit-gandhi Hi, I wonder how I can directly get the logits from the generate method. Thanks!
You should use output_scores=True, return_dict_in_generate=True when calling generate.
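For example, a minimal sketch reusing the model and input_features names from the earlier snippets:

# Ask generate to return the per-step scores alongside the sequences
outputs = model.generate(
    input_features,
    output_scores=True,
    return_dict_in_generate=True,
)
print(len(outputs.scores))       # one entry per generated token
print(outputs.scores[0].shape)   # (batch_size, vocab_size)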
@ArthurZucker Thanks!!! I am also wondering how I can check the original generate function code. In generation_whisper.py I can see super().generate(), but I don't know where I can find the parent's implementation.
https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1588 🤗
@ArthurZucker Thanks! 🤗
@sanchit-gandhi: Setting the "return_language" flag to true does not help for the multilingual use case. The model returns only one language even though there are multiple languages in the given audio, and for English it returns the language id as None.
Test case and results: the audio contains "Hello, how are you? Hola, como estas? Bonjour, como se va?". The model gave the following result: {'text': 'Hello, how are you? Hola, como estas? Bonjour, como se va?', 'chunks': [{'language': None, 'text': 'Hello, how are you? Hola, como estas? Bonjour, como se va?'}]}
Ask: is there a way to get all the language ids and their probabilities via the pipeline interface?
That is why this is a feature request! As for getting all the predicted language ids, I am not entirely sure; I haven't dug into the generate code in a while, but the model itself cannot switch languages within a 30-second segment, only between consecutive 30-second segments.
+1 to register the request for integrating the language probability as part of the Whisper output.
Suggestion: when return_language_prob is set in pipe(), return the language_prob of the language with the highest probability.
@sanchit-gandhi
I want to join the list of people here who were negatively affected by this change and wish it did not affect users that do not explicitly specify a language in .generate().
Our lab has used langdetect with Whisper and .generate() extensively to prepare and filter datasets for training speech-to-text models. We have also used the functionality in large-scale batched inference, where we first detect the language for each chunk and then perform batched inference where different chunks may be in different languages (roughly as sketched below).
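A rough sketch of that per-chunk workflow (hypothetical code, reusing the model, processor and device names from the earlier snippets; segments are naive 30-second slices):

import torch

# Hypothetical per-chunk language detection: slice the waveform into 30-second
# segments and read the most likely token at the language position of each one.
def detect_language_per_chunk(audio, sampling_rate=16_000, chunk_s=30):
    chunk_len = chunk_s * sampling_rate
    languages = []
    for start in range(0, len(audio), chunk_len):
        segment = audio[start:start + chunk_len]
        features = processor(segment, sampling_rate=sampling_rate, return_tensors="pt").input_features.to(device)
        decoder_input_ids = torch.tensor([[model.generation_config.decoder_start_token_id]], device=device)
        with torch.no_grad():
            logits = model(features, decoder_input_ids=decoder_input_ids).logits[:, -1]
        # assumes the highest-scoring token at this position is a language token
        languages.append(processor.tokenizer.convert_ids_to_tokens(logits.argmax(dim=-1).item()))
    return languages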
This is an annoyance from our viewpoint, so we pin our transformers version to before this change was implemented.
I understand that it may not be relevant to return these probabilities when a user explicitly specifies the language, but the fact that these outputs are automatically removed even for users who do not specify a language is not ideal.
Very strong +1 for some kind of option in .generate() for returning all logits.
for reference @ylacombe