
Whisper do_sample through generation_config and generate() give different results

Open udeepam opened this issue 11 months ago • 5 comments

System Info

  • transformers version: 4.38.1
  • Platform: Linux-5.15.0-1049-aws-x86_64-with-glibc2.31
  • Python version: 3.11.6
  • Huggingface_hub version: 0.21.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.24.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@sanchit-gandhi

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I am running Whisper on long-form audio.

Below is the script I run, where I use generation_config to set the do_sample argument.

from __future__ import annotations

import copy
import os
from typing import Any

import numpy as np
import numpy.typing as npt
import torch
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
)
from transformers.generation.configuration_utils import GenerationConfig
from transformers.generation.utils import GenerateBeamEncoderDecoderOutput
from transformers.pipelines.audio_utils import ffmpeg_read
from transformers import set_seed

set_seed(10)

MODEL: str = "openai/whisper-large-v3"
DEVICE: str = "cuda" if torch.cuda.is_available() else "cpu"
TORCH_DTYPE: torch.dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the processor.
processor: WhisperProcessor = WhisperProcessor.from_pretrained(
    pretrained_model_name_or_path=MODEL,
    torch_dtype=TORCH_DTYPE,
)
# Load the model.
model: WhisperForConditionalGeneration = (
    WhisperForConditionalGeneration.from_pretrained(
        pretrained_model_name_or_path=MODEL,
        torch_dtype=TORCH_DTYPE,
    )
)
model.to(DEVICE)

# Load the audio as bytes. `input_path` is a placeholder; point it at a long-form audio file.
input_path: str = "long_form_audio.wav"
with open(input_path, "rb") as f:
    input_bytes: bytes = f.read()
# Convert to numpy array.
inputs: npt.NDArray[np.float32] = ffmpeg_read(
    bpayload=input_bytes,
    sampling_rate=processor.feature_extractor.sampling_rate,
)

# Process the audio into chunks.
processed: dict[str, torch.Tensor] = processor.feature_extractor(
    raw_speech=inputs,
    truncation=False,
    return_attention_mask=True,
    padding="longest",
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
)
processed = {
    k: v.to(device=DEVICE, dtype=TORCH_DTYPE)
    for k, v in processed.items()
}

# Generation config.
generation_config: GenerationConfig = copy.deepcopy(
    model.generation_config
)
generation_config.do_sample = True
generation_config.num_beams = 5
generation_config.condition_on_prev_tokens = True
generation_config.logprob_threshold = -1.0
generation_config.return_dict_in_generate = True

# Create the generate args.
generate_args: dict[str, Any] = {
    "generation_config": generation_config,
    "task": "transcribe",
    "language": "english",
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "return_segments": True,
}

# Run the model.
model_output: GenerateBeamEncoderDecoderOutput | dict[str, Any] = (
    model.generate(**processed, **generate_args)
)

# Decode the model outputs.
text: str
optional: dict[str, Any]
text, optional = processor.tokenizer._decode_asr(
    model_outputs=[{"tokens": model_output["sequences"]}],
    return_timestamps=False,
    return_language="english",
    time_precision=processor.feature_extractor.chunk_length
    / model.config.max_source_positions,
)

The above produces one output. If, instead of using generation_config, I pass the same arguments directly to generate(), e.g.

# Create the generate args.
generate_args: dict[str, Any] = {
    "task": "transcribe",
    "language": "english",
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "return_segments": True,
    "do_sample": True,
    "num_beams": 5,
    "condition_on_prev_tokens": True,
    "logprob_threshold": -1.0,
    "return_dict_in_generate": True,
}

# Run the model.
model_output: GenerateBeamEncoderDecoderOutput | dict[str, Any] = (
    model.generate(**processed, **generate_args)
)

I get a different output, which should not happen given that the seed is the same and nothing else changes.

Experimenting with the arguments, I have isolated the issue to the do_sample argument: passing it through generation_config versus directly to generate() gives different results.
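
For reference, here is a minimal sketch of the isolation test. It reuses model and processed from the reproducer above and drops the other generation options for brevity; the two calls differ only in where do_sample is supplied.

# Path 1: do_sample carried on a GenerationConfig.
config_with_sampling = copy.deepcopy(model.generation_config)
config_with_sampling.do_sample = True

set_seed(10)
output_via_config = model.generate(
    **processed,
    generation_config=config_with_sampling,
    task="transcribe",
    language="english",
)

# Path 2: the same option passed directly to generate().
set_seed(10)
output_via_kwarg = model.generate(
    **processed,
    do_sample=True,
    task="transcribe",
    language="english",
)

# With the seed reset before each call the two outputs should match,
# but in my runs (with the full argument set above) they do not.
print(torch.equal(output_via_config, output_via_kwarg))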

Expected behavior

The two approaches should produce the same result.

udeepam avatar Mar 01 '24 15:03 udeepam

Gentle ping @sanchit-gandhi @ylacombe

amyeroberts avatar Apr 10 '24 13:04 amyeroberts

Thanks @udeepam for the clear reproducer - could you take a look @kamilakesbi?

sanchit-gandhi avatar Apr 10 '24 13:04 sanchit-gandhi

Hi @udeepam, thanks for this issue and the clear reproducer!

On the latest version of the main branch (transformers 4.40.0.dev0), I get the same results with and without generation_config, which suggests this has already been fixed. I've run the following tests to compare the two results:
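
(Here, model_output is the output of the generation_config-based call and model_output2 the output of the kwargs-based call; the latter is produced along the lines of the second snippet in the report:)

# Second call style: all options passed directly to generate().
model_output2 = model.generate(**processed, **generate_args)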

# The full generated sequences match.
assert torch.equal(model_output['sequences'], model_output2['sequences'])

# Every segment has identical tokens and timestamps.
for i in range(len(model_output['segments'][0])):
    assert torch.equal(model_output['segments'][0][i]['tokens'], model_output2['segments'][0][i]['tokens'])
    assert torch.equal(model_output['segments'][0][i]['start'], model_output2['segments'][0][i]['start'])
    assert torch.equal(model_output['segments'][0][i]['end'], model_output2['segments'][0][i]['end'])

# The beam-search results of the last segment also match.
assert torch.equal(model_output['segments'][0][-1]['result']['sequences'], model_output2['segments'][0][-1]['result']['sequences'])
assert torch.equal(model_output['segments'][0][-1]['result']['sequences_scores'], model_output2['segments'][0][-1]['result']['sequences_scores'])
assert torch.equal(model_output['segments'][0][-1]['result']['beam_indices'], model_output2['segments'][0][-1]['result']['beam_indices'])
assert model_output['segments'][0][-1]['result']['past_key_values'] == model_output2['segments'][0][-1]['result']['past_key_values']

# The per-step scores match as well.
for i in range(len(model_output['segments'][0][-1]['result']['scores'])):
    assert torch.equal(model_output['segments'][0][-1]['result']['scores'][i], model_output2['segments'][0][-1]['result']['scores'][i])

cc @amyeroberts @sanchit-gandhi

kamilakesbi avatar Apr 12 '24 15:04 kamilakesbi

Thanks @kamilakesbi! As a quick tip, @udeepam, you can get the latest version of Transformers by installing from source or with an editable install.
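
For example, pip install git+https://github.com/huggingface/transformers installs the current main branch, and running pip install -e . inside a local clone of the repository gives you an editable install.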

sanchit-gandhi avatar Apr 12 '24 15:04 sanchit-gandhi

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 07 '24 08:05 github-actions[bot]