
Whisper do_sample through generation_config and generate() give different results

Open udeepam opened this issue 11 months ago • 5 comments

System Info

  • transformers version: 4.38.1
  • Platform: Linux-5.15.0-1049-aws-x86_64-with-glibc2.31
  • Python version: 3.11.6
  • Huggingface_hub version: 0.21.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.24.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@sanchit-gandhi

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I am running Whisper on long-form audio.

Below is the script I run, where I use generation_config to set the do_sample argument.

from __future__ import annotations

import copy
import os
from typing import Any

import numpy as np
import numpy.typing as npt
import torch
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
)
from transformers.generation.configuration_utils import GenerationConfig
from transformers.generation.utils import GenerateBeamEncoderDecoderOutput
from transformers.pipelines.audio_utils import ffmpeg_read
from transformers import set_seed

set_seed(10)

MODEL: str = "openai/whisper-large-v3"
DEVICE: str = "cuda" if torch.cuda.is_available() else "cpu"
TORCH_DTYPE: torch.dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the processor.
processor: WhisperProcessor = WhisperProcessor.from_pretrained(
    pretrained_model_name_or_path=MODEL,
    torch_dtype=TORCH_DTYPE,
)
# Load the model.
model: WhisperForConditionalGeneration = (
    WhisperForConditionalGeneration.from_pretrained(
        pretrained_model_name_or_path=MODEL,
        torch_dtype=TORCH_DTYPE,
    )
)
model.to(DEVICE)

# Load the audio as bytes. `input_path` is a placeholder; point it at a long-form audio file.
input_path: str = "long_form_audio.wav"
with open(input_path, "rb") as f:
    input_bytes: bytes = f.read()
# Convert to numpy array.
inputs: npt.NDArray[np.float32] = ffmpeg_read(
    bpayload=input_bytes,
    sampling_rate=processor.feature_extractor.sampling_rate,
)

# Process the audio into chunks.
processed: dict[str, torch.Tensor] = processor.feature_extractor(
    raw_speech=inputs,
    truncation=False,
    return_attention_mask=True,
    padding="longest",
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
)
processed = {
    k: v.to(device=DEVICE, dtype=TORCH_DTYPE)
    for k, v in processed.items()
}

# Generation config.
generation_config: GenerationConfig = copy.deepcopy(
    model.generation_config
)
generation_config.do_sample = True
generation_config.num_beams = 5
generation_config.condition_on_prev_tokens = True
generation_config.logprob_threshold = -1.0
generation_config.return_dict_in_generate = True

# Create the generate args.
generate_args: dict[str, Any] = {
    "generation_config": generation_config,
    "task": "transcribe",
    "language": "english",
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "return_segments": True,
}

# Run the model.
model_output: GenerateBeamEncoderDecoderOutput | dict[str, Any] = (
    model.generate(**processed, **generate_args)
)

# Decode the model outputs.
text: str
optional: dict[str, Any]
text, optional = processor.tokenizer._decode_asr(
    model_outputs=[{"tokens": model_output["sequences"]}],
    return_timestamps=False,
    return_language="english",
    time_precision=processor.feature_extractor.chunk_length
    / model.config.max_source_positions,
)

The above produces one output. If, instead of using generation_config, I pass the same arguments directly to generate(), e.g.

# Create the generate args.
generate_args: dict[str, Any] = {
    "task": "transcribe",
    "language": "english",
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "return_segments": True,
    "do_sample": True,
    "num_beams": 5,
    "condition_on_prev_tokens": True,
    "logprob_threshold": -1.0,
    "return_dict_in_generate": True,
}

# Run the model.
model_output: GenerateBeamEncoderDecoderOutput | dict[str, Any] = (
    model.generate(**processed, **generate_args)
)

I get a different output, which should not happen given that the seed is the same and nothing else changes.

Experimenting with the arguments, I have isolated the issue to the do_sample argument: passing it through generation_config versus directly to generate() gives different results.
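
For reference, here is a minimal sketch of the isolation test. It reuses model and processed from the reproducer above and drops the other generation options for brevity; the two calls differ only in where do_sample is supplied.

# Path 1: do_sample carried on a GenerationConfig.
config_with_sampling = copy.deepcopy(model.generation_config)
config_with_sampling.do_sample = True

set_seed(10)
output_via_config = model.generate(
    **processed,
    generation_config=config_with_sampling,
    task="transcribe",
    language="english",
)

# Path 2: the same option passed directly to generate().
set_seed(10)
output_via_kwarg = model.generate(
    **processed,
    do_sample=True,
    task="transcribe",
    language="english",
)

# With the seed reset before each call the two outputs should match,
# but in my runs (with the full argument set above) they do not.
print(torch.equal(output_via_config, output_via_kwarg))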

Expected behavior

The two approaches should produce the same result.

udeepam avatar Mar 01 '24 15:03 udeepam

Gentle ping @sanchit-gandhi @ylacombe

amyeroberts avatar Apr 10 '24 13:04 amyeroberts

Thanks @udeepam for the clear reproducer - could you take a look @kamilakesbi?

sanchit-gandhi avatar Apr 10 '24 13:04 sanchit-gandhi

Hi @udeepam, thanks for this issue and the clear reproducer!

On the latest version of the main branch (transformers 4.40.0.dev0), I get the same results with and without generation_config, which suggests this has already been fixed. I've run the following tests to compare the two results:
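
(Here, model_output is the output of the generation_config-based call and model_output2 the output of the kwargs-based call; the latter is produced along the lines of the second snippet in the report:)

# Second call style: all options passed directly to generate().
model_output2 = model.generate(**processed, **generate_args)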

# The full generated sequences match.
assert torch.equal(model_output['sequences'], model_output2['sequences'])

# Every segment has identical tokens and timestamps.
for i in range(len(model_output['segments'][0])):
    assert torch.equal(model_output['segments'][0][i]['tokens'], model_output2['segments'][0][i]['tokens'])
    assert torch.equal(model_output['segments'][0][i]['start'], model_output2['segments'][0][i]['start'])
    assert torch.equal(model_output['segments'][0][i]['end'], model_output2['segments'][0][i]['end'])

# The beam-search results of the last segment also match.
assert torch.equal(model_output['segments'][0][-1]['result']['sequences'], model_output2['segments'][0][-1]['result']['sequences'])
assert torch.equal(model_output['segments'][0][-1]['result']['sequences_scores'], model_output2['segments'][0][-1]['result']['sequences_scores'])
assert torch.equal(model_output['segments'][0][-1]['result']['beam_indices'], model_output2['segments'][0][-1]['result']['beam_indices'])
assert model_output['segments'][0][-1]['result']['past_key_values'] == model_output2['segments'][0][-1]['result']['past_key_values']

# The per-step scores match as well.
for i in range(len(model_output['segments'][0][-1]['result']['scores'])):
    assert torch.equal(model_output['segments'][0][-1]['result']['scores'][i], model_output2['segments'][0][-1]['result']['scores'][i])

cc @amyeroberts @sanchit-gandhi

kamilakesbi avatar Apr 12 '24 15:04 kamilakesbi

Thanks @kamilakesbi! As a quick tip, @udeepam, you can get the latest version of Transformers by installing from source or with an editable install.
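
For example, pip install git+https://github.com/huggingface/transformers installs the current main branch, and running pip install -e . inside a local clone of the repository gives you an editable install.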

sanchit-gandhi avatar Apr 12 '24 15:04 sanchit-gandhi

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 07 '24 08:05 github-actions[bot]