
Rescale layer in whisper processor

JeffreyWardman opened this issue on Oct 26, 2022 · 8 comments

Feature request

The Whisper processor does not currently rescale audio inputs to the [-1, 1) range that the model expects.

Motivation

Consistency between model processor layers.

Your contribution

JeffreyWardman avatar Oct 26 '22 05:10 JeffreyWardman

Please provide a code reproducer for the bug you are experiencing, or there is nothing we can do to help.

sgugger avatar Oct 26 '22 13:10 sgugger

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCTC,
    AutoProcessor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)


def inference(audio_input, processor, model):
    output = processor(audio_input, sampling_rate=16000, return_tensors="pt")

    if "whisper" in processor.tokenizer_class.lower():
        # Whisper is an encoder-decoder model: generate predicted token ids from the log-Mel features
        input_features = output.input_features
        with torch.no_grad():
            predicted_ids = model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, output_word_offsets=True)[0]
    else:
        # Wav2Vec2 is a CTC model: take the frame-wise argmax over the logits
        input_features = output.input_values
        with torch.no_grad():
            logits = model(input_features).logits[0]
            predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.decode(predicted_ids, output_word_offsets=True)
    return transcription


def get_transcript(audio, model, processor):
    # Min-max rescale to [-1, 1] for the "scaled" variant
    audio_scaled = ((audio - audio.min()) / (audio.max() - audio.min())) * 2 - 1
    scaled_transcription = inference(audio_scaled, processor, model)
    unscaled_transcription = inference(audio, processor, model)
    return {"scaled": scaled_transcription, "unscaled": unscaled_transcription}

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]["array"]
audio = ((audio - audio.min()) / (audio.max() - audio.min())) * 65535  # Rescale to [0, 65535] to show issue

whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en").to("cpu")

wav2vec_processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec_model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

whisper_transcripts = get_transcript(audio, whisper_model, whisper_processor)
wav2vec_transcripts = get_transcript(audio, wav2vec_model, wav2vec_processor)
print(f"WHISPER: {whisper_transcripts}")
print(f"WAV2VEC: {wav2vec_transcripts}")

Output:

WHISPER: {'scaled': ' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.', 
'unscaled': ' I'}

WAV2VEC: {'scaled': Wav2Vec2CTCTokenizerOutput(text='MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL', char_offsets=None, word_offsets=[{'word': 'MISTER', 'start_offset': 28, 'end_offset': 40}, {'word': 'QUILTER', 'start_offset': 43, 'end_offset': 60}, {'word': 'IS', 'start_offset': 66, 'end_offset': 69}, {'word': 'THE', 'start_offset': 72, 'end_offset': 76}, {'word': 'APOSTLE', 'start_offset': 80, 'end_offset': 103}, {'word': 'OF', 'start_offset': 109, 'end_offset': 111}, {'word': 'THE', 'start_offset': 115, 'end_offset': 118}, {'word': 'MIDDLE', 'start_offset': 120, 'end_offset': 131}, {'word': 'CLASSES', 'start_offset': 133, 'end_offset': 156}, {'word': 'AND', 'start_offset': 168, 'end_offset': 172}, {'word': 'WE', 'start_offset': 174, 'end_offset': 178}, {'word': 'ARE', 'start_offset': 181, 'end_offset': 185}, {'word': 'GLAD', 'start_offset': 187, 'end_offset': 200}, {'word': 'TO', 'start_offset': 205, 'end_offset': 209}, {'word': 'WELCOME', 'start_offset': 212, 'end_offset': 229}, {'word': 'HIS', 'start_offset': 234, 'end_offset': 240}, {'word': 'GOSPEL', 'start_offset': 245, 'end_offset': 267}]),
 'unscaled': Wav2Vec2CTCTokenizerOutput(text='MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL', char_offsets=None, word_offsets=[{'word': 'MISTER', 'start_offset': 28, 'end_offset': 40}, {'word': 'QUILTER', 'start_offset': 43, 'end_offset': 60}, {'word': 'IS', 'start_offset': 66, 'end_offset': 69}, {'word': 'THE', 'start_offset': 72, 'end_offset': 76}, {'word': 'APOSTLE', 'start_offset': 80, 'end_offset': 103}, {'word': 'OF', 'start_offset': 109, 'end_offset': 111}, {'word': 'THE', 'start_offset': 115, 'end_offset': 118}, {'word': 'MIDDLE', 'start_offset': 120, 'end_offset': 131}, {'word': 'CLASSES', 'start_offset': 133, 'end_offset': 156}, {'word': 'AND', 'start_offset': 168, 'end_offset': 172}, {'word': 'WE', 'start_offset': 174, 'end_offset': 178}, {'word': 'ARE', 'start_offset': 181, 'end_offset': 185}, {'word': 'GLAD', 'start_offset': 187, 'end_offset': 200}, {'word': 'TO', 'start_offset': 205, 'end_offset': 209}, {'word': 'WELCOME', 'start_offset': 212, 'end_offset': 229}, {'word': 'HIS', 'start_offset': 234, 'end_offset': 240}, {'word': 'GOSPEL', 'start_offset': 245, 'end_offset': 267}])}

JeffreyWardman avatar Oct 27 '22 01:10 JeffreyWardman

You can see above that the transcript is gibberish for Whisper when the input is not rescaled. This is because the model receives input in the range [0, 65535] rather than [-1, 1].
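(You can confirm the range mismatch directly; a minimal check on the arrays from the snippet above:)

print(audio.min(), audio.max())  # roughly 0.0 and 65535.0 after the rescale above
audio_scaled = ((audio - audio.min()) / (audio.max() - audio.min())) * 2 - 1
print(audio_scaled.min(), audio_scaled.max())  # roughly -1.0 and 1.0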

JeffreyWardman avatar Oct 27 '22 01:10 JeffreyWardman

Thanks! cc @sanchit-gandhi and @ArthurZucker

sgugger avatar Oct 27 '22 13:10 sgugger

Hey @JeffreyWardman, this is a really interesting issue! I've chosen not to compare Whisper to Wav2Vec2 in my analysis, as these two systems are intrinsically different in how they process the audio inputs:

With Wav2Vec2, we first normalise the raw audio inputs to (mean, std) = (0, 1). We then pass the normalised audio inputs to the model (as you have done in your code example). In this way, Wav2Vec2 takes normalised audio values as its model inputs.

This is exactly the operation that the Wav2Vec2 feature extractor performs for us:

normalised_audio = wav2vec_processor.feature_extractor(audio).input_values
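(For reference, this is roughly equivalent to zero-mean, unit-variance scaling applied directly to the raw audio; a minimal sketch, where the exact epsilon used by the feature extractor is an assumption for illustration:)

import numpy as np

# Roughly what the Wav2Vec2 feature extractor does in the audio space
manually_normalised = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)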

With Whisper, we first convert the raw audio inputs to a log-Mel spectrogram, and then feed this spectrogram to the Whisper model. In contrast to Wav2Vec2, Whisper takes the log-Mel features as inputs to the model (rather than audio values).

The audio -> log-Mel conversion is exactly the operation that the Whisper feature extractor performs for us:

logmel_features = whisper_processor.feature_extractor(audio).input_features
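(As a quick sanity check, the returned features are a fixed-size spectrogram rather than a waveform; a minimal sketch, assuming 16 kHz audio and the 80-Mel, 30-second window used by the base checkpoints:)

features = whisper_processor.feature_extractor(audio, sampling_rate=16000, return_tensors="np").input_features
print(features.shape)  # expected to be (1, 80, 3000): 80 Mel bins over a padded 30 s window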

I've had a dig through the original Whisper codebase and compared it to the paper - it seems as though they perform the feature normalisation in the log-Mel space (c.f. Section 2.2 of the paper):

[Screenshot: excerpt from Section 2.2 of the Whisper paper describing the log-Mel feature normalisation]
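(For illustration, the normalisation is applied in the log-Mel space along these lines; this is a paraphrase of the original audio preprocessing rather than a verbatim copy, so the exact constants should be checked against whisper/audio.py:)

import torch

def normalise_log_mel(mel_spec: torch.Tensor) -> torch.Tensor:
    # Paraphrased sketch of the log-Mel normalisation (constants assumed, not verbatim)
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)  # clamp the dynamic range
    return (log_spec + 4.0) / 4.0  # shift and scale to roughly [-1, 1]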

To check whether we missed something with our implementation, I ran your code example on the original Whisper repo. To reproduce this, first install the original (OpenAI) version of the model from https://github.com/openai/whisper:

pip install git+https://github.com/openai/whisper.git

I then tweaked your code snippet to make it compatible with the OpenAI model, following the "official" example provided in https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb:

import torch
import whisper
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"

model = whisper.load_model("base.en")
model.to(device)

# define the decoding options
options = whisper.DecodingOptions(language="en", without_timestamps=True)

# load audio sample as before
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]["array"]
audio = ((audio - audio.min()) / (audio.max() - audio.min())) * 65535  # Rescale to [0, 65535] to show issue

def inference(audio):
    # the whisper pre-processor expects torch tensors (not np.arrays or lists)
    audio = torch.tensor(audio)
    audio = whisper.pad_or_trim(audio.flatten()).to(device)
    mel = whisper.log_mel_spectrogram(audio)

    results = model.decode(mel, options)
    return results.text

def get_transcript(audio):
    # min-max rescale to [-1, 1] for the "scaled" variant
    audio_scaled = ((audio - audio.min()) / (audio.max() - audio.min())) * 2 - 1
    scaled_transcription = inference(audio_scaled)
    unscaled_transcription = inference(audio)
    return {"scaled": scaled_transcription, "unscaled": unscaled_transcription}

original_transcripts = get_transcript(audio)
print("ORIGINAL OpenAI: \n", original_transcripts)

Print output:

ORIGINAL OpenAI:  
{'scaled': 'Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.',
'unscaled': 'I'}

This is the same output that we got with Transformers Whisper, so we can be sure that the Transformers implementation matches the official OpenAI one ✅ It means the behaviour is intrinsic to the Whisper model (rather than a Transformers implementation issue). I think it comes down to the fact that the Whisper model does not normalise the audio inputs prior to computing the log-Mel spectrogram.

In Transformers, we aim to provide an implementation that matches the original model. In that regard, I don't think we can currently change the Transformers Whisper codebase to normalise audio samples before computing the log-Mel spectrogram features, since this is an inherent limitation of the Whisper model itself. Instead, what I'll do is post this issue on the original codebase and ask the authors whether this behaviour is expected. If they update their codebase to normalise the inputs, we can do the same in Transformers 🤗
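(In the meantime, a user-side workaround is simply to rescale the raw audio yourself before calling the processor; a minimal sketch that peak-normalises to [-1, 1]:)

import numpy as np

def peak_normalise(audio: np.ndarray) -> np.ndarray:
    # Rescale the raw waveform to [-1, 1] before handing it to the Whisper processor
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

input_features = whisper_processor(peak_normalise(audio), sampling_rate=16000, return_tensors="pt").input_features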

Hope that makes sense and thank you for the great issue!

(edit: opened a discussion thread on the original OpenAI repo, awaiting the author's response https://github.com/openai/whisper/discussions/428#discussion-4510905)

sanchit-gandhi avatar Oct 27 '22 16:10 sanchit-gandhi

Thanks a lot @sanchit-gandhi 💯, totally agree with you. Also, in the various tests that I ran during the integration, I did not really have any issues with custom inputs, so I am also wondering if there are any potential applications for this feature request. If yes, we could definitely add an optional argument, but otherwise I am happy keeping it close to the original codebase! 👍🏻

ArthurZucker avatar Oct 27 '22 17:10 ArthurZucker

I think it makes sense to offer an (optional) argument to the feature-extractor indicating whether the audio inputs should be normalised in the audio space:

  • do_normalise (Optional, defaults to False): whether or not to normalise the audio inputs prior to computing the log-Mel features.

This would look something along the lines of:

from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base.en")
# don't normalise
input_features = feature_extractor(audio, do_normalise=False).input_features[0]
# do normalise
input_features = feature_extractor(audio, do_normalise=True).input_features[0]

-> we can add this quite easily for more control over inference

c.f. https://github.com/openai/whisper/discussions/428#discussioncomment-4057857
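(Under the hood, the flag would presumably apply the same zero-mean, unit-variance normalisation that the Wav2Vec2 feature extractor uses, before the log-Mel conversion; a rough sketch of the intended behaviour, not the final implementation:)

import numpy as np

def maybe_normalise(audio: np.ndarray, do_normalise: bool = False) -> np.ndarray:
    # Optionally normalise in the audio space before computing the log-Mel features
    if do_normalise:
        audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)  # epsilon assumed for illustration
    return audio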

sanchit-gandhi avatar Nov 04 '22 15:11 sanchit-gandhi

Adding it to my Whisper to-do list.

ArthurZucker avatar Nov 07 '22 10:11 ArthurZucker