
Not able to run evaluate on whisper.tflite that got generated from TFWhisper model

Open nyadla-sys opened this issue 2 years ago • 45 comments

Model description

@gante I converted the HF TFWhisper model into a whisper.tflite model. However, I'm not sure how to evaluate the created whisper.tflite model.

https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/tflite_from_huggingface_whisper.ipynb

I would appreciate your assistance in evaluating whisper.tflite. The notebook mentioned above produces a whisper.tflite file.

Open source status

  • [X] The model implementation is available
  • [X] The model weights are available

Provide useful links for the implementation

No response

nyadla-sys avatar Oct 17 '22 19:10 nyadla-sys

Hi @nyadla-sys 👋

That is a great question! The problem here is that generation is much more than a forward pass of the model. Fortunately, our generation code is compatible with TF Graph mode, which means you can compile the entire generation procedure into a graph, which you can directly compare to our examples.

Here is a continuation of your notebook, which creates a TF Lite model for generation with Whisper: https://colab.research.google.com/drive/1tGL73xRs9mFUY5R03im0R6NNcvJriHun?usp=sharing
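
Once you have the converted file, running it (i.e. the "evaluate" part of your question) would look roughly like the sketch below. This is just an illustration: it assumes the TF Lite model exposes a serving_default signature that takes an input_features tensor and returns the generated token ids under the key "sequences", and that openai/whisper-tiny was the checkpoint that was converted -- adapt the names to whatever your whisper.tflite actually contains.

import tensorflow as tf
from datasets import load_dataset
from transformers import WhisperProcessor

# the processor must match the checkpoint that was converted
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="np")

# load the TF Lite model and grab its serving signature
interpreter = tf.lite.Interpreter(model_path="whisper.tflite")
runner = interpreter.get_signature_runner("serving_default")

# run generation inside the TF Lite interpreter and decode the token ids
outputs = runner(input_features=inputs.input_features)
transcription = processor.batch_decode(outputs["sequences"], skip_special_tokens=True)
print(transcription)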

gante avatar Oct 18 '22 08:10 gante

@gante Is it possible to add a representative_dataset (converter.representative_dataset = representative_dataset) and generate a TFLite int8 model? https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/tinynn_pytorch_to_tflite_int8.ipynb

nyadla-sys avatar Oct 18 '22 15:10 nyadla-sys

@gante Great work, and I appreciate your efforts to make it open.

nyadla-sys avatar Oct 18 '22 15:10 nyadla-sys

@nyadla-sys I don't know how to answer your latest question.

Gently pinging @hollance, who might have better pointers for Whisper + TF Lite + int8

gante avatar Oct 18 '22 15:10 gante

@gante Is it feasible to get a builtin Conv2D and avoid FlexConv2D ending up in the model? The TFLite interpreter needs to link the Flex delegate in order to run the model, since it contains the following select TF op(s):

Flex ops: FlexConv2D
Details: tf.Conv2D(tensor<1x1x?x?xf32>, tensor<1x3x80x384xf32>) -> (tensor<1x1x?x384xf32>) : {data_format = "NHWC", device = "", dilations = [1, 1, 1, 1], explicit_paddings = [], padding = "VALID", strides = [1, 1, 1, 1], use_cudnn_on_gpu = true}

nyadla-sys avatar Oct 18 '22 23:10 nyadla-sys

@gante When I run the generated tflite file with the minimal example from tensorflow/lite/examples, it fails with the error message below:

Execution plan as the list of 568 nodes invoked in-order: [0-567]
--------------Subgraph-8 dump has completed--------------

--------------Memory Arena Status Start--------------
Total memory usage: 396 bytes (0.000 MB)

  • Total arena memory usage: 396 bytes (0.000 MB)
  • Total dynamic memory usage: 0 bytes (0.000 MB)

Subgraph#0 Arena (Normal) 268 (67.68%)
Subgraph#0 Arena (Persistent) 128 (32.32%)
--------------Memory Arena Status End--------------

2022-10-20 16:55:50.791845: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at conv_ops.cc:688 : INVALID_ARGUMENT: input depth must be evenly divisible by filter depth: 1 vs 80
ERROR: input depth must be evenly divisible by filter depth: 1 vs 80
ERROR: Node number 696 (TfLiteFlexDelegate) failed to invoke.
Error at /home/niranjanyadla/useful_sensors/download_tools/openai-work/tflite_linux/tflite_build/tensorflow/tensorflow/lite/examples/minimal/minimal.cc:71

nyadla-sys avatar Oct 20 '22 23:10 nyadla-sys

@gante I modified the generation code as below and it works fine:

@tf.function(
    # shouldn't need static batch size, but throws exception without it (needs to be fixed)
    input_signature=[
        tf.TensorSpec((1, 80, 3000), tf.float32, name="input_features"),
    ],
)

nyadla-sys avatar Oct 21 '22 02:10 nyadla-sys

@gante I found that my 30-second audio should produce more generated ids than the 21 produced by the whisper TFLite model. Is there anything from the tflite model that I am missing? https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/tflite_from_huggingface_whisper.ipynb It also does not produce an English transcript for the full 30 seconds of audio.

nyadla-sys avatar Oct 27 '22 13:10 nyadla-sys

I increased max_tokens to 200 and now I can generate the text for the whole audio.

nyadla-sys avatar Oct 28 '22 02:10 nyadla-sys

@nyadla-sys two questions to help pinpoint the problem:

  1. Does the standard TF model (i.e. non-TFLite) work correctly for that audio file?
  2. If the answer to 1 is yes: can you share a code example of the problem? (the link above doesn't work for me)

gante avatar Oct 28 '22 17:10 gante

@gante I have now modified the colab notebook to generate more tokens, as per the line below from the HF colab:

predicted_ids = model.generate(inputs, max_length=480_000)

I referred to this snippet from the HF colab: https://colab.research.google.com/drive/191WGH59ZZ-xyu8d6GWbuqZHa_MQJmQpA?usp=sharing#scrollTo=yENhy_7Qq5nU

nyadla-sys avatar Oct 28 '22 17:10 nyadla-sys

@gante @hollance

I have added something like the below and it is giving a segmentation fault. Could you please help me with this? I added converter.representative_dataset = representative_dataset together with the representative_dataset function shown in the full code below.

Please see the below code for detailed information:

import tensorflow as tf

class GenerateModel(tf.Module):
    def __init__(self, model):
        super(GenerateModel, self).__init__()
        self.model = model

    @tf.function(
        # shouldn't need static batch size, but throws exception without it (needs to be fixed)
        input_signature=[
            tf.TensorSpec((1, 80, 3000), tf.float32, name="input_features"),
        ],
    )
    def serving(self, input_features):
        outputs = self.model.generate(
            input_features,
            max_new_tokens=223,  # change as needed
            return_dict_in_generate=True,
        )
        return {"sequences": outputs["sequences"]}

def representative_dataset():
    for x in range(1):
        inputs = feature_extractor(
            ds[x]["audio"]["array"],
            sampling_rate=ds[0]["audio"]["sampling_rate"],
            return_tensors="tf")
        input_features = inputs.input_features
        # print(input_features)
        yield [input_features]

saved_model_dir = '/content/tf_whisper_saved'
tflite_model_path = 'whisper.tflite'

generate_model = GenerateModel(model=model)
tf.saved_model.save(generate_model, saved_model_dir, signatures={"serving_default": generate_model.serving})

# Convert the model
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # enable TensorFlow Lite ops.
    tf.lite.OpsSet.SELECT_TF_OPS,    # enable TensorFlow ops.
]
converter.representative_dataset = representative_dataset
#converter.inference_input_type = tf.int8  # or tf.uint8
#converter.inference_output_type = tf.int8  # or tf.uint8
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Save the model
with open(tflite_model_path, 'wb') as f:
    f.write(tflite_model)

nyadla-sys avatar Oct 31 '22 18:10 nyadla-sys

@hollance @gante I was able to convert the Hugging Face whisper ONNX model to a tflite (int8) model; however, I am not sure how to run inference on this model. Could you please review the notebook and let me know if there is anything I am missing in the onnx-to-tflite conversion? https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_to_onnx_tflite_int8.ipynb

nyadla-sys avatar Nov 02 '22 21:11 nyadla-sys

Hey @nyadla-sys -- model quantization with TFLite is beyond what we currently support here in transformers, so I'm afraid I won't dig into your issue at the moment.

You can, however, try asking that question in our forum 🤗; you might find support from other users there.

gante avatar Nov 03 '22 14:11 gante

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Nov 27 '22 15:11 github-actions[bot]

Keep it open!

nyadla-sys avatar Nov 28 '22 19:11 nyadla-sys

@gante Is it possible to modify the input audio spectrograms from 30s to 10 seconds in order to use them as input for a Hugging Face Whisper TensorFlow model?

On another note, if you have any pointers on generating an int8 model, please share your thoughts.

nyadla-sys avatar Dec 16 '22 02:12 nyadla-sys

@nyadla-sys

Is it possible to modify the input audio spectrograms from 30s to 10 seconds in order to use them as input for a Hugging Face Whisper TensorFlow model?

Not directly -- the model expects a fixed size input, corresponding to 30s.
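
For illustration, the feature extractor always pads (or truncates) the audio to that 30 s window before computing the log-Mel spectrogram, so even a 10 s clip comes out as a (1, 80, 3000) input. A quick sketch (using whisper-tiny and a silent clip purely as an example):

import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

# a 10-second clip of silence at 16 kHz, just to inspect the resulting shape
audio_10s = np.zeros(10 * 16000, dtype=np.float32)
inputs = feature_extractor(audio_10s, sampling_rate=16000, return_tensors="np")
print(inputs.input_features.shape)  # (1, 80, 3000) -- padded out to the full 30 s window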

if you have any pointers on generating an int8 model, please share your thoughts

I'm not an int8 expert, so I only have minimal pointers: see our Optimum library, which has support for int8 quantization.

gante avatar Dec 21 '22 19:12 gante

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jan 15 '23 15:01 github-actions[bot]

@gante I separated the encoder and decoder tflite models; however, while running inference on the decoder I only get a single output. Could you please review the notebook and let me know if you have any input for me?

nyadla-sys avatar Jan 22 '23 20:01 nyadla-sys

Hi @nyadla-sys 👋 TF Lite is not (yet) a priority for us, as we don't have enough bandwidth to support it, so I'm afraid I won't be able to look at your notebook.

gante avatar Jan 23 '23 11:01 gante

I was able to successfully separate the encoder and decoder whisper tflite models in the following notebook, and they are working correctly: https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb

Posting here to help any HF users who are interested in whisper tflite models.

nyadla-sys avatar Jan 23 '23 23:01 nyadla-sys

@sanchit-gandhi how do I get a transcript from the below script?

import torch
from transformers import AutoFeatureExtractor, WhisperModel
from datasets import load_dataset

model = WhisperModel.from_pretrained("openai/whisper-base")
feature_extractor = AutoFeatureExtractor.from_pretrained("openai/whisper-base")
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt")
input_features = inputs.input_features
decoder_input_ids = torch.tensor([[1, 1]]) * model.config.decoder_start_token_id
last_hidden_state = model(input_features, decoder_input_ids=decoder_input_ids).last_hidden_state
list(last_hidden_state.shape)

nyadla-sys avatar Jan 25 '23 18:01 nyadla-sys

@gante I am attempting to divide the TFWhisperModel into an encoder and a decoder, but the code I have is producing an error. Can you assist me in resolving this issue?

import tensorflow as tf
from transformers import TFWhisperModel
class WhisperEncoder(TFWhisperModel):
    def call(self, inputs, **kwargs):
        return self.encoder(inputs, **kwargs)

class WhisperDecoder(TFWhisperModel):
    def call(self, inputs, **kwargs):
        return self.decoder(inputs, **kwargs)


model = TFWhisperModel.from_pretrained("openai/whisper-tiny")
encoder_model = WhisperEncoder.from_pretrained("openai/whisper-tiny")
decoder_model = WhisperDecoder.from_pretrained("openai/whisper-tiny")


tf.saved_model.save(encoder_model, "whisper_encoder_model_dir")
tf.saved_model.save(decoder_model, "whisper_decoder_model_dir")

Here is the error message:

TypeError: Exception encountered when calling layer "whisper_encoder" (type WhisperEncoder).

encoder() got an unexpected keyword argument 'training'

Call arguments received by layer "whisper_encoder" (type WhisperEncoder):
  • inputs={'input_features': 'tf.Tensor(shape=(2, 80, 2999), dtype=float32)', 'decoder_input_ids': 'tf.Tensor(shape=(1, 2), dtype=int32)'}
  • kwargs={'training': 'None'}

nyadla-sys avatar Jan 26 '23 19:01 nyadla-sys

Hey @nyadla-sys 👋 The encoder and decoder components of Whisper, when isolated, are not compatible with from_pretrained. However, you can still serialize them separately, from different sources:

import tensorflow as tf
from transformers import TFWhisperModel

model_1 = TFWhisperModel.from_pretrained("openai/whisper-tiny")
model_2 = TFWhisperModel.from_pretrained("openai/whisper-tiny")

tf.saved_model.save(model_1.get_encoder(), "/tmp/whisper/encoder")
tf.saved_model.save(model_2.get_decoder(), "/tmp/whisper/decoder")
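
If the goal is to take these to TF Lite, each SavedModel can then be converted on its own. A rough sketch (assuming the /tmp/whisper paths above; depending on how the SavedModels were traced you may need to pass explicit signatures to tf.saved_model.save for the converter to pick them up):

import tensorflow as tf

# convert the encoder and decoder SavedModels separately
for name in ("encoder", "decoder"):
    converter = tf.lite.TFLiteConverter.from_saved_model(f"/tmp/whisper/{name}")
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,
        tf.lite.OpsSet.SELECT_TF_OPS,  # some ops may still fall back to TF kernels
    ]
    with open(f"whisper_{name}.tflite", "wb") as f:
        f.write(converter.convert())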

gante avatar Jan 27 '23 11:01 gante

Hey @nyadla-sys! For inference, we can use the .generate() method to auto-regressively generate using the Whisper model:

import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
processor = AutoProcessor.from_pretrained("openai/whisper-base")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], return_tensors="pt")

input_features = inputs.input_features

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)

Print Output:

[' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.']

sanchit-gandhi avatar Jan 27 '23 11:01 sanchit-gandhi

@sanchit-gandhi Is it possible to generate a transcript using TFWhisperModel instead of WhisperForConditionalGeneration?

nyadla-sys avatar Jan 27 '23 14:01 nyadla-sys

Hey @nyadla-sys!

TFWhisperModel is just the base encoder-decoder model that outputs decoder hidden-states: https://huggingface.co/docs/transformers/model_doc/whisper#transformers.TFWhisperModel

TFWhisperForConditionalGeneration adds a language modelling head on top of TFWhisperModel, mapping the decoder hidden-states to logits over the vocabulary: https://huggingface.co/docs/transformers/model_doc/whisper#transformers.TFWhisperForConditionalGeneration

So you'll need TFWhisperForConditionalGeneration in order to get logits over the vocab (and hence generate text)
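
For example, the TF equivalent of the earlier PyTorch snippet would look something along these lines (a sketch using the same base checkpoint):

from transformers import AutoProcessor, TFWhisperForConditionalGeneration
from datasets import load_dataset

model = TFWhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
processor = AutoProcessor.from_pretrained("openai/whisper-base")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="tf")

# the LM head is what lets generate() turn decoder hidden-states into token ids
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)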

Hope that makes sense!

sanchit-gandhi avatar Feb 01 '23 17:02 sanchit-gandhi

@sanchit-gandhi Is it possible to directly map the decoder hidden states to logits without using the language modeling head? I am focusing on using only TFWhisperModel because it can be fully converted into an int8 model. I'm curious if there is any way to generate text using the decoder hidden states without adding the language modeling head.

nyadla-sys avatar Feb 01 '23 17:02 nyadla-sys

Hey @nyadla-sys, it's precisely the job of the language modelling head to directly map the decoder hidden-states to logits. The language modelling head is a single linear layer that maps from $\mathbb{R}^{d} \to \mathbb{R}^{v}$, where $d$ is the dimensionality of the hidden-states and $v$ is the dimensionality of the vocabulary, so for Whisper small this is a mapping from 768 -> 52000.
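
Purely as a shape illustration (random weights standing in for the real head; the numbers are just the ones mentioned above):

import tensorflow as tf

d_model, vocab_size = 768, 51865  # whisper-small hidden size and (roughly) its vocab size
decoder_hidden_states = tf.random.normal((1, 10, d_model))  # (batch, seq_len, d_model) from the decoder
lm_head = tf.random.normal((d_model, vocab_size))           # the single linear layer described above
logits = tf.matmul(decoder_hidden_states, lm_head)          # -> (1, 10, vocab_size)
print(logits.shape)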

So if you need to map to the vocabulary, you're best off using TFWhisperForConditionalGeneration!

sanchit-gandhi avatar Feb 01 '23 17:02 sanchit-gandhi