transformers
Not able to run evaluate on whisper.tflite that got generated from TFWhisper model
Model description
@gante I generated a whisper.tflite model from the HF TFWhisper model. However, I'm not sure how to evaluate the resulting Whisper TFLite model.
https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/tflite_from_huggingface_whisper.ipynb
I would appreciate your assistance in evaluating whisper.tflite. The notebook mentioned above produces a whisper.tflite file.
Open source status
- [X] The model implementation is available
- [X] The model weights are available
Provide useful links for the implementation
No response
Hi @nyadla-sys 👋
That is a great question! The problem here is that generation is much more than a forward pass of the model. Fortunately, our generation code is compatible with TF Graph mode, which means you can compile the entire generation procedure into a graph, which you can directly compare to our examples.
Here is a continuation of your notebook, which creates a TF Lite model for generation with Whisper: https://colab.research.google.com/drive/1tGL73xRs9mFUY5R03im0R6NNcvJriHun?usp=sharing
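Sketching the idea for anyone skimming the thread: wrap `generate()` in a `tf.function` with a fixed input signature and export that as the serving signature. This is only an outline of the approach in the linked notebook (the checkpoint name and `max_new_tokens` value below are placeholders); the full conversion code appears further down in this thread.

```python
import tensorflow as tf
from transformers import TFWhisperForConditionalGeneration

model = TFWhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

class GenerateModel(tf.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    @tf.function(input_signature=[tf.TensorSpec((1, 80, 3000), tf.float32, name="input_features")])
    def serving(self, input_features):
        # the whole auto-regressive generation loop is traced into the graph
        outputs = self.model.generate(input_features, max_new_tokens=128, return_dict_in_generate=True)
        return {"sequences": outputs["sequences"]}

generate_model = GenerateModel(model)
tf.saved_model.save(generate_model, "tf_whisper_saved",
                    signatures={"serving_default": generate_model.serving})
```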
@gante Is it possible to add a representative_dataset and generate a TFLite (int8) model, i.e. setting `converter.representative_dataset = representative_dataset`? https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/tinynn_pytorch_to_tflite_int8.ipynb
@gante Great work, and I appreciate your efforts to make it open.
@nyadla-sys I don't know how to answer your latest question.
Gently pinging @hollance, who might have better pointers for Whisper + TF Lite + int8
@gante Is it feasible to use the built-in Conv2D and avoid getting FlexConv2D as part of the model? The TFLite interpreter needs to link the Flex delegate in order to run the model, since it contains the following select TF op(s):

```
Flex ops: FlexConv2D
Details: tf.Conv2D(tensor<1x1x?x?xf32>, tensor<1x3x80x384xf32>) -> (tensor<1x1x?x384xf32>) : {data_format = "NHWC", device = "", dilations = [1, 1, 1, 1], explicit_paddings = [], padding = "VALID", strides = [1, 1, 1, 1], use_cudnn_on_gpu = true}
```
@gante When I run the generated tflite file with the minimal example from tensorflow/lite/examples, it fails with the error message below:
```
Execution plan as the list of 568 nodes invoked in-order: [0-567]
--------------Subgraph-8 dump has completed--------------
--------------Memory Arena Status Start--------------
Total memory usage: 396 bytes (0.000 MB)
- Total arena memory usage: 396 bytes (0.000 MB)
- Total dynamic memory usage: 0 bytes (0.000 MB)
Subgraph#0 Arena (Normal) 268 (67.68%)
Subgraph#0 Arena (Persistent) 128 (32.32%)
--------------Memory Arena Status End--------------
2022-10-20 16:55:50.791845: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at conv_ops.cc:688 : INVALID_ARGUMENT: input depth must be evenly divisible by filter depth: 1 vs 80
ERROR: input depth must be evenly divisible by filter depth: 1 vs 80
ERROR: Node number 696 (TfLiteFlexDelegate) failed to invoke.
Error at /home/niranjanyadla/useful_sensors/download_tools/openai-work/tflite_linux/tflite_build/tensorflow/tensorflow/lite/examples/minimal/minimal.cc:71
```
@gante I modified the generation code as below and it works fine; pinning the input signature to a static (1, 80, 3000) shape avoids the dynamic-shape Conv2D that triggered the error above:

```python
@tf.function(
    # shouldn't need static batch size, but throws exception without it (needs to be fixed)
    input_signature=[
        tf.TensorSpec((1, 80, 3000), tf.float32, name="input_features"),
    ],
)
```
@gante I found that my 30-second audio yields more generated ids than the 21 produced by the Whisper TFLite model. Is there anything in the TFLite model that I am missing? https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/tflite_from_huggingface_whisper.ipynb It also does not produce an English transcript for the full 30 seconds of audio.
I increased max_tokens to 200 and now I can generate text for the whole audio.
@nyadla-sys two questions to help pinpoint the problem:
1. Does the standard TF model (i.e. non-TFLite) work correctly for that audio file?
2. If the answer to 1 is yes: can you share a code example of the problem? (The link above doesn't work for me.)
@gante I have now modified the Colab notebook to generate more tokens, as per the line below from the HF Colab:

```python
predicted_ids = model.generate(inputs, max_length=480_000)
```

I referred to this snippet from the HF Colab: https://colab.research.google.com/drive/191WGH59ZZ-xyu8d6GWbuqZHa_MQJmQpA?usp=sharing#scrollTo=yENhy_7Qq5nU
@gante @hollance
I have added something like the snippet below and it is giving a segmentation fault. Could you please help me with this?

```python
converter.representative_dataset = representative_dataset

def representative_dataset():
    for x in range(1):
        inputs = feature_extractor(
            ds[x]["audio"]["array"],
            sampling_rate=ds[0]["audio"]["sampling_rate"],
            return_tensors="tf")
        input_features = inputs.input_features
        # print(input_features)
        yield [input_features]
```
Please see the code below for detailed information:

```python
import tensorflow as tf

class GenerateModel(tf.Module):
    def __init__(self, model):
        super(GenerateModel, self).__init__()
        self.model = model

    @tf.function(
        # shouldn't need static batch size, but throws exception without it (needs to be fixed)
        input_signature=[
            tf.TensorSpec((1, 80, 3000), tf.float32, name="input_features"),
        ],
    )
    def serving(self, input_features):
        outputs = self.model.generate(
            input_features,
            max_new_tokens=223,  # change as needed
            return_dict_in_generate=True,
        )
        return {"sequences": outputs["sequences"]}

def representative_dataset():
    for x in range(1):
        inputs = feature_extractor(
            ds[x]["audio"]["array"],
            sampling_rate=ds[0]["audio"]["sampling_rate"],
            return_tensors="tf")
        input_features = inputs.input_features
        # print(input_features)
        yield [input_features]

saved_model_dir = '/content/tf_whisper_saved'
tflite_model_path = 'whisper.tflite'

generate_model = GenerateModel(model=model)
tf.saved_model.save(generate_model, saved_model_dir, signatures={"serving_default": generate_model.serving})

# Convert the model
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # enable TensorFlow Lite ops.
    tf.lite.OpsSet.SELECT_TF_OPS     # enable TensorFlow ops.
]
converter.representative_dataset = representative_dataset
#converter.inference_input_type = tf.int8  # or tf.uint8
#converter.inference_output_type = tf.int8  # or tf.uint8
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Save the model
with open(tflite_model_path, 'wb') as f:
    f.write(tflite_model)
```
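For completeness, here is a minimal inference sketch against the resulting whisper.tflite, driving the exported `serving_default` signature. This is an assumption of typical usage rather than code from the notebook; it assumes the whisper-tiny checkpoint and uses a `WhisperProcessor` for decoding:

```python
import tensorflow as tf
from datasets import load_dataset
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

# sample audio and log-mel features, as elsewhere in this thread
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"],
                   sampling_rate=ds[0]["audio"]["sampling_rate"],
                   return_tensors="tf")

# load the converted model and grab the exported generate() signature
interpreter = tf.lite.Interpreter(model_path="whisper.tflite")
runner = interpreter.get_signature_runner("serving_default")

# run generation inside the TFLite graph and decode the token ids to text
outputs = runner(input_features=inputs.input_features)
transcription = processor.batch_decode(outputs["sequences"], skip_special_tokens=True)
print(transcription)
```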
@hollance @gante I was able to convert the Hugging Face Whisper ONNX model to a TFLite (int8) model; however, I am not sure how to run inference on it. Could you please review and let me know if there is anything I am missing in the ONNX-to-TFLite conversion? https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_to_onnx_tflite_int8.ipynb
Hey @nyadla-sys -- model quantization with TFLite is beyond what we support at the moment here in transformers, so I am afraid I won't dig into your issue at the moment.
You can, however, try asking that question in our forum 🤗; you might find support from other users there.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Keep it open!
@gante Is it possible to modify the input audio spectrograms from 30s to 10 seconds in order to use them as input for a Hugging Face Whisper TensorFlow model?
On another note, if you have any clues on generating an int8 model, please share your thoughts.
@nyadla-sys
> Is it possible to modify the input audio spectrograms from 30s to 10 seconds in order to use them as input for a Hugging Face Whisper TensorFlow model?

Not directly -- the model expects a fixed-size input, corresponding to 30s.
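That said, shorter audio is simply padded out to the 30s window by the feature extractor, so 10s clips still work as input. A small sketch of this, assuming 16 kHz audio and the standard WhisperFeatureExtractor defaults:

```python
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

# 10 seconds of (dummy) 16 kHz audio
audio_10s = np.zeros(10 * 16_000, dtype=np.float32)

# the extractor pads/truncates to 30s, so the spectrogram is always (1, 80, 3000)
inputs = feature_extractor(audio_10s, sampling_rate=16_000, return_tensors="tf")
print(inputs.input_features.shape)  # (1, 80, 3000)
```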
> On another note, if you have any clues on generating an int8 model, please share your thoughts.

I'm not an int8 expert, so I only have minimal pointers: see our Optimum library, which has support for int8 quantization.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@gante I separated the encoder and decoder TFLite models; however, while running inference on the decoder I only get a single output. Could you please review the notebook and let me know if you have any input for me?
Hi @nyadla-sys 👋 TF Lite is not (yet) a priority for us, as we don't have enough bandwidth to support it, so I'm afraid I won't be able to look at your notebook.
I was able to successfully separate the encoder and decoder Whisper TFLite models in the following notebook, and they are working correctly: https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb
Posting here to help HF users who are interested in Whisper TFLite models.
@sanchit-gandhi How do I get a transcript from the script below?
```python
import torch
from transformers import AutoFeatureExtractor, WhisperModel
from datasets import load_dataset

model = WhisperModel.from_pretrained("openai/whisper-base")
feature_extractor = AutoFeatureExtractor.from_pretrained("openai/whisper-base")
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt")
input_features = inputs.input_features
decoder_input_ids = torch.tensor([[1, 1]]) * model.config.decoder_start_token_id
last_hidden_state = model(input_features, decoder_input_ids=decoder_input_ids).last_hidden_state
list(last_hidden_state.shape)
```
@gante I am attempting to divide the TFWhisperModel into an encoder and a decoder, but the code I have is producing an error. Can you assist me in resolving this issue?
```python
import tensorflow as tf
from transformers import TFWhisperModel

class WhisperEncoder(TFWhisperModel):
    def call(self, inputs, **kwargs):
        return self.encoder(inputs, **kwargs)

class WhisperDecoder(TFWhisperModel):
    def call(self, inputs, **kwargs):
        return self.decoder(inputs, **kwargs)

model = TFWhisperModel.from_pretrained("openai/whisper-tiny")
encoder_model = WhisperEncoder.from_pretrained("openai/whisper-tiny")
decoder_model = WhisperDecoder.from_pretrained("openai/whisper-tiny")
tf.saved_model.save(encoder_model, "whisper_encoder_model_dir")
tf.saved_model.save(decoder_model, "whisper_decoder_model_dir")
```
Here is the error message:

```
TypeError: Exception encountered when calling layer "whisper_encoder" (type WhisperEncoder).

encoder() got an unexpected keyword argument 'training'

Call arguments received by layer "whisper_encoder" (type WhisperEncoder):
  • inputs={'input_features': 'tf.Tensor(shape=(2, 80, 2999), dtype=float32)', 'decoder_input_ids': 'tf.Tensor(shape=(1, 2), dtype=int32)'}
  • kwargs={'training': 'None'}
```
Hey @nyadla-sys 👋 The encoder and decoder components of Whisper, when isolated, are not compatible with from_pretrained. However, you can still serialize them separately, from different sources:
```python
import tensorflow as tf
from transformers import TFWhisperModel

model_1 = TFWhisperModel.from_pretrained("openai/whisper-tiny")
model_2 = TFWhisperModel.from_pretrained("openai/whisper-tiny")

tf.saved_model.save(model_1.get_encoder(), "/tmp/whisper/encoder")
tf.saved_model.save(model_2.get_decoder(), "/tmp/whisper/decoder")
```
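Each SavedModel directory could then be pointed at the TFLite converter in the usual way. This is only a sketch under the assumption that the exported signatures are usable as-is; in practice you may need to wrap each sub-model in a `@tf.function` with an explicit input signature before saving:

```python
import tensorflow as tf

for name in ("encoder", "decoder"):
    converter = tf.lite.TFLiteConverter.from_saved_model(f"/tmp/whisper/{name}")
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,  # prefer built-in TFLite ops
        tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TF ops where needed
    ]
    tflite_model = converter.convert()
    with open(f"whisper_{name}.tflite", "wb") as f:
        f.write(tflite_model)
```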
Hey @nyadla-sys! For inference, we can use the .generate() method to auto-regressively generate using the Whisper model:
```python
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
processor = AutoProcessor.from_pretrained("openai/whisper-base")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], return_tensors="pt")
input_features = inputs.input_features

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```
Print output:

```
[' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.']
```
@sanchit-gandhi Is it possible to generate a transcript using TFWhisperModel instead of WhisperForConditionalGeneration?
Hey @nyadla-sys!
TFWhisperModel is just the base encoder-decoder model that outputs decoder hidden-states: https://huggingface.co/docs/transformers/model_doc/whisper#transformers.TFWhisperModel
TFWhisperForConditionalGeneration adds a language modelling head on top of TFWhisperModel, mapping the decoder hidden-states to logits over the vocabulary: https://huggingface.co/docs/transformers/model_doc/whisper#transformers.TFWhisperForConditionalGeneration
So you'll need TFWhisperForConditionalGeneration in order to get logits over the vocab (and hence generate text).
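For reference, the TF analogue of the earlier PyTorch snippet would look along these lines (a sketch mirroring the example above, just swapping in TFWhisperForConditionalGeneration):

```python
from transformers import AutoProcessor, TFWhisperForConditionalGeneration
from datasets import load_dataset

model = TFWhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
processor = AutoProcessor.from_pretrained("openai/whisper-base")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="tf")

# auto-regressive generation with the LM head, then decode token ids to text
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```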
Hope that makes sense!
@sanchit-gandhi Is it possible to directly map the decoder hidden states to logits without using the language modeling head? I am focusing on using only TFWhisperModel because it can be fully converted into an int8 model. I'm curious if there is any way to generate text using the decoder hidden states without adding the language modeling head.
Hey @nyadla-sys, it's precisely the job of the language modelling head to directly map the decoder hidden-states to logits. The language modelling head is a single linear layer that maps from $\mathbb{R}^{d} \to \mathbb{R}^{v}$, where $d$ is the dimensionality of the hidden-states and $v$ is the dimensionality of the vocabulary, so for Whisper small this is a mapping from 768 -> 52000.
So if you need to map to the vocabulary, you're best off using TFWhisperForConditionalGeneration!
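For illustration only: Whisper ties the language modelling head to the decoder token embedding, so in principle that projection could be applied to TFWhisperModel's hidden-states by hand. A rough sketch follows, with the caveat that the embedding access below is an assumption that may differ between transformers versions, and that this is not a supported replacement for TFWhisperForConditionalGeneration:

```python
import tensorflow as tf
from transformers import TFWhisperModel

model = TFWhisperModel.from_pretrained("openai/whisper-tiny")

# dummy inputs: a 30s log-mel spectrogram and two decoder start tokens
input_features = tf.zeros((1, 80, 3000), dtype=tf.float32)
decoder_input_ids = tf.fill((1, 2), model.config.decoder_start_token_id)

last_hidden_state = model(input_features, decoder_input_ids=decoder_input_ids).last_hidden_state

# decoder token embedding matrix, shape (vocab_size, d_model); assumed here to be
# the same weights the LM head would use, since Whisper ties these weights
embedding_matrix = model.get_input_embeddings().weights[0]

# project hidden-states onto the vocabulary: (batch, seq_len, vocab_size)
logits = tf.matmul(last_hidden_state, embedding_matrix, transpose_b=True)
predicted_ids = tf.argmax(logits, axis=-1)
```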