Export & use T5-Base model for summarization
Hey guys,
I'm pretty new to CoreML conversion and took the naive approach of converting a T5-Base model to CoreML (I want to use it to generate summaries). As laid out in the README, I created an encoder and a decoder model, which worked without a problem:
(base) me@me-MacBook-Pro ~/Development/projects/exporters$ python -m exporters.coreml --model=t5-small --feature=text2text-generation exported ✭main
scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.0 has not been tested with coremltools. You may run into unexpected errors. Torch 1.12.1 is the most recent version that has been tested.
Converting encoder model...
Using framework PyTorch: 2.0.0
Overriding 1 configuration item(s)
- use_cache -> False
Skipping token_type_ids input
Converting PyTorch Frontend ==> MIL Ops: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 755/756 [00:00<00:00, 2482.08 ops/s]
Running MIL Common passes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 39/39 [00:00<00:00, 73.01 passes/s]
Running MIL Clean up passes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 27.71 passes/s]
Validating Core ML model...
-[✓] Core ML model output names match reference model ({'last_hidden_state'})
- Validating Core ML model output "last_hidden_state":
-[✓] (1, 128, 768) matches (1, 128, 768)
-[✓] all values close (atol: 0.0001)
All good, model saved at: exported/encoder_Model.mlpackage
Converting decoder model...
Using framework PyTorch: 2.0.0
Overriding 1 configuration item(s)
- use_cache -> False
/opt/homebrew/Caskroom/miniconda/base/lib/python3.9/site-packages/transformers/modeling_utils.py:828: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_mask.shape[1] < attention_mask.shape[1]:
Skipping token_type_ids input
Tuple detected at graph output. This will be flattened in the converted model.
Converting PyTorch Frontend ==> MIL Ops: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 1260/1262 [00:00<00:00, 2404.55 ops/s]
Running MIL Common passes: 5%|████████▊ | 2/39 [00:00<00:02, 15.47 passes/s]/opt/homebrew/Caskroom/miniconda/base/lib/python3.9/site-packages/coremltools/converters/mil/mil/passes/name_sanitization_utils.py:135: UserWarning: Output, '1761', of the source model, has been renamed to 'var_1761' in the Core ML model.
warnings.warn(msg.format(var.name, new_name))
Running MIL Common passes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 39/39 [00:01<00:00, 36.73 passes/s]
Running MIL Clean up passes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 14.41 passes/s]
Validating Core ML model...
-[✓] Core ML model output names match reference model ({'logits'})
- Validating Core ML model output "logits":
-[✓] (1, 64, 32100) matches (1, 64, 32100)
-[✓] all values close (atol: 0.0001)
All good, model saved at: exported/decoder_Model.mlpackage
This is where the fun begins :) I've only ever worked with the T5 model through transformers & pipelines, like this:
from transformers import T5TokenizerFast, T5ForConditionalGeneration

text = "summarize: The quick brown fox jumps over the lazy dog"

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base", return_dict=True)
model.to('cuda')

tokens = tokenizer(text, return_tensors="pt")
input_ids = tokens.input_ids

# generate() handles the whole encode/decode loop internally
outputs = model.generate(input_ids.cuda(), max_length=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
As far as I understand, by using the model.generate method the transformers utilities do all the heavy lifting here, like creating the attention_mask, running the encoder, passing the encoder_hidden_states along, and so on.
Am I right to assume that I would have to implement all this functionality by hand if I want to work with the CoreML encoder / decoder models?
I don't just want to use them in Python; I'd also like to use them in Swift. But I guess there's no easy plug-and-play solution here, right? :)
Indeed you would have to manage all that stuff yourself.
Edit: It might be useful if we provided some Swift wrapper code for this that would hide the complexity (since it's the same for most Transformer models) but right now we don't have this.
Yikes! I was ready to put my gloves on, but I've now spent two days trying to get the encoder / decoder models to run in Python without going through model.generate, without success (except for generating gibberish sentences :)
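For reference, this is roughly the loop I've been trying to reproduce — a minimal greedy-decoding sketch in Python, assuming the input/output names shown in the validation log above (last_hidden_state, logits) and the decoder input names mentioned later in this thread; your export may expose different names:

import numpy as np
import coremltools as ct
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
encoder = ct.models.MLModel("exported/encoder_Model.mlpackage")
decoder = ct.models.MLModel("exported/decoder_Model.mlpackage")

# Tokenize and pad to the encoder's sequence length (128, per the validation above).
text = "summarize: The quick brown fox jumps over the lazy dog"
enc = tokenizer(text, return_tensors="np", padding="max_length", max_length=128)
input_ids = enc["input_ids"].astype(np.int32)
attention_mask = enc["attention_mask"].astype(np.int32)

# Run the encoder once; its hidden states are reused at every decoder step.
hidden_states = encoder.predict({
    "input_ids": input_ids,
    "attention_mask": attention_mask,
})["last_hidden_state"]

# T5 starts decoding from the pad token and stops at </s>.
decoder_ids = [tokenizer.pad_token_id]
for _ in range(40):
    logits = decoder.predict({
        "decoder_input_ids": np.array([decoder_ids], dtype=np.int32),
        "decoder_attention_mask": np.ones((1, len(decoder_ids)), dtype=np.int32),
        "encoder_last_hidden_state": hidden_states,
        "encoder_attention_mask": attention_mask,
    })["logits"]
    next_id = int(logits[0, -1].argmax())  # greedy: take the most likely token
    if next_id == tokenizer.eos_token_id:
        break
    decoder_ids.append(next_id)

print(tokenizer.decode(decoder_ids, skip_special_tokens=True))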
@hollance Hey, I got around to implementing "that stuff" and have it running in Swift on macOS and iOS now :) However, the converted model runs exclusively on the CPU (although the Performance Report suggests that some layers are available for GPU / ANE processing, see screenshot). Is there anything I can do to make this happen? Right now it works, but it's rather slow.
Hi @seboslaw!
I've recently done a similar exercise and discovered that if the model accepts flexible shapes, then Core ML only uses the CPU. In the case of sequence-to-sequence models such as T5, the decoder is configured to accept inputs whose length is unbounded, as you can see in the Predictions tab of Xcode (1 × 1… means a batch size of 1 and a sequence length of at least 1, with no upper bound):
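You can also check this from Python by inspecting the model spec — a quick sketch, using whatever path your exported package lives at:

import coremltools as ct

spec = ct.models.MLModel("exported/decoder_Model.mlpackage").get_spec()
for inp in spec.description.input:
    ranges = inp.type.multiArrayType.shapeRange.sizeRanges
    # shapeRange is populated for flexible inputs; upperBound == -1 means unbounded
    print(inp.name, [(r.lowerBound, r.upperBound) for r in ranges])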
I tried to work around this issue by using fixed shapes, but so far I've only tested autoregressive models. Using a fixed sequence length of, say, 128 makes it possible for Core ML to engage the GPU (even though the ANE is still unused). I'm not sure if this is practical or even possible for the model you are interested in, as the sequence length depends a lot on your particular use case.
In addition, using fixed shapes requires that you prepare your inputs using padding and the appropriate attention masks, which is a bit more work to be done in the Swift code.
This is a very interesting area for us, and as Matthijs mentioned we are considering whether to create some Swift wrappers and a set of "best practices" for conversion to help with these tasks. (No promises though, we're still assessing the problem :)
Hey @pcuenca, thanks for your reply!
I've tried your suggestion (I think I did :) and updated the upperBounds of the input parameters. However, the Performance Report still says "CPU only" (see below) :(
I used coremltools to edit the inputs of my already converted decoder model:
import coremltools

# Load the exported decoder and grab its spec so the I/O shapes can be edited.
model = coremltools.models.MLModel('../Common/dec.mlpackage')
spec = model.get_spec()

# Cap the flexible sequence dimension (axis 1) of each input and the output.
spec.description.input[0].type.multiArrayType.shapeRange.sizeRanges[1].upperBound = 128
spec.description.input[1].type.multiArrayType.shapeRange.sizeRanges[1].upperBound = 1
spec.description.input[2].type.multiArrayType.shapeRange.sizeRanges[1].upperBound = 1
spec.description.input[3].type.multiArrayType.shapeRange.sizeRanges[1].upperBound = 1
spec.description.output[0].type.multiArrayType.shapeRange.sizeRanges[1].upperBound = 1

# Rebuild the model from the edited spec (weights_dir is required for mlprogram models).
model = coremltools.models.MLModel(spec, weights_dir=model.weights_dir)
model.save("YourNewModel.mlpackage")
Since this didn't seem to work, I looked into providing the inputs to the HF exporters tool directly. But then I saw in the README that "The sequence_length specified in the configuration object is ignored" if "seq2seq" is provided.
I've tried your suggestion (I think I did :) and updated the upperBounds of the input parameters
Sorry, I think I wasn't clear. I didn't mean to make the upper limit bounded, but to use fixed shapes for all dimensions. This is an example of a model where Core ML uses the GPU for all operations: the first dimension is always 1, and the second dimension is always 128. My apologies for the confusion!
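In coremltools terms (just a sketch for illustration, not the exact code path the exporter takes), the difference is whether the sequence dimension is a RangeDim or a plain integer:

import numpy as np
import coremltools as ct

# Flexible: the sequence axis is a RangeDim (upper_bound=-1 means unbounded),
# which is what currently keeps the model on the CPU.
flexible = ct.TensorType(name="decoder_input_ids",
                         shape=(1, ct.RangeDim(lower_bound=1, upper_bound=-1)),
                         dtype=np.int32)

# Fixed: every dimension is a concrete integer, which lets Core ML plan for the GPU.
fixed = ct.TensorType(name="decoder_input_ids", shape=(1, 128), dtype=np.int32)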
Hey @pcuenca,
no worries - you were clear, I simply lack experience with the exporter :) I think I understand what needs to be done now. However, it seems that exporters currently doesn't support this, right?
I need to export the T5 as two separate models, thus providing the seq2seq parameter to my custom MLConfig. However, the README states that if I set sequence_length in my custom MLConfig, it will be ignored:
https://github.com/huggingface/exporters/tree/20e849200d2e4fb29711a7ed8f37c7a16234e60f#exporting-an-encoder-decoder-model
The sequence_length specified in the configuration object is ignored if "seq2seq" is provided.
Why is it this way anyway? And is there a way to get this done aside from patching convert.py?
This is what I've started with (only decoder_input_ids for now):
from collections import OrderedDict

from transformers import T5ForConditionalGeneration, T5TokenizerFast
from exporters.coreml import export
from exporters.coreml.config import InputDescription
from exporters.coreml.models import T5CoreMLConfig

class MyCoreMLConfig(T5CoreMLConfig):
    @property
    def inputs(self) -> OrderedDict[str, InputDescription]:
        input_descs = super().inputs
        # Ask for a fixed sequence length on the decoder input.
        input_descs["decoder_input_ids"].sequence_length = 128
        return input_descs

model_ckpt = "Einmalumdiewelt/T5-Base_GNAD"
base_model = T5ForConditionalGeneration.from_pretrained(model_ckpt, torchscript=True)
preprocessor = T5TokenizerFast.from_pretrained(model_ckpt)

coreml_config = MyCoreMLConfig(base_model.config, task="text2text-generation", seq2seq="decoder")
decoder_mlmodel = export(preprocessor, base_model, coreml_config)
decoder_mlmodel.save('Test.mlpackage')
In the meantime, I've tried editing the MLModel exported by the exporter through coremltools:
import coremltools
from coremltools.proto import FeatureTypes_pb2

# Load the exported decoder and edit its spec.
model = coremltools.models.MLModel('../Common/dec.mlpackage')
spec = model.get_spec()

# Build a fixed-shape MultiArrayType and swap it in for the first input.
new_type = FeatureTypes_pb2.ArrayFeatureType()
new_type.shape.extend([1, 128])
new_type.dataType = FeatureTypes_pb2.ArrayFeatureType.INT32
spec.description.input[0].type.multiArrayType.CopyFrom(new_type)

# Rebuild and save (weights_dir is required for mlprogram models).
model = coremltools.models.MLModel(spec, weights_dir=model.weights_dir)
model.save("YourNewModel.mlpackage")
However, I receive this error:
/opt/homebrew/lib/python3.10/site-packages/coremltools/models/model.py:146: RuntimeWarning: You will not be able to run predict() on this Core ML model. Underlying exception message was: Error compiling model: "compiler error: Encountered an error while compiling a neural network model: validator error: Model input 'decoder_input_ids' has a different shape than its corresponding parameter to main.".
_warnings.warn(
So as far as I understand, modifying an exported MLModel is off the table. @pcuenca Do you think doing it the way described in my previous post will be possible?
Testing T5 is high up on my to-do list; I hope to get to it pretty soon, and hopefully I'll have some insight then :) Sorry for the non-answer though.
@pcuenca no worries and I totally understand :) Could you tell me real quick though why the sequence_length specified in the configuration object is ignored if "seq2seq" is provided? That way I can maybe start digging into the exporters implementation and try to fix it on my end.
I think I originally made it ignore the sequence_length because seq2seq models always need variable-length inputs. Well, unless you're trying to work around Core ML limitations, I guess. ;-)
@seboslaw What you tried to do here used to work, but in newer versions of Core ML it results in the error you've seen. The problem is that the model was compiled with flexible shapes and this is inconsistent with the (fixed) shape you assign later on.
I'm working in a local branch with some quick and dirty modifications to convert T5 using fixed shapes. I can push it later today so that you can keep testing on your end.
@seboslaw This is the branch: https://github.com/huggingface/exporters/pull/37. I have other local changes, so I hope I didn't break or miss anything. I verified that T5 encoder and decoder export with fixed shapes for all their inputs, and that Xcode's performance report successfully chooses the GPU for all operations. I haven't tried to run inference inside an app yet.
@pcuenca awesome! I’ll give it a try as soon as I’m in front of my computer. Thanks a lot already for the effort!
@pcuenca I tried it, but unfortunately it gives different results when compared to the non-GPU model. Hopefully I simply messed up the padding. Right now I'm focusing on the decoder. I padded as follows (see the sketch after this list):
decoder_input_ids: padded with 0s
decoder_attention_mask: leading 1s the size of the unpadded decoder_input_ids
encoder_last_hidden_state (1 x 128 x 768): padded the 2nd dimension (formerly 104, now 128) with zero-filled [768] arrays/tensors
encoder_attention_mask: leading 1s the size of the unpadded decoder_input_ids
Would you say that's correct?
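In numpy terms, this is the scheme I described (a sketch with the example sizes from above; the real encoder output replaces the zeros):

import numpy as np

enc_len, fixed_len, hidden = 104, 128, 768  # unpadded encoder length, padded length, hidden size
dec_len = 1                                 # decoder tokens so far (first step)

decoder_input_ids = np.zeros((1, fixed_len), dtype=np.int32)       # padded with 0s
decoder_attention_mask = np.zeros((1, fixed_len), dtype=np.int32)
decoder_attention_mask[0, :dec_len] = 1                            # leading 1s

encoder_last_hidden_state = np.zeros((1, fixed_len, hidden), dtype=np.float32)
# rows [0, enc_len) hold the real encoder output; the remaining 24 stay zero

encoder_attention_mask = np.zeros((1, fixed_len), dtype=np.int32)
encoder_attention_mask[0, :dec_len] = 1    # leading 1s the size of the unpadded
                                           # decoder_input_ids, as described above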
EDIT: Another problem I found is that the decoder_output.token_scores have the "wrong" dimension. Before, my decoder inputs on the very first run looked like this:
decoder_input_ids: [0]
decoder_attention_mask: [1]
encoder_last_hidden_state: array with dim 1x104x768
encoder_attention_mask: [1]
decoder_output.token_scores then had the output dimension 1x1x768.
With the new model my inputs look like this:
decoder_input_ids: [0,0,0,....0] (dim=128)
decoder_attention_mask: [1,0,0,0,0,....0] (dim=128)
encoder_last_hidden_state: MLMultiArray with dim 1x128x768 (the last 24 tensors of the 2nd dim are filled with 0s)
encoder_attention_mask: [1,0,0,0,0,....0] (dim=128)
decoder_output.token_scores now has the output dimension 1x128x768.
I'm not experienced with the seq2seq model architecture, but aren't the attention_masks supposed to suppress the additional decoder_input_ids entries/padding?
@seboslaw Did you get summarization to work in Swift? How did you implement it? I converted the model, but don't know how to use it, and wasn't able to find much information online.