Error Converting Minimal DETR Implementation - I think it originates from torch.nn.Transformer()
🐞Describing the bug
- I'm trying to convert a minimal example of DETR into CoreML
- The model trains successfully and I can trace it
- At the time of conversion I get an error about mismatched shapes in a matmul operation
- I think the error is coming from the decoder of the Transformer
Stack Trace
ValueError Traceback (most recent call last)
Cell In[13], line 1
----> 1 mlmodel = ct.convert(traced, inputs=[ct.ImageType(name="image", shape=[1,3,512,512])])
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/_converters_entry.py:574, in convert(model, source, inputs, outputs, classifier_config, minimum_deployment_target, convert_to, compute_precision, skip_model_load, compute_units, package_dir, debug, pass_pipeline)
566 specification_version = _set_default_specification_version(exact_target)
568 use_default_fp16_io = (
569 specification_version is not None
570 and specification_version >= AvailableTarget.iOS16
571 and need_fp16_cast_pass
572 )
--> 574 mlmodel = mil_convert(
575 model,
576 convert_from=exact_source,
577 convert_to=exact_target,
578 inputs=inputs,
579 outputs=outputs_as_tensor_or_image_types, # None or list[ct.ImageType/ct.TensorType]
580 classifier_config=classifier_config,
581 skip_model_load=skip_model_load,
582 compute_units=compute_units,
583 package_dir=package_dir,
584 debug=debug,
585 specification_version=specification_version,
586 main_pipeline=pass_pipeline,
587 use_default_fp16_io=use_default_fp16_io,
588 )
590 if exact_target == "mlprogram" and mlmodel._input_has_infinite_upper_bound():
591 raise ValueError(
592 "For mlprogram, inputs with infinite upper_bound is not allowed. Please set upper_bound"
593 ' to a positive value in "RangeDim()" for the "inputs" param in ct.convert().'
594 )
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/converter.py:188, in mil_convert(model, convert_from, convert_to, compute_units, **kwargs)
149 @_profile
150 def mil_convert(
151 model,
(...)
155 **kwargs
156 ):
157 """
158 Convert model from a specified frontend `convert_from` to a specified
159 converter backend `convert_to`.
(...)
186 See `coremltools.converters.convert`
187 """
--> 188 return _mil_convert(model, convert_from, convert_to, ConverterRegistry, MLModel, compute_units, **kwargs)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/converter.py:212, in _mil_convert(model, convert_from, convert_to, registry, modelClass, compute_units, **kwargs)
209 weights_dir = _tempfile.TemporaryDirectory()
210 kwargs["weights_dir"] = weights_dir.name
--> 212 proto, mil_program = mil_convert_to_proto(
213 model,
214 convert_from,
215 convert_to,
216 registry,
217 **kwargs
218 )
220 _reset_conversion_state()
222 if convert_to == 'milinternal':
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/converter.py:286, in mil_convert_to_proto(model, convert_from, convert_to, converter_registry, main_pipeline, **kwargs)
281 frontend_pipeline, backend_pipeline = _construct_other_pipelines(
282 main_pipeline, convert_from, convert_to
283 )
285 frontend_converter = frontend_converter_type()
--> 286 prog = frontend_converter(model, **kwargs)
287 PassPipelineManager.apply_pipeline(prog, frontend_pipeline)
289 PassPipelineManager.apply_pipeline(prog, main_pipeline)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/converter.py:108, in TorchFrontend.__call__(self, *args, **kwargs)
105 def __call__(self, *args, **kwargs):
106 from .frontend.torch.load import load
--> 108 return load(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/load.py:80, in load(spec, inputs, specification_version, debug, outputs, cut_at_symbols, use_default_fp16_io, **kwargs)
69 model = _torchscript_from_spec(spec)
71 converter = TorchConverter(
72 model,
73 inputs,
(...)
77 use_default_fp16_io,
78 )
---> 80 return _perform_torch_convert(converter, debug)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/load.py:99, in _perform_torch_convert(converter, debug)
97 def _perform_torch_convert(converter: TorchConverter, debug: bool) -> Program:
98 try:
---> 99 prog = converter.convert()
100 except RuntimeError as e:
101 if debug and "convert function" in str(e):
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/converter.py:519, in TorchConverter.convert(self)
516 self.convert_const()
518 # Add the rest of the operations
--> 519 convert_nodes(self.context, self.graph)
521 graph_outputs = [self.context[name] for name in self.graph.outputs]
523 # An output can be None when it's a None constant, which happens
524 # in Fairseq MT.
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/ops.py:88, in convert_nodes(context, graph)
85 context.quant_context.maybe_handle_quantized_inputs(node)
86 context.prepare_for_conversion(node)
---> 88 add_op(context, node)
90 if _TORCH_OPS_REGISTRY.is_inplace_op(op_lookup):
91 context.process_inplace_op(node)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/ops.py:6411, in scaled_dot_product_attention(context, node)
6408 else:
6409 mask = attn_mask
-> 6411 res = _lower_scaled_dot_product_attention(q, k, v, mask, node.name)
6412 context.add(res)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/ops.py:6347, in _lower_scaled_dot_product_attention(q, k, v, mask, name)
6343 q = mb.mul(x=q, y=multiplicative_scale_factor)
6345 # multiply query and key input tensors
6346 # shape of output: (target_seq, source_seq) or (B,...,target_seq, source_seq)
-> 6347 attn_weights = mb.matmul(x=q, y=k, transpose_y=True)
6349 # add mask if applicable
6350 if mask is not None:
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/ops/registry.py:182, in SSAOpRegistry.register_op.<locals>.class_wrapper.<locals>.add_op(cls, **kwargs)
179 else:
180 op_cls_to_add = op_reg[op_type]
--> 182 return cls._add_op(op_cls_to_add, **kwargs)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/builder.py:184, in Builder._add_op(cls, op_cls, **kwargs)
182 curr_block()._insert_op_before(new_op, before_op=before_op)
183 new_op.build_nested_blocks()
--> 184 new_op.type_value_inference()
185 if len(new_op.outputs) == 1:
186 return new_op.outputs[0]
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/operation.py:260, in Operation.type_value_inference(self, overwrite_output)
258 if not isinstance(output_types, tuple):
259 output_types = (output_types,)
--> 260 output_vals = self._auto_val(output_types)
261 try:
262 output_names = self.output_names()
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/operation.py:377, in Operation._auto_val(self, output_types)
374 if do_auto_val:
375 # Is self.value_inference implemented for corresponding input?
376 try:
--> 377 vals = self.value_inference()
378 except NotImplementedError:
379 do_auto_val = False
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/operation.py:111, in precondition.<locals>.decorator.<locals>.wrapper(self)
109 raise NotImplementedError(msg.format(self.op_type))
110 else:
--> 111 return func(self)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/linear.py:231, in matmul.value_inference(self)
229 if self.transpose_y.val:
230 y = np.transpose(y)
--> 231 return np.matmul(x, y)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 8 is different from 32)
To Reproduce
import torch
from torch import nn
import coremltools as ct
import numpy as np
from torchvision.models import resnet50, ResNet50_Weights
class DETRdemo(nn.Module):
    """
    Demo DETR implementation.
    Demo implementation of DETR in minimal number of lines, with the
    following differences wrt DETR in the paper:
    * learned positional encoding (instead of sine)
    * positional encoding is passed at input (instead of attention)
    * fc bbox predictor (instead of MLP)
    The model achieves ~40 AP on COCO val5k and runs at ~28 FPS on Tesla V100.
    Only batch size 1 supported.
    """
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()
        # create ResNet-50 backbone
        self.backbone = resnet50()
        del self.backbone.fc
        # create conversion layer
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        # create a default PyTorch transformer
        self.transformer = nn.Transformer(
            hidden_dim, nheads, num_encoder_layers, num_decoder_layers)
        # prediction heads, one extra class for predicting non-empty slots
        # note that in baseline DETR linear_bbox layer is 3-layer MLP
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        # output positional encodings (object queries)
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        # spatial positional encodings
        # note that in baseline DETR we use sine positional encodings
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        # propagate inputs through ResNet-50 up to avg-pool layer
        x = self.backbone.conv1(inputs)
        x = self.backbone.bn1(x)
        x = self.backbone.relu(x)
        x = self.backbone.maxpool(x)
        x = self.backbone.layer1(x)
        x = self.backbone.layer2(x)
        x = self.backbone.layer3(x)
        x = self.backbone.layer4(x)
        # convert from 2048 to 256 feature planes for the transformer
        h = self.conv(x)
        # construct positional encodings
        H, W = h.shape[-2:]
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # propagate through the transformer
        h = self.transformer(pos + 0.1 * h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1)).transpose(0, 1)
        # finally project transformer outputs to class labels and bounding boxes
        return self.linear_class(h)
m = DETRdemo(5).eval()
dummy_input = torch.rand((1,3,512,512))
traced = torch.jit.trace(m, dummy_input)
mlmodel = ct.convert(traced, inputs=[ct.ImageType(name="image", shape=[1,3,512,512])])
System environment (please complete the following information):
- coremltools version: 7.0
- OS (e.g. MacOS version or Linux type): Tried on macOS 14.2.1 and an Ubuntu VM with the same result
- Any other relevant version information (e.g. PyTorch or TensorFlow version): Python 3.10, PyTorch 2.0.0+cu118 and 2.0.1
Additional context
- At no other point do I get an error about matrix multiplication, which seems very odd
@kells1986 - Thanks for reporting this bug with code to reproduce it.
Looks like we have a bug in the value_inference method for our matmul op. Strictly speaking value_inference methods are not required. They are basically just optimizations. As a workaround, you can delete the value_inference method in the matmul class (in linear.py) from your local installed copy of coremltools.
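If you prefer not to edit the installed file, a rough equivalent is to monkey-patch the attribute away before calling ct.convert. This is only a sketch of the same workaround, using the module path shown in the stack trace above:

import coremltools.converters.mil.mil.ops.defs.iOS15.linear as mil_linear

# Remove the buggy value_inference so the converter skips constant folding
# for matmul instead of crashing (the base Operation implementation raises
# NotImplementedError, which the converter catches and ignores).
try:
    del mil_linear.matmul.value_inference
except AttributeError:
    pass  # attribute already absent in this coremltools version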
Also we don't support converting PyTorch models with image type inputs. So in your final line, you need to change ct.ImageType to ct.TensorType.
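For example, the final line of the repro above would then look something like this (a sketch of the suggested change, shapes unchanged):

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=(1, 3, 512, 512))],
)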
Once those two changes are made, the model can be converted and the predictions match with high accuracy.
I'll leave this issue open until we fix the value_inference bug.
Thanks @TobyRoseman, this solved the issue for the model I posted above.
I have two models that gave me the same problem so I posted the simplest code to reproduce the error. With your suggested fix the example above now works end-to-end.
However, in my second model I now get a different error. I've reduced the code down to a simple script:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertConfig
from transformers import DistilBertModel, DistilBertConfig
from transformers import MobileBertConfig, MobileBertModel
from transformers import AlbertModel, AlbertConfig
import numpy as np
import coremltools as ct
class FeedForward(nn.Module):
    def __init__(self, dims):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(len(dims) - 1):
            self.layers.append(nn.Linear(dims[i], dims[i + 1]))
        self.relu = nn.ReLU()

    def forward(self, x) -> torch.Tensor:
        for i in range(len(self.layers) - 1):
            x = self.layers[i](x)
            x = self.relu(x)
        x = self.layers[-1](x)
        return x


def get_encoder(model_name):
    if model_name == "distilbert-base-uncased":
        model = DistilBertModel.from_pretrained(model_name)
        config = DistilBertConfig.from_pretrained(model_name)
        return model, config
    elif model_name == "ybelkada/tiny-mobilebertmodel":
        model = MobileBertModel.from_pretrained(model_name)
        config = MobileBertConfig.from_pretrained(model_name)
        return model, config
    elif model_name == "albert-base-v2":
        model = AlbertModel.from_pretrained(model_name, torchscript=True)
        config = AlbertConfig.from_pretrained(model_name)
        return model, config
    else:
        raise ValueError(f"Unknown model name: {model_name}")


class TextDETR(nn.Module):
    def __init__(self,
                 num_classes,
                 num_meals,
                 num_actions,
                 nheads=2,
                 num_decoder_layers=4,
                 num_queries=20,
                 transformer_dropout=0.1,
                 classification_dropout=0.1,
                 meals_dropout=0.1,
                 actions_dropout=0.1,
                 bert_model_name="albert-base-v2"):
        super().__init__()
        # Load pre-trained BERT model as the encoder
        self.bert_encoder, config = get_encoder(bert_model_name)
        hidden_dim = config.hidden_size
        # Transformer decoder (assuming custom or another pre-trained model can be adapted similarly)
        self.transformer_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(
                d_model=hidden_dim, nhead=nheads, dropout=transformer_dropout
            ),
            num_layers=num_decoder_layers,
        )
        self.classification_dropout = nn.Dropout(classification_dropout)
        self.meals_dropout = nn.Dropout(meals_dropout)
        self.actions_dropout = nn.Dropout(actions_dropout)
        # Prediction heads for class labels
        self.linear_class = FeedForward([hidden_dim, hidden_dim * 4, num_classes + 1])
        self.linear_meal = FeedForward([hidden_dim, hidden_dim * 4, num_meals + 1])
        self.linear_action = FeedForward([hidden_dim, hidden_dim * 4, num_actions + 1])
        # Output positional encodings (object queries)
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))

    def forward(self, x, mask):
        encoder_outputs, _ = self.bert_encoder(
            input_ids=x, attention_mask=mask
        )
        # Extract the last hidden state as encoder output
        encoder_hidden_states = encoder_outputs.permute(1, 0, 2)
        # Decoder input: object queries + positional encodings
        # Note: You might need to adjust this part depending on how you design the decoder inputs
        decoder_input = self.query_pos.unsqueeze(1).repeat(1, x.size(0), 1)
        # Transformer decoder
        transformer_output = self.transformer_decoder(
            tgt=decoder_input,
            memory=encoder_hidden_states,
            memory_key_padding_mask=~mask.bool(),
        )
        # Project transformer outputs to class labels
        pred_class_logits = self.linear_class(
            self.classification_dropout(transformer_output.permute(1, 0, 2))
        )
        pred_meal_logits = self.linear_meal(
            self.meals_dropout(transformer_output.permute(1, 0, 2))
        )
        pred_action_logits = self.linear_action(
            self.actions_dropout(transformer_output.permute(1, 0, 2))
        )
        return pred_class_logits, pred_meal_logits, pred_action_logits


model = TextDETR(20, 3, 2).eval()
random_tokens = torch.randint(0, 30000, (1, 512))
mask = torch.ones(1, 512)
traced_model = torch.jit.trace(model, (random_tokens, mask))

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 512)), dtype=np.int32),
            ct.TensorType(name="mask", shape=(1, ct.RangeDim(1, 512)), dtype=np.int32)]
)
The error I get now is ValueError: Cannot add const (is41, is36, 2, 768).
Again, this model traces and scripts with no errors from the torch.jit functions. I can also convert it to ONNX and get a working model, so I'm assuming this is something on the Core ML side.
If you could shed some light on this error I'd be very grateful.
If I roll back to version 6.3.0 I get the error message: RuntimeError: PyTorch convert function for op 'unflatten' not implemented.
I don't see any calls to unflatten in the PyTorch source or in my model though, and I've ruled out the call to unsqueeze as the issue.
We lower PyTorch ops prior to conversion. One of your ops must be producing unflatten once it is lowered. I don't think using a previous version of coremltools makes sense here.
Ok, so once you apply my workaround, you're getting a new error with a different model. I can reproduce this too.
This problem looks totally separate from the original problem. We should probably have a separate GitHub issue for it.
This reshape is failing while trying to convert a (lowered) unflatten PyTorch op. This looks related to your flexible shape usage. I would expect this to work with a static shape.
@TobyRoseman Thanks, changing shape=(1,ct.RangeDim(1, 512)) -> shape=(1, 512) works.
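For reference, this is roughly what the working conversion call looks like with the static shape (the rest of the script is unchanged):

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 512), dtype=np.int32),
            ct.TensorType(name="mask", shape=(1, 512), dtype=np.int32)]
)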
Interestingly, I have to add compute_precision=ct.precision.FLOAT32 to the conversion call in order to get the same results from Core ML that I get from PyTorch. I've never seen this before; usually Float16 works very well.
Do you have any suggestions as to why this might be the case? I suspect it's not a Core ML issue and is more related to my training loop.
How much different are the values from the Float16 model? Are they completely wrong?
I think what might be happening here is that the neuralnetwork model type is being used for compute_precision=ct.precision.FLOAT32 and the mlprogram model type for Float16.
Instead of specifying the compute_precision parameter, try using the convert_to parameter directly. Convert the model once with convert_to='mlprogram' and once with convert_to='neuralnetwork'. Does the neuralnetwork output match but not the mlprogram output?
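Something along these lines, as a sketch that reuses traced_model and the static-shape inputs from above (running predict to compare outputs requires macOS):

inputs = [ct.TensorType(name="input_ids", shape=(1, 512), dtype=np.int32),
          ct.TensorType(name="mask", shape=(1, 512), dtype=np.int32)]
prog_model = ct.convert(traced_model, inputs=inputs, convert_to="mlprogram")
nn_model = ct.convert(traced_model, inputs=inputs, convert_to="neuralnetwork")
# Feed the same token/mask arrays through prog_model.predict(...),
# nn_model.predict(...), and the traced PyTorch model, then compare which
# backend's outputs diverge.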
The values aren't too far out, but just enough to make the predictions unreliable.
I found that setting the precision to FLOAT32 and then running this code:
import coremltools.optimize.coreml as cto
op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", weight_threshold=512)
config = cto.OptimizationConfig(global_config=op_config)
compressed_8_bit_model = cto.linear_quantize_weights(mlmodel, config=config)
resulted in an accurate 8-bit model.
I will try your suggestion today and report back.