Error Converting Minimal DETR Implementation - I think it originates from torch.nn.Transformer()
🐞Describing the bug
- I'm trying to convert a minimal example of DETR into CoreML
- The model trains successfully and I can trace it
- At the time of conversion I get an error about mismatched shapes in a matmul operation
- I think the error is coming from the decoder of the Transformer
Stack Trace
ValueError Traceback (most recent call last)
Cell In[13], line 1
----> 1 mlmodel = ct.convert(traced, inputs=[ct.ImageType(name="image", shape=[1,3,512,512])])
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/_converters_entry.py:574, in convert(model, source, inputs, outputs, classifier_config, minimum_deployment_target, convert_to, compute_precision, skip_model_load, compute_units, package_dir, debug, pass_pipeline)
566 specification_version = _set_default_specification_version(exact_target)
568 use_default_fp16_io = (
569 specification_version is not None
570 and specification_version >= AvailableTarget.iOS16
571 and need_fp16_cast_pass
572 )
--> 574 mlmodel = mil_convert(
575 model,
576 convert_from=exact_source,
577 convert_to=exact_target,
578 inputs=inputs,
579 outputs=outputs_as_tensor_or_image_types, # None or list[ct.ImageType/ct.TensorType]
580 classifier_config=classifier_config,
581 skip_model_load=skip_model_load,
582 compute_units=compute_units,
583 package_dir=package_dir,
584 debug=debug,
585 specification_version=specification_version,
586 main_pipeline=pass_pipeline,
587 use_default_fp16_io=use_default_fp16_io,
588 )
590 if exact_target == "mlprogram" and mlmodel._input_has_infinite_upper_bound():
591 raise ValueError(
592 "For mlprogram, inputs with infinite upper_bound is not allowed. Please set upper_bound"
593 ' to a positive value in "RangeDim()" for the "inputs" param in ct.convert().'
594 )
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/converter.py:188, in mil_convert(model, convert_from, convert_to, compute_units, **kwargs)
149 @_profile
150 def mil_convert(
151 model,
(...)
155 **kwargs
156 ):
157 """
158 Convert model from a specified frontend `convert_from` to a specified
159 converter backend `convert_to`.
(...)
186 See `coremltools.converters.convert`
187 """
--> 188 return _mil_convert(model, convert_from, convert_to, ConverterRegistry, MLModel, compute_units, **kwargs)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/converter.py:212, in _mil_convert(model, convert_from, convert_to, registry, modelClass, compute_units, **kwargs)
209 weights_dir = _tempfile.TemporaryDirectory()
210 kwargs["weights_dir"] = weights_dir.name
--> 212 proto, mil_program = mil_convert_to_proto(
213 model,
214 convert_from,
215 convert_to,
216 registry,
217 **kwargs
218 )
220 _reset_conversion_state()
222 if convert_to == 'milinternal':
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/converter.py:286, in mil_convert_to_proto(model, convert_from, convert_to, converter_registry, main_pipeline, **kwargs)
281 frontend_pipeline, backend_pipeline = _construct_other_pipelines(
282 main_pipeline, convert_from, convert_to
283 )
285 frontend_converter = frontend_converter_type()
--> 286 prog = frontend_converter(model, **kwargs)
287 PassPipelineManager.apply_pipeline(prog, frontend_pipeline)
289 PassPipelineManager.apply_pipeline(prog, main_pipeline)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/converter.py:108, in TorchFrontend.__call__(self, *args, **kwargs)
105 def __call__(self, *args, **kwargs):
106 from .frontend.torch.load import load
--> 108 return load(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/load.py:80, in load(spec, inputs, specification_version, debug, outputs, cut_at_symbols, use_default_fp16_io, **kwargs)
69 model = _torchscript_from_spec(spec)
71 converter = TorchConverter(
72 model,
73 inputs,
(...)
77 use_default_fp16_io,
78 )
---> 80 return _perform_torch_convert(converter, debug)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/load.py:99, in _perform_torch_convert(converter, debug)
97 def _perform_torch_convert(converter: TorchConverter, debug: bool) -> Program:
98 try:
---> 99 prog = converter.convert()
100 except RuntimeError as e:
101 if debug and "convert function" in str(e):
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/converter.py:519, in TorchConverter.convert(self)
516 self.convert_const()
518 # Add the rest of the operations
--> 519 convert_nodes(self.context, self.graph)
521 graph_outputs = [self.context[name] for name in self.graph.outputs]
523 # An output can be None when it's a None constant, which happens
524 # in Fairseq MT.
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/ops.py:88, in convert_nodes(context, graph)
85 context.quant_context.maybe_handle_quantized_inputs(node)
86 context.prepare_for_conversion(node)
---> 88 add_op(context, node)
90 if _TORCH_OPS_REGISTRY.is_inplace_op(op_lookup):
91 context.process_inplace_op(node)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/ops.py:6411, in scaled_dot_product_attention(context, node)
6408 else:
6409 mask = attn_mask
-> 6411 res = _lower_scaled_dot_product_attention(q, k, v, mask, node.name)
6412 context.add(res)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/ops.py:6347, in _lower_scaled_dot_product_attention(q, k, v, mask, name)
6343 q = mb.mul(x=q, y=multiplicative_scale_factor)
6345 # multiply query and key input tensors
6346 # shape of output: (target_seq, source_seq) or (B,...,target_seq, source_seq)
-> 6347 attn_weights = mb.matmul(x=q, y=k, transpose_y=True)
6349 # add mask if applicable
6350 if mask is not None:
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/ops/registry.py:182, in SSAOpRegistry.register_op.<locals>.class_wrapper.<locals>.add_op(cls, **kwargs)
179 else:
180 op_cls_to_add = op_reg[op_type]
--> 182 return cls._add_op(op_cls_to_add, **kwargs)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/builder.py:184, in Builder._add_op(cls, op_cls, **kwargs)
182 curr_block()._insert_op_before(new_op, before_op=before_op)
183 new_op.build_nested_blocks()
--> 184 new_op.type_value_inference()
185 if len(new_op.outputs) == 1:
186 return new_op.outputs[0]
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/operation.py:260, in Operation.type_value_inference(self, overwrite_output)
258 if not isinstance(output_types, tuple):
259 output_types = (output_types,)
--> 260 output_vals = self._auto_val(output_types)
261 try:
262 output_names = self.output_names()
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/operation.py:377, in Operation._auto_val(self, output_types)
374 if do_auto_val:
375 # Is self.value_inference implemented for corresponding input?
376 try:
--> 377 vals = self.value_inference()
378 except NotImplementedError:
379 do_auto_val = False
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/operation.py:111, in precondition.<locals>.decorator.<locals>.wrapper(self)
109 raise NotImplementedError(msg.format(self.op_type))
110 else:
--> 111 return func(self)
File /opt/conda/lib/python3.10/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/linear.py:231, in matmul.value_inference(self)
229 if self.transpose_y.val:
230 y = np.transpose(y)
--> 231 return np.matmul(x, y)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 8 is different from 32)
To Reproduce
import torch
from torch import nn
import coremltools as ct
import numpy as np
from torchvision.models import resnet50, ResNet50_Weights
class DETRdemo(nn.Module):
    """
    Demo DETR implementation.
    Demo implementation of DETR in minimal number of lines, with the
    following differences wrt DETR in the paper:
    * learned positional encoding (instead of sine)
    * positional encoding is passed at input (instead of attention)
    * fc bbox predictor (instead of MLP)
    The model achieves ~40 AP on COCO val5k and runs at ~28 FPS on Tesla V100.
    Only batch size 1 supported.
    """
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()
        # create ResNet-50 backbone
        self.backbone = resnet50()
        del self.backbone.fc
        # create conversion layer
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        # create a default PyTorch transformer
        self.transformer = nn.Transformer(
            hidden_dim, nheads, num_encoder_layers, num_decoder_layers)
        # prediction heads, one extra class for predicting non-empty slots
        # note that in baseline DETR linear_bbox layer is 3-layer MLP
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        # output positional encodings (object queries)
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        # spatial positional encodings
        # note that in baseline DETR we use sine positional encodings
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        # propagate inputs through ResNet-50 up to avg-pool layer
        x = self.backbone.conv1(inputs)
        x = self.backbone.bn1(x)
        x = self.backbone.relu(x)
        x = self.backbone.maxpool(x)
        x = self.backbone.layer1(x)
        x = self.backbone.layer2(x)
        x = self.backbone.layer3(x)
        x = self.backbone.layer4(x)
        # convert from 2048 to 256 feature planes for the transformer
        h = self.conv(x)
        # construct positional encodings
        H, W = h.shape[-2:]
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # propagate through the transformer
        h = self.transformer(pos + 0.1 * h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1)).transpose(0, 1)
        # finally project transformer outputs to class labels and bounding boxes
        return self.linear_class(h)
m = DETRdemo(5).eval()
dummy_input = torch.rand((1,3,512,512))
traced = torch.jit.trace(m, dummy_input)
mlmodel = ct.convert(traced, inputs=[ct.ImageType(name="image", shape=[1,3,512,512])])
System environment (please complete the following information):
- coremltools version: 7.0
- OS (e.g. MacOS version or Linux type): Tried on macOS 14.2.1 and an Ubuntu VM with the same result
- Any other relevant version information (e.g. PyTorch or TensorFlow version): Python 3.10, PyTorch 2.0.0+cu118 and 2.0.1
Additional context
- At no other point do I get an error about matrix multiplication, which seems very odd
@kells1986 - Thanks for reporting this bug with code to reproduce it.
Looks like we have a bug in the value_inference method for our matmul op. Strictly speaking value_inference methods are not required. They are basically just optimizations. As a workaround, you can delete the value_inference method in the matmul class (in linear.py) from your local installed copy of coremltools.
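If you prefer not to edit the installed file, a rough equivalent is to monkey-patch the attribute away before calling ct.convert. This is only a sketch of the same workaround, using the module path shown in the stack trace above:

import coremltools.converters.mil.mil.ops.defs.iOS15.linear as mil_linear

# Remove the buggy value_inference so the converter skips constant folding
# for matmul instead of crashing (the base Operation implementation raises
# NotImplementedError, which the converter catches and ignores).
try:
    del mil_linear.matmul.value_inference
except AttributeError:
    pass  # attribute already absent in this coremltools version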
Also we don't support converting PyTorch models with image type inputs. So in your final line, you need to change ct.ImageType to ct.TensorType.
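For example, the final line of the repro above would then look something like this (a sketch of the suggested change, shapes unchanged):

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=(1, 3, 512, 512))],
)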
Once those two changes are made, the model can be converted and the predictions match with high accuracy.
I'll leave this issue open until we fix the value_inference bug.
Thanks @TobyRoseman, this solved the issue for the model I posted above.
I have two models that gave me the same problem so I posted the simplest code to reproduce the error. With your suggested fix the example above now works end-to-end.
However, in my second model I now get a different error. I've reduced the code down to a simple script:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertConfig
from transformers import DistilBertModel, DistilBertConfig
from transformers import MobileBertConfig, MobileBertModel
from transformers import AlbertModel, AlbertConfig
import numpy as np
import coremltools as ct
class FeedForward(nn.Module):
    def __init__(self, dims):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(len(dims) - 1):
            self.layers.append(nn.Linear(dims[i], dims[i + 1]))
        self.relu = nn.ReLU()

    def forward(self, x) -> torch.Tensor:
        for i in range(len(self.layers) - 1):
            x = self.layers[i](x)
            x = self.relu(x)
        x = self.layers[-1](x)
        return x


def get_encoder(model_name):
    if model_name == "distilbert-base-uncased":
        model = DistilBertModel.from_pretrained(model_name)
        config = DistilBertConfig.from_pretrained(model_name)
        return model, config
    elif model_name == "ybelkada/tiny-mobilebertmodel":
        model = MobileBertModel.from_pretrained(model_name)
        config = MobileBertConfig.from_pretrained(model_name)
        return model, config
    elif model_name == "albert-base-v2":
        model = AlbertModel.from_pretrained(model_name, torchscript=True)
        config = AlbertConfig.from_pretrained(model_name)
        return model, config
    else:
        raise ValueError(f"Unknown model name: {model_name}")


class TextDETR(nn.Module):
    def __init__(self,
                 num_classes,
                 num_meals,
                 num_actions,
                 nheads=2,
                 num_decoder_layers=4,
                 num_queries=20,
                 transformer_dropout=0.1,
                 classification_dropout=0.1,
                 meals_dropout=0.1,
                 actions_dropout=0.1,
                 bert_model_name="albert-base-v2"):
        super().__init__()
        # Load pre-trained BERT model as the encoder
        self.bert_encoder, config = get_encoder(bert_model_name)
        hidden_dim = config.hidden_size
        # Transformer decoder (assuming custom or another pre-trained model can be adapted similarly)
        self.transformer_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(
                d_model=hidden_dim, nhead=nheads, dropout=transformer_dropout
            ),
            num_layers=num_decoder_layers,
        )
        self.classification_dropout = nn.Dropout(classification_dropout)
        self.meals_dropout = nn.Dropout(meals_dropout)
        self.actions_dropout = nn.Dropout(actions_dropout)
        # Prediction heads for class labels
        self.linear_class = FeedForward([hidden_dim, hidden_dim * 4, num_classes + 1])
        self.linear_meal = FeedForward([hidden_dim, hidden_dim * 4, num_meals + 1])
        self.linear_action = FeedForward([hidden_dim, hidden_dim * 4, num_actions + 1])
        # Output positional encodings (object queries)
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))

    def forward(self, x, mask):
        encoder_outputs, _ = self.bert_encoder(
            input_ids=x, attention_mask=mask
        )
        # Extract the last hidden state as encoder output
        encoder_hidden_states = encoder_outputs.permute(1, 0, 2)
        # Decoder input: object queries + positional encodings
        # Note: You might need to adjust this part depending on how you design the decoder inputs
        decoder_input = self.query_pos.unsqueeze(1).repeat(1, x.size(0), 1)
        # Transformer decoder
        transformer_output = self.transformer_decoder(
            tgt=decoder_input,
            memory=encoder_hidden_states,
            memory_key_padding_mask=~mask.bool(),
        )
        # Project transformer outputs to class labels
        pred_class_logits = self.linear_class(
            self.classification_dropout(transformer_output.permute(1, 0, 2))
        )
        pred_meal_logits = self.linear_meal(
            self.meals_dropout(transformer_output.permute(1, 0, 2))
        )
        pred_action_logits = self.linear_action(
            self.actions_dropout(transformer_output.permute(1, 0, 2))
        )
        return pred_class_logits, pred_meal_logits, pred_action_logits


model = TextDETR(20, 3, 2).eval()
random_tokens = torch.randint(0, 30000, (1, 512))
mask = torch.ones(1, 512)
traced_model = torch.jit.trace(model, (random_tokens, mask))

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 512)), dtype=np.int32),
            ct.TensorType(name="mask", shape=(1, ct.RangeDim(1, 512)), dtype=np.int32)]
)
The error I get now is ValueError: Cannot add const (is41, is36, 2, 768).
Again, this model traces and scripts with no errors from the torch.jit functions. I can also convert it to ONNX and get a working model, so I'm assuming this is something on the Core ML side.
If you could shed some light on this error I'd be very grateful.
If I roll back to version 6.3.0 I get the error message: RuntimeError: PyTorch convert function for op 'unflatten' not implemented.
I don't see any calls to unflatten in the PyTorch source or in my model though, and I've ruled out the call to unsqueeze as the issue.
We lower PyTorch ops prior to conversion. One of your ops must be producing unflatten once it is lowered. I don't think using a previous version of coremltools makes sense here.
Ok, so once you apply my workaround, you're getting a new error with a different model. I can reproduce this too.
This problem looks totally separate from the original problem. We should probably have a separate GitHub issue for it.
This reshape is failing while trying to convert a (lowered) unflatten PyTorch op. This looks related to your flexible shape usage. I would expect this to work with a static shape.
@TobyRoseman Thanks, changing shape=(1,ct.RangeDim(1, 512)) -> shape=(1, 512) works.
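For reference, this is roughly what the working conversion call looks like with the static shape (the rest of the script is unchanged):

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 512), dtype=np.int32),
            ct.TensorType(name="mask", shape=(1, 512), dtype=np.int32)]
)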
Interestingly, I have to add compute_precision=ct.precision.FLOAT32 to the conversion call in order to get the same results from Core ML that I get from PyTorch. I've never seen this before; usually Float16 works very well.
Do you have any suggestions as to why this might be the case? I suspect it's not a Core ML issue and is more related to my training loop.
How much different are the values from the Float16 model? Are they completely wrong?
I think what might be happening here is that the neuralnetwork model type is being used for compute_precision=ct.precision.FLOAT32 and the mlprogram model type for Float16.
Instead of specifying the compute_precision parameter, try using the convert_to parameter directly. Convert the model once with convert_to='mlprogram' and once with convert_to='neuralnetwork'. Does the neuralnetwork output match but not the mlprogram output?
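Something along these lines, as a sketch that reuses traced_model and the static-shape inputs from above (running predict to compare outputs requires macOS):

inputs = [ct.TensorType(name="input_ids", shape=(1, 512), dtype=np.int32),
          ct.TensorType(name="mask", shape=(1, 512), dtype=np.int32)]
prog_model = ct.convert(traced_model, inputs=inputs, convert_to="mlprogram")
nn_model = ct.convert(traced_model, inputs=inputs, convert_to="neuralnetwork")
# Feed the same token/mask arrays through prog_model.predict(...),
# nn_model.predict(...), and the traced PyTorch model, then compare which
# backend's outputs diverge.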
The values aren't too far out, but just enough to make the predictions unreliable.
I found that setting the precision to FLOAT32 and then running this code:
import coremltools.optimize.coreml as cto
op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", weight_threshold=512)
config = cto.OptimizationConfig(global_config=op_config)
compressed_8_bit_model = cto.linear_quantize_weights(mlmodel, config=config)
resulted in an accurate 8-bit model.
I will try your suggestion today and report back.