Conversion error for a model with the `transpose` operation
## 🐞 Describing the bug

When converting a model containing the `transpose` operation, an error may occur (`ValueError: axes don't match array`), preventing the conversion from completing.
## Stack Trace

<details>
<summary>Click to expand complete stack trace</summary>

```
Traceback (most recent call last):
  File "test_coreml_2.py", line 18, in <module>
    mlmodel = ct.convert(model, inputs=[ct.TensorType(shape=(1, 1, 2))])
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/_converters_entry.py", line 363, in convert
    debug=debug,
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/converter.py", line 183, in mil_convert
    return _mil_convert(model, convert_from, convert_to, ConverterRegistry, MLModel, compute_units, **kwargs)
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/converter.py", line 215, in _mil_convert
    **kwargs
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/converter.py", line 273, in mil_convert_to_proto
    prog = frontend_converter(model, **kwargs)
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/converter.py", line 95, in __call__
    return tf2_loader.load()
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/frontend/tensorflow/load.py", line 84, in load
    program = self._program_from_tf_ssa()
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/frontend/tensorflow2/load.py", line 200, in _program_from_tf_ssa
    return converter.convert()
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/frontend/tensorflow/converter.py", line 401, in convert
    self.convert_main_graph(prog, graph)
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/frontend/tensorflow/converter.py", line 330, in convert_main_graph
    outputs = convert_graph(self.context, graph, self.outputs)
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/frontend/tensorflow/convert_utils.py", line 189, in convert_graph
    add_op(context, node)
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/frontend/tensorflow/ops.py", line 1842, in Transpose
    x = mb.transpose(x=x, perm=perm, name=node.name)
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/mil/ops/registry.py", line 63, in add_op
    return cls._add_op(op_cls, **kwargs)
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/mil/builder.py", line 191, in _add_op
    new_op.type_value_inference()
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/mil/operation.py", line 243, in type_value_inference
    output_vals = self._auto_val(output_types)
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/mil/operation.py", line 330, in _auto_val
    vals = self.value_inference()
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/mil/operation.py", line 109, in wrapper
    return func(self)
  File "/coremltools_venv/lib/python3.7/site-packages/coremltools/converters/mil/mil/ops/defs/tensor_transformation.py", line 886, in value_inference
    return np.transpose(self.x.val, axes=self.perm.val)
  File "<__array_function__ internals>", line 6, in transpose
  File "/coremltools_venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 660, in transpose
    return _wrapfunc(a, 'transpose', axes)
  File "/coremltools_venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 54, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/coremltools_venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 43, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
ValueError: axes don't match array
```

</details>
## To Reproduce

Sample code with a minimal model causing this error:

```python
import tensorflow as tf
import coremltools as ct


class CustomTranspose(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(CustomTranspose, self).__init__(**kwargs)

    def call(self, inputs):
        # inputs shape should be: (B, 1, 2)
        mat_concat = tf.concat([inputs, inputs], axis=1)      # [B, 2, 2]
        mat_trans = tf.transpose(mat_concat, perm=[0, 2, 1])  # [B, 2, 2]
        return mat_trans


inputs = tf.keras.Input(shape=(1, 2))
outputs = CustomTranspose()(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
mlmodel = ct.convert(model, inputs=[ct.TensorType(shape=(1, 1, 2))])
```
## System environment

- coremltools version: 5.2, 6.0b1 (all versions are affected)
- OS: tested on Ubuntu 18, but all platforms are affected
- Other relevant version information: tested with Python 3.7 and TensorFlow 2.6/2.9, but all versions are affected
## Additional context

### Debugging results

I debugged the code and believe I found the cause. The problem lies in the implementation of the `value_inference` method:

```python
@precondition(allow=VALUE | SYMBOL)
def value_inference(self):
    return np.transpose(self.x.val, axes=self.perm.val)
```

The `transpose` operation's decorator is `@precondition(allow=VALUE | SYMBOL)`, yet the body dereferences `self.x.val` directly, without any check for a `None`/symbolic value. As a result, when the input is symbolic, `np.transpose` is called on `None`, which causes the error described in this issue.
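The failure mode can be reproduced in isolation with plain NumPy (a minimal sketch, not coremltools code; here `x_val = None` merely stands in for the `self.x.val` of a symbolic input):

```python
import numpy as np

x_val = None      # stands in for self.x.val of a symbolic input
perm = [0, 2, 1]  # stands in for self.perm.val

# np.asarray(None) produces a 0-d object array, whose rank (0) cannot
# match the 3-element perm -- yielding the exact error from the trace.
try:
    np.transpose(x_val, axes=perm)
except ValueError as err:
    print(err)  # -> axes don't match array
```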
### Potential fix

I am not very familiar with the coremltools internals, but it seems to me that a correct implementation would look like this:

```python
@precondition(allow=VALUE | SYMBOL)
def value_inference(self):
    if self.perm.val is None:
        # only allow x to be symbolic. perm cannot.
        return None
    return np.transpose(self.x.sym_val, axes=self.perm.val)
```

Another option to get rid of this error would be to remove `SYMBOL` from the decorator:

```python
@precondition(allow=VALUE)
def value_inference(self):
    return np.transpose(self.x.val, axes=self.perm.val)
```
Or, alternatively, add explicit checks for `None`:

```python
@precondition(allow=VALUE | SYMBOL)
def value_inference(self):
    if self.x.val is None:
        return None
    if self.perm.val is None:
        return None
    return np.transpose(self.x.val, axes=self.perm.val)
```
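To make the behavioral difference concrete, here is a self-contained toy comparison (hypothetical helper functions for illustration only, not the coremltools implementation): the guarded variant simply degrades to `None` instead of crashing when a concrete value is missing, and agrees with the unguarded one on concrete inputs.

```python
import numpy as np

def value_inference_unguarded(x_val, perm_val):
    # Mirrors the current buggy body: crashes when x_val is None.
    return np.transpose(x_val, axes=perm_val)

def value_inference_guarded(x_val, perm_val):
    # Mirrors the proposed fix: bail out instead of crashing.
    if x_val is None or perm_val is None:
        return None
    return np.transpose(x_val, axes=perm_val)

perm = [0, 2, 1]

# Concrete input: both variants produce the same result.
x = np.arange(4).reshape(1, 2, 2)
assert (value_inference_guarded(x, perm) == value_inference_unguarded(x, perm)).all()

# "Symbolic" input (no concrete value): guarded returns None, unguarded raises.
assert value_inference_guarded(None, perm) is None
try:
    value_inference_unguarded(None, perm)
except ValueError:
    print("unguarded variant raises ValueError")  # the bug in this issue
```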
If any of the maintainers agree with my interpretation and with one of the solutions above, I can prepare a PR with this change (but I'd like to discuss the topic first and make sure my solution is correct).
### Other operations affected by a similar problem?

I looked through the `value_inference` implementations of other operations, and it seems a similar problem may also exist with the `flatten2d` operation: it too has `allow=SYMBOL` in its precondition decorator while referring directly to `self.x.val` in the implementation. I think it's worth checking.
---

Thanks for the detailed write-up. I agree this is a problem.

Do any of your proposed solutions allow you to not only convert the model but also get predictions from the mlmodel?
---

Sure, of course. I came across this error while converting a larger model with an image input; the code in the "To Reproduce" section is just a minimal example that triggers the same problem.

After applying any of the fixes I proposed, my model converts and works: the predictions from the generated mlmodel match the predictions of the TensorFlow model before conversion.

In fact, the `value_inference` method is optional (per the documentation: "Optional Python implementation of the op") and has no direct effect on the generated mlmodel. The implementation bug simply caused the conversion to crash in a specific case (when the input value is symbolic).

A solution that returns `None`, or one with `allow=VALUE` only, is essentially the same as not implementing this optional method for symbolic inputs. That's why I prefer the first solution, which also returns the correct result for symbolic input.
---

Hi @andrusza2 - I have discussed this issue with my team.

We think the best fix is to change:

```python
@precondition(allow=VALUE | SYMBOL)
def value_inference(self):
```

to:

```python
@precondition(allow=VALUE)
def value_inference(self):
```

Could you put up a pull request for that change? Please also add your reproduction example as a unit test.

Regarding this potentially also being an issue with `flatten2d`: I think it should be fine. We should always know the shape and axis values, in which case it shouldn't be an issue.
---

Hi @TobyRoseman, I made a PR (#1563) with this change; please take a look.

Just out of curiosity: why is the version that disallows symbolic values preferred?

As for `flatten2d`: shape and axis shouldn't be a problem. What worries me is the direct reference to `self.x.val` in the return statement:

```python
return self.x.val.reshape(dim_pre_axis, dim_post_axis)
```

It still seems to me that, in theory, the same problem with a symbolic value could occur as with `transpose` (`self.x.val` being `None` in that case). Theoretically, because I think `flatten2d` is currently not used in any conversion, so I cannot give a working example of the problem. Please take another look if you can 🙂
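The concern can be illustrated in isolation (a sketch, not coremltools code): if `self.x.val` were `None` in a hypothetical symbolic case, the direct attribute access in that return statement would fail immediately.

```python
x_val = None  # stands in for self.x.val of a hypothetical symbolic input

try:
    x_val.reshape(2, 2)  # what flatten2d's return statement would attempt
except AttributeError as err:
    print(err)  # -> 'NoneType' object has no attribute 'reshape'
```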
---

Fixed by #1563.