torch d2go model to coreml conversion - Unknown error in building network shapes
Describe the bug
I'm attempting to convert the PyTorch-based d2go Mask R-CNN model with FBNETV2C4Backbone (https://github.com/facebookresearch/d2go/blob/main/configs/mask_rcnn_fbnetv3a_C4.yaml) to Core ML.
Conversion to MIL is successful:
Converting Frontend ==> MIL Ops: 100%|██████████| 1027/1029 [00:04<00:00, 217.11 ops/s]
Getting this far required adding a number of torch ops and making minor patches to some existing ones (I'll list these at the end).
However, compiling the converted model fails with an unknown error: "compiler error: Unknown error in building network shapes."
I've noticed that if I convert the model with convert_to="mlprogram", the error changes to "compiler error: Encountered an error while compiling a neural network model: in operation of type cond: cond must return same types from both branches".
Stack Trace
Translating MIL ==> NeuralNetwork Ops: 100%|██████████| 23/23 [00:00<00:00, 12615.27 ops/s]
Translating MIL ==> NeuralNetwork Ops: 100%|██████████| 1030/1030 [00:13<00:00, 76.75 ops/s]
/conversion/lib/python3.9/site-packages/coremltools/models/model.py:137: RuntimeWarning: You will not be able to run predict() on this Core ML model. Underlying exception message was: Error compiling model: "compiler error: Unknown error in building network shapes.".
but changes to
RuntimeWarning: You will not be able to run predict() on this Core ML model. Underlying exception message was: Error compiling model: "compiler error: Encountered an error while compiling a neural network model: in operation of type cond: cond must return same types from both branches".
when convert_to="mlprogram" is passed to the conversion call.
To Reproduce
I've not been able to isolate the layer causing this issue in the translation from MIL to NN, but I would be very happy to dig deeper and try to find a minimal example. If there are any suggestions on how I might do this, they would be really appreciated.
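One possible way to narrow this down, as a rough sketch only and not something tried in this report: bisect over the op sequence. The helper below is hypothetical. It assumes a user-supplied callback `compiles(k)` (not part of coremltools) that converts a model truncated to its first `k` ops and reports whether compilation succeeded. It also assumes failure is monotonic, meaning that once a prefix fails to compile, every longer prefix fails too.

```python
from typing import Callable


def first_failing_prefix(num_ops: int, compiles: Callable[[int], bool]) -> int:
    """Return the smallest k in [1, num_ops] whose first k ops fail to compile.

    Assumption: failure is monotonic, and the empty prefix (k == 0) compiles.
    `compiles` is a hypothetical callback supplied by the caller.
    """
    if compiles(num_ops):
        raise ValueError("the full model compiles; nothing to bisect")
    # Invariant: a prefix of length lo compiles, a prefix of length hi fails.
    lo, hi = 0, num_ops
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if compiles(mid):
            lo = mid
        else:
            hi = mid
    return hi
```

With roughly 1029 ops, as in the log above, this would need about a dozen conversion attempts rather than one per op. The op at index `k - 1` of the returned prefix is then the first candidate to inspect.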
I'm running a basic conversion script:
import coremltools as ct
import os
import torch
import torchvision
from d2go.model_zoo import model_zoo
from detectron2.export import TracingAdapter
import maskrcnn.ops


def inference_func(tmodel, image):
    # Run inference on a single image and skip postprocessing.
    inputs = [{"image": image}]
    return tmodel.inference(inputs, do_postprocess=False)[0]


def main():
    model = model_zoo.get('mask_rcnn_fbnetv3a_C4.yaml', trained=True)
    example = torch.rand(3, 224, 224)
    # Wrap the model so it can be traced with a plain tensor input.
    wrapper = TracingAdapter(model, example, inference_func)
    wrapper.eval()
    traced_model = torch.jit.trace(wrapper, (example,))
    inputs = [ct.ImageType(name="my_input", shape=(3, 224, 224))]
    coreml_model = ct.convert(
        traced_model,
        inputs=inputs
    )


if __name__ == '__main__':
    main()
System environment (please complete the following information):
- python dependencies
  - python=3.9
  - pytorch=1.9.1
  - torchvision=0.10.1
  - coremltools=5.2.0
- macOS 12.3.1
Additional context
Added torch ops:
- nms
- repeat_interleave
- numel
- roi_align
- logicaland
Ops with minor patches:
- clamp
- narrow
- index - handle broadcasting indices
- max - handle the case when len(inputs) == 1
- split - handle the case when num_splits == 1
- to - handle the case where the dtype is not set; it should be inferred from the Tensor dtype (see https://pytorch.org/docs/stable/generated/torch.Tensor.to.html?highlight=#torch.Tensor.to)
- tupleunpack - handle the case when len(output) == 1
Please send us a pull request for those new torch ops and patches. Those would be great to add.
Without those changes, I can't reproduce your issue, so it's difficult for me to help. I would check the conditional in your network. It sounds like the two branches are not producing the same output type.
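To illustrate the kind of mismatch meant here, the following is a minimal sketch in plain Python, not coremltools code. `TensorSig` and `branch_mismatches` are hypothetical names. The sketch assumes that "same types" means the same number of outputs with matching dtype and rank, which is a guess about what the compiler checks rather than a documented rule.

```python
from typing import List, NamedTuple, Sequence


class TensorSig(NamedTuple):
    """Hypothetical summary of one branch output: dtype name and rank."""
    dtype: str
    rank: int


def branch_mismatches(
    true_outs: Sequence[TensorSig], false_outs: Sequence[TensorSig]
) -> List[str]:
    """List the ways the two branches of a conditional disagree."""
    problems = []
    if len(true_outs) != len(false_outs):
        problems.append(
            f"output count differs: {len(true_outs)} vs {len(false_outs)}"
        )
    # Compare outputs pairwise; extra outputs are covered by the count check.
    for i, (t, f) in enumerate(zip(true_outs, false_outs)):
        if t.dtype != f.dtype:
            problems.append(f"output {i}: dtype {t.dtype} vs {f.dtype}")
        if t.rank != f.rank:
            problems.append(f"output {i}: rank {t.rank} vs {f.rank}")
    return problems
```

If the output signatures of each branch can be written down by hand from the model code, a check like this may point at the output that needs a cast or reshape in one of the branches.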
Ok, I'll put the changes into a PR so we can work from the same ops.