[PyTorch Conversion] SmolVLM model fails due to unsupported 'unfold' op in Core ML
🧠 Summary
I'm attempting to convert a Hugging Face multi-modal model (SmolVLM-256M-Instruct) from PyTorch to Core ML using `coremltools.convert()`. The conversion fails on the `unfold` operation, which is currently unsupported in Core ML's MIL frontend.
💻 Environment
macOS: 14.0 (Sonoma), internal version 26.x
Device: Apple Silicon (M1/M2)
Python: 3.10
coremltools: 8.0.0
torch: 2.1.0
transformers: 4.34.0
Model: SmolVLM-256M-Instruct (downloaded locally)
📦 Conversion Code
```python
import torch
import coremltools as ct
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

class SmolVLMWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, pixel_values, input_ids):
        return self.model(pixel_values=pixel_values, input_ids=input_ids).logits

model = AutoModelForVision2Seq.from_pretrained("path/to/local/model", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("path/to/local/model", trust_remote_code=True)
wrapped_model = SmolVLMWrapper(model).eval()

dummy_image = Image.new('RGB', (224, 224))
dummy_text = "<image>\ndescribe this image"
inputs = processor(text=dummy_text, images=dummy_image, return_tensors="pt")
example_input = (inputs['pixel_values'], inputs['input_ids'])

traced_model = torch.jit.trace(wrapped_model, example_input)

coreml_model = ct.convert(
    model=traced_model,
    source="pytorch",
    inputs=[
        ct.TensorType(name="pixel_values", shape=example_input[0].shape),
        ct.TensorType(name="input_ids", shape=example_input[1].shape),
    ],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,
    minimum_deployment_target=ct.target.iOS16,
    debug=True,
)
```
❌ Error Message
```
ERROR - converting 'unfold' op (located at: 'model/model/patches_subgrid.1'):
PyTorch convert function for op 'unfold' not implemented.
```
Also observed:

```
Core ML embedding (gather) layer does not support any inputs besides the weights and indices. Those given will be ignored.
```

📌 Notes
- The model uses `unfold` internally for patch extraction in the vision encoder.
- The conversion fails early, during the MIL graph construction phase; the offending node is visible in the traced graph (see the snippet after this list).
- I've confirmed the traced model returns static logits and contains no dynamic control flow.
- I'm not using TensorFlow/Keras in this environment.
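To confirm where the failure comes from, the `aten::unfold` nodes can be located with standard TorchScript introspection (a minimal sketch; `traced_model` is the traced wrapper from the conversion code above):

```python
# The inlined graph flattens submodule calls so all aten ops are visible
graph = traced_model.inlined_graph
unfold_nodes = [n for n in graph.nodes() if n.kind() == "aten::unfold"]
print(f"Found {len(unfold_nodes)} aten::unfold node(s)")
for n in unfold_nodes:
    print(n)
```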
🙏 Feature Request
Please consider adding support for the `unfold` operation in Core ML's PyTorch conversion path. This op is commonly used in vision models for patch embedding and is increasingly relevant for lightweight multi-modal architectures.
Alternatively, if there's a recommended workaround or rewrite pattern for `unfold`, I'd be happy to adapt the model; one idea I've sketched is below.
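For the non-overlapping case (step equal to window size, which is what ViT-style patch embedding uses), the two `unfold` calls could be replaced with reshape/permute ops that the converter already handles. A minimal sketch, assuming square patches and spatial dimensions divisible by the patch size (`PatchExtract` and `p` are names I made up for illustration):

```python
import torch

class PatchExtract(torch.nn.Module):
    """Sketch: equivalent to x.unfold(2, p, p).unfold(3, p, p) for the
    non-overlapping case, using only reshape/permute."""

    def __init__(self, p: int):
        super().__init__()
        self.p = p

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.p
        # Split H and W into (num_patches, patch_size) pairs:
        # [B, C, H/p, p, W/p, p]
        x = x.reshape(b, c, h // p, p, w // p, p)
        # Match unfold's output layout: [B, C, H/p, W/p, p, p]
        return x.permute(0, 1, 2, 4, 3, 5)

# Sanity check against the original unfold chain
x = torch.randn(1, 3, 8, 8)
assert torch.equal(PatchExtract(4)(x), x.unfold(2, 4, 4).unfold(3, 4, 4))
```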
Thanks for your work on Core ML — it’s a critical tool for bringing advanced AI models to Apple platforms!
We do have support for `im2col`:
https://github.com/apple/coremltools/blob/ea1d2deffd52f18e75962e2e600a4c29c1bab2f5/coremltools/converters/mil/frontend/torch/ops.py#L8288
This seems to be a less general form of `torch.unfold`:
https://github.com/apple/coremltools/blob/ea1d2deffd52f18e75962e2e600a4c29c1bab2f5/coremltools/converters/mil/frontend/torch/ops.py#L8292
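For the 2-D sliding-window case the two are interchangeable up to a reshape. A quick equivalence check (a sketch; the shapes follow the 1×3×8×8 minimal example below):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)

# Chained Tensor.unfold, as used in the model: [1, 3, 6, 6, 3, 3]
a = x.unfold(2, 3, 1).unfold(3, 3, 1)

# F.unfold (im2col) flattens each 3x3 window into dim 1: [1, 27, 36]
b = F.unfold(x, kernel_size=3, stride=1)

# Restore [B, C, kh, kw, H', W'], then move the window dims last
b = b.reshape(1, 3, 3, 3, 6, 6).permute(0, 1, 4, 5, 2, 3)

assert torch.allclose(a, b)
```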
```python
import torch
import coremltools as ct

# A minimal PyTorch model that uses the 'unfold' operation
class UnfoldModel(torch.nn.Module):
    def forward(self, x):
        # Simulate image patch extraction using sliding windows;
        # this mimics the behavior of patch embedding in vision models
        return x.unfold(2, 3, 1).unfold(3, 3, 1)

# Instantiate the model and set it to evaluation mode
model = UnfoldModel().eval()

# Create a dummy input tensor (e.g., a small image with shape [1, 3, 8, 8])
dummy_input = torch.randn(1, 3, 8, 8)

# Trace the model using TorchScript
traced_model = torch.jit.trace(model, dummy_input)

# Attempt to convert the traced model to Core ML format
coreml_model = ct.convert(
    model=traced_model,
    source="pytorch",
    inputs=[ct.TensorType(shape=dummy_input.shape)],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS16,
    debug=True,  # Enable debug mode to capture detailed logs
)
```
Thanks for the minimal example. I can reproduce the issue with it.