
[PyTorch Conversion] SmolVLM model fails due to unsupported 'unfold' op in Core ML

Open jefferyby opened this issue 2 months ago • 3 comments

🧠 Summary

I'm attempting to convert a HuggingFace multi-modal model (SmolVLM-256M-Instruct) from PyTorch to Core ML using coremltools.convert(). The conversion fails because the model uses the unfold operation, which the coremltools PyTorch frontend does not currently implement.

💻 Environment

macOS: 14.0 (Sonoma), internal version 26.x

Device: Apple Silicon (M1/M2)

Python: 3.10

coremltools: 8.0.0

torch: 2.1.0

transformers: 4.34.0

Model: SmolVLM-256M-Instruct (downloaded locally)

📦 Conversion Code

import torch
import coremltools as ct
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

# Wrapper exposing a single tensor output (logits) so the model can be traced
class SmolVLMWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, pixel_values, input_ids):
        return self.model(pixel_values=pixel_values, input_ids=input_ids).logits

# Load the local model and processor
model = AutoModelForVision2Seq.from_pretrained("path/to/local/model", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("path/to/local/model", trust_remote_code=True)
wrapped_model = SmolVLMWrapper(model).eval()

# Build example inputs with the processor so shapes match the real pipeline
dummy_image = Image.new('RGB', (224, 224))
dummy_text = "<image>\ndescribe this image"
inputs = processor(text=dummy_text, images=dummy_image, return_tensors="pt")
example_input = (inputs['pixel_values'], inputs['input_ids'])

# Trace with TorchScript before handing off to coremltools
traced_model = torch.jit.trace(wrapped_model, example_input)

# Convert the traced model to an ML Program
coreml_model = ct.convert(
    model=traced_model,
    source="pytorch",
    inputs=[
        ct.TensorType(name="pixel_values", shape=example_input[0].shape),
        ct.TensorType(name="input_ids", shape=example_input[1].shape)
    ],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,
    minimum_deployment_target=ct.target.iOS16,
    debug=True
)

❌ Error Message

ERROR - converting 'unfold' op (located at: 'model/model/patches_subgrid.1'):
PyTorch convert function for op 'unfold' not implemented.

Also observed:

Core ML embedding (gather) layer does not support any inputs besides the weights and indices. Those given will be ignored.

📌 Notes

The model uses unfold internally for patch extraction in the vision encoder.

The conversion fails early in the MIL graph construction phase.

I’ve confirmed the traced model returns logits with a static shape and does not contain dynamic control flow.

I’m not using TensorFlow/Keras in this environment.

🙏 Feature Request

Please consider adding support for the unfold operation in Core ML’s PyTorch conversion path. This op is commonly used in vision models for patch embedding and is increasingly relevant for lightweight multi-modal architectures.

Alternatively, if there’s a recommended workaround or rewrite pattern for unfold, I’d be happy to adapt the model.
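For context, the pattern could perhaps be rewritten with torch.nn.functional.unfold plus a reshape/permute. Here is a minimal sketch I have not yet tested against SmolVLM itself (extract_patches is a hypothetical helper, not part of the model):

import torch
import torch.nn.functional as F

def extract_patches(x, k, s):
    # Equivalent to x.unfold(2, k, s).unfold(3, k, s) for a [B, C, H, W] input
    B, C, H, W = x.shape
    Hp = (H - k) // s + 1
    Wp = (W - k) // s + 1
    # F.unfold returns [B, C*k*k, Hp*Wp]; the middle axis is ordered (C, kh, kw)
    patches = F.unfold(x, kernel_size=k, stride=s)
    # Rearrange to the [B, C, Hp, Wp, k, k] layout the double unfold produces
    return patches.view(B, C, k, k, Hp, Wp).permute(0, 1, 4, 5, 2, 3)

# Sanity check against the original op
x = torch.randn(1, 3, 8, 8)
assert torch.equal(extract_patches(x, 3, 1), x.unfold(2, 3, 1).unfold(3, 3, 1))

Whether this converts depends on coremltools support for F.unfold, so treat it as a starting point rather than a verified fix.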

Thanks for your work on Core ML; it’s a critical tool for bringing advanced AI models to Apple platforms!

jefferyby avatar Sep 30 '25 15:09 jefferyby

We do have support for im2col: https://github.com/apple/coremltools/blob/ea1d2deffd52f18e75962e2e600a4c29c1bab2f5/coremltools/converters/mil/frontend/torch/ops.py#L8288

This seems to be a less general form of torch.unfold: https://github.com/apple/coremltools/blob/ea1d2deffd52f18e75962e2e600a4c29c1bab2f5/coremltools/converters/mil/frontend/torch/ops.py#L8292
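For reference, a quick way to exercise that path (an untested sketch, assuming F.unfold traces to the im2col op that the linked function handles):

import torch
import torch.nn.functional as F
import coremltools as ct

class Im2ColModel(torch.nn.Module):
    def forward(self, x):
        # torch.nn.functional.unfold traces to im2col in TorchScript
        return F.unfold(x, kernel_size=3, stride=1)

traced = torch.jit.trace(Im2ColModel().eval(), torch.randn(1, 3, 8, 8))
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 3, 8, 8))],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS16,
)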

TobyRoseman avatar Sep 30 '25 16:09 TobyRoseman

import torch
import coremltools as ct

# A minimal PyTorch model that uses the 'unfold' operation
class UnfoldModel(torch.nn.Module):
    def forward(self, x):
        # Simulate image patch extraction using sliding windows,
        # mimicking patch embedding in vision models.
        # For a [1, 3, 8, 8] input, the result has shape [1, 3, 6, 6, 3, 3].
        return x.unfold(2, 3, 1).unfold(3, 3, 1)

# Instantiate the model and set to evaluation mode
model = UnfoldModel().eval()

# Create dummy input tensor (e.g., a small image with shape [1, 3, 8, 8])
dummy_input = torch.randn(1, 3, 8, 8)

# Trace the model using TorchScript
traced_model = torch.jit.trace(model, dummy_input)

# Attempt to convert the traced model to Core ML format
coreml_model = ct.convert(
    model=traced_model,
    source="pytorch",
    inputs=[ct.TensorType(shape=dummy_input.shape)],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS16,
    debug=True  # Enable debug mode to capture detailed logs
)
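This minimal script fails with the same 'unfold' not implemented error as the full SmolVLM conversion.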

jefferyby avatar Oct 02 '25 04:10 jefferyby

Thanks for the minimal example. I can reproduce the issue with it.

TobyRoseman avatar Oct 02 '25 17:10 TobyRoseman