
CoreML segfaults on torch.nn.Conv1d

Open metascroy opened this issue 5 months ago • 3 comments

🐞 Describing the bug

CoreML segfaults when running torch.ops.aten.conv1d.default.

To Reproduce


import numpy as np
import torch

import coremltools as ct


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv1d(16, 4, 6, stride=8, padding=0, dilation=2, groups=2, bias=False)

    def forward(self, x):
        return self.conv(x)


model = Model()
inputs = (torch.randn(2, 16, 11),)

eager_outputs = model(*inputs)

ep = torch.export.export(model, inputs)
ep = ep.run_decompositions({})

mlmodel = ct.convert(ep)

# Feed the eager inputs to the converted model; the model expects fp32 inputs.
coreml_inputs = mlmodel.get_spec().description.input
predict_inputs = {
    str(ct_in.name): pt_in.detach().cpu().numpy().astype(np.float32)
    for ct_in, pt_in in zip(coreml_inputs, inputs)
}
out = mlmodel.predict(predict_inputs)

print("Eager", eager_outputs)
print("CoreML", out)

The above code results in a segfault.
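As a side note (an editorial suggestion, not part of the original report), Python's standard faulthandler module can confirm that the crash happens inside predict by dumping the Python traceback when the interpreter receives SIGSEGV:

import faulthandler

# Dump the Python-level traceback on SIGSEGV; place this at the top of the
# repro script so the faulting call (here, mlmodel.predict) is identifiable.
faulthandler.enable()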

System environment (please complete the following information):

  • coremltools version: 8.3
  • OS (e.g. macOS version or Linux type): macOS 15

metascroy commented on Jul 30 '25

Reproduced on coremltools 9.0b1, macOS 15.6, torch 2.7.1. Also reproduced with a simplified conv: self.conv = torch.nn.Conv1d(16, 2, 3, dilation=2, groups=2) (changed to a more reasonable odd kernel size, with no stride or padding involved). It can also be reproduced using torch.jit.trace instead of torch.export.export (a sketch follows below). It requires both dilation and groups to segfault; removing either one no longer segfaults.
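A minimal sketch of that torch.jit.trace variant (assuming the Model class and input shape from the original repro; the input name "x" is arbitrary):

import torch

import coremltools as ct

model = Model().eval()  # Model class from the original repro
example_input = torch.randn(2, 16, 11)

# Trace the module instead of exporting it; a traced model needs the
# input shape spelled out explicitly for the converter.
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example_input.shape)],
)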

phhusson commented on Aug 03 '25

@metascroy This is a Core ML framework crash and we are tracking it internally; the model should not crash on the CPU_ONLY and CPU_AND_GPU compute units.
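A workaround sketch based on that observation (restricting the compute units at conversion time; not confirmed as the recommended fix):

import coremltools as ct

# CPU_ONLY (or CPU_AND_GPU) keeps execution off the ANE, where the crash
# occurs; `ep` is the exported program from the original repro.
mlmodel = ct.convert(ep, compute_units=ct.ComputeUnit.CPU_ONLY)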

cymbalrush commented on Sep 09 '25

Hey @cymbalrush,

I have been debugging this issue and found something that might help:

Crash Details:

  • Error: Error: # of batch has to be 1 (FILE: ZinNEConvLayer.cpp, LINE: 1155)
  • Crash location: libsystem_platform.dylib`_platform_memmove + 144
  • The segfault occurs specifically in the ANE compiler when batch size > 1

Here's the backtrace.txt of the run.
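A quick way to probe the batch-size hypothesis (an assumption drawn from the error message above, not verified here) is to rerun the original repro with batch size 1:

# Batch size 1 instead of 2 in the original repro; if the ANE batch
# check is the trigger, this variant would be expected not to crash.
inputs = (torch.randn(1, 16, 11),)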

Minimal Reproduction:

# With the previously provided Python script
# and versions: coremltools==9.0b1, torch==2.6.0 (macOS 15.5)
# launch Python under lldb (after activating the virtual environment)
lldb python
(lldb) run main.py  # or any .py file containing the repro code

It seems there's a bug in the ANE compiler's batch-size validation code: it detects the unsupported configuration but then crashes during a memory operation instead of falling back gracefully. This explains why setting compute_units=CPU_ONLY works, as it avoids the ANE compiler path entirely.
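The same restriction can also be applied when loading an already converted model (a sketch; the .mlpackage path is hypothetical):

import coremltools as ct

# Loading with restricted compute units likewise keeps inference off the ANE.
mlmodel = ct.models.MLModel(
    "conv1d_repro.mlpackage",  # hypothetical path to a saved copy of the model
    compute_units=ct.ComputeUnit.CPU_ONLY,
)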

Also, I am interested in helping resolve this issue. I understand it is being handled internally, but perhaps I can assist by testing patches (or providing more debugging info), or help with a fix if any part of this is open source?

noobsiecoder commented on Sep 12 '25