coremltools EXC_BAD_ACCESS (code=1, address=0x0) with models with multiple LayerNorm layers

🐞Describing the bug

CoreML crashes in [MLNeuralNetworkEngine predictionFromFeatures:options:error:] with EXC_BAD_ACCESS (code=1, address=0x0) when the model contains multiple Conv1D layers, each followed by a LayerNorm layer. With the same Conv1D configuration but without the LayerNorm layers, CoreML runs without crashing. The model was created with TensorFlow 2.6.2 on Python 3.9.12 and converted with coremltools 5.2.0. I'm filing this report here because I'm not sure whether this is an issue with coremltools or with CoreML itself.

Stack Trace

* thread #1, queue = 'com.apple.CoreMLBatchProcessingQueue', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00000001bb52711c libsystem_platform.dylib`_platform_memmove + 76
    frame #1: 0x00000001d0589b58 Espresso`EspressoLight::espresso_plan::__copy_inputs(std::__1::shared_ptr<EspressoLight::plan_task_t>, std::__1::shared_ptr<Espresso::abstract_batch> const&, int, std::__1::shared_ptr<Espresso::net>) + 1436
    frame #2: 0x00000001d0588e2c Espresso`EspressoLight::espresso_plan::dispatch_task_on_compute_batch(std::__1::shared_ptr<Espresso::abstract_batch> const&, std::__1::shared_ptr<EspressoLight::plan_task_t> const&) + 504
    frame #3: 0x00000001d0593a24 Espresso`EspressoLight::espresso_plan::execute_sync() + 412
    frame #4: 0x00000001d0599388 Espresso`espresso_plan_execute_sync + 132
    frame #5: 0x00000001c316bf40 CoreML`-[MLNeuralNetworkEngine executePlan:error:] + 136
    frame #6: 0x00000001c316c5e8 CoreML`-[MLNeuralNetworkEngine evaluateInputs:bufferIndex:options:error:] + 728
    frame #7: 0x00000001c316ead4 CoreML`__54-[MLNeuralNetworkEngine evaluateInputs:options:error:]_block_invoke + 44
    frame #8: 0x00000001005f63a8 libdispatch.dylib`_dispatch_client_callout + 20
    frame #9: 0x000000010060ab94 libdispatch.dylib`_dispatch_lane_barrier_sync_invoke_and_complete + 192
    frame #10: 0x00000001c316e8f8 CoreML`-[MLNeuralNetworkEngine evaluateInputs:options:error:] + 376
    frame #11: 0x00000001c3163568 CoreML`__62-[MLNeuralNetworkEngine predictionFromFeatures:options:error:]_block_invoke + 128
    frame #12: 0x00000001005f63a8 libdispatch.dylib`_dispatch_client_callout + 20
    frame #13: 0x000000010060ab94 libdispatch.dylib`_dispatch_lane_barrier_sync_invoke_and_complete + 192
    frame #14: 0x00000001c31633dc CoreML`-[MLNeuralNetworkEngine predictionFromFeatures:options:error:] + 436
  * frame #15: 0x000000010000d958 MyCoreMLCmdApp`MyModel.prediction(input=0x00000001010af390, options=0x00000001010ac080, self=0x00000001010cb250) at MyModel.swift:224:37
    frame #16: 0x000000010000d878 MyCoreMLCmdApp`MyModel.prediction(input=0x00000001010af390, self=0x00000001010cb250) at MyModel.swift:209:25
    frame #17: 0x00000001000045c8 MyCoreMLCmdApp`main at main.swift:10:25
    frame #18: 0x000000010004108c dyld`start + 520

To Reproduce

# Python 3.9.12
# pip install tensorflow==2.6.2 coremltools==5.2.0 "protobuf<=3.20"

import coremltools as ct
import tensorflow as tf
import os

def make_model(conv_layer_definitions: list[int]) -> tf.keras.Model:
    input = tf.keras.layers.Input(shape=(4096, 2))
    output = input
    for conv_layer_definition in conv_layer_definitions:
        output = tf.keras.layers.Conv1D(conv_layer_definition, 8, 2, "same")(output)
        # Uncomment this line to reproduce the problem
        # output = tf.keras.layers.LayerNormalization()(output)
    model = tf.keras.Model(inputs=input, outputs=output)
    model.compile(optimizer="SGD", loss="binary_crossentropy")
    model.summary()
    return model

conv_layer_definitions_not_working = [
    [32, 32, 64, 64, 128, 128, 256],
    [32, 32, 64, 64, 64, 64],
    [32, 32, 64, 64, 64],
    [128, 128, 128, 128],
    [256, 256, 256],
    [512, 512, 512],
    # Edited: This works
    # [4096],
]

conv_layer_definitions_works_well = [
    [32, 32, 64, 64],
    [64, 64, 64, 64],
    [128, 128, 128],
    [256, 256],
    [512, 512],
    [4096],
]

# Replace this with conv_layer_definitions_not_working[...] (and uncomment the LayerNormalization line above) and the Swift program below will crash as reported above.
conv_layer_definitions = conv_layer_definitions_works_well[-1]

os.system("rm -rf MyModel.mlpackage")

converted_model: ct.models.MLModel = ct.convert(
    make_model(conv_layer_definitions), convert_to="mlprogram", source="tensorflow"
)
converted_model.save("MyModel.mlpackage")

import CoreML

let model = try! MyModel()

let inputData = [Float](repeating: 0.1, count: 4096 * 2)
let inputArray = MLShapedArray(scalars: inputData, shape: [1, 4096, 2])
let input = MyModelInput(input_1: inputArray)

// Crashes here
let output = try! model.prediction(input: input)

print("\(output.IdentityShapedArray.shape)")
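As a sanity check on the shape this program should print: with padding "same" and stride 2, each Conv1D layer halves the sequence length (rounding up) and sets the channel count to its filter count, while LayerNormalization leaves the shape unchanged. A minimal sketch of that arithmetic follows; `expected_output_shape` is a hypothetical helper name, and the defaults (batch size 1, length 4096, stride 2) are taken from the reproduction script.

```python
import math


def expected_output_shape(filters, length=4096, stride=2, batch=1):
    """Expected (batch, length, channels) after a stack of stride-2, 'same'-padded Conv1D layers."""
    for _ in filters:
        # With "same" padding, the output length is ceil(input_length / stride).
        length = math.ceil(length / stride)
    # The channel count of the output is the filter count of the last layer.
    return (batch, length, filters[-1])


if __name__ == "__main__":
    print(expected_output_shape([4096]))
    print(expected_output_shape([32, 32, 64, 64, 128, 128, 256]))
```

Comparing this against the shape reported by a successful prediction can help confirm that a given configuration was converted as intended.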

Model training environment:

coremltools version: 5.2.0
OS: Ubuntu 18.04.6 LTS
Python version: 3.9.12
TensorFlow version: 2.6.2

Deployment target:

Xcode version: 13.4.1 (13F100)
OS: macOS 12.5
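
Version details like the ones listed above can also be collected programmatically, which makes it easier to compare a machine that crashes with one that does not. This is only a sketch using the Python standard library; `collect_versions` is a hypothetical helper name, and the package list is an assumption based on the packages mentioned in this report.

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages=("coremltools", "tensorflow", "numpy")):
    """Return a dict of OS, Python, and package versions for a bug report."""
    info = {
        "os": platform.platform(),
        # mac_ver() returns an empty release string on non-macOS systems.
        "macos": platform.mac_ver()[0] or "n/a (not macOS)",
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```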

Aug 07 '22 11:08 paxbun

Seems like LayerNorm makes pretty verbose MIL code... maybe the malloc()-like function used in CoreML returns nullptr when the allocation fails?

Aug 07 '22 13:08 paxbun

I can't reproduce this issue using our latest beta release.

In order to get it to fail, I'm supposed to replace this line:

conv_layer_definitions = conv_layer_definitions_works_well[-1]

with this:

conv_layer_definitions = conv_layer_definitions_not_working[-1]

Is that right?

If that is correct, then please try installing our latest beta release (via pip install coremltools --pre) and see if that fixes the issue.

Also, can you successfully get predictions from your converted model in Python?

Aug 08 '22 22:08 TobyRoseman

Seems like [4096] works (it didn't work when I initially tried), but [32, 32, 64, 64, 128, 128, 256] definitely does not work. Please try with conv_layer_definitions_not_working[0]. TensorFlow successfully gets predictions from my model, but CoreML crashes as shown below. I'll try with 6.0 as you said.

Changes in the Python script:

...
conv_layer_definitions = conv_layer_definitions_not_working[0]
...
converted_model: ct.models.MLModel = ct.convert(
    make_model(conv_layer_definitions), convert_to="mlprogram", source="tensorflow"
)
print("predicting...")
prediction_result = converted_model.predict({
    "input_1": np.zeros((1, 4096, 2), np.float32)
})
print(prediction_result["Identity"].shape)
converted_model.save("MyModel.mlpackage")

Execution result:

paxbun@PAXBUN-MAC conversion % cat main.py
...
conv_layer_definitions = conv_layer_definitions_not_working[0]
...
converted_model: ct.models.MLModel = ct.convert(
    make_model(conv_layer_definitions), convert_to="mlprogram", source="tensorflow"
)
print("predicting...")
prediction_result = converted_model.predict({
    "input_1": np.zeros((1, 4096, 2), np.float32)
})
print(prediction_result["Identity"].shape)
converted_model.save("MyModel.mlpackage")


paxbun@PAXBUN-MAC conversion % python main.py
...
Running TensorFlow Graph Passes: 100%|██████████████████████| 6/6 [00:00<00:00, 32.58 passes/s]
Converting Frontend ==> MIL Ops: 100%|████████████████████| 247/247 [00:00<00:00, 953.60 ops/s]
Running MIL Common passes: 100%|█████████████████████████| 34/34 [00:00<00:00, 206.78 passes/s]
Running MIL FP16ComputePrecision pass: 100%|████████████████| 1/1 [00:00<00:00,  3.54 passes/s]
Running MIL Clean up passes: 100%|██████████████████████████| 9/9 [00:00<00:00, 20.85 passes/s]
predicting...
zsh: segmentation fault  python main.py
/opt/homebrew/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '



paxbun@PAXBUN-MAC conversion % echo $?
130
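
Because a segmentation fault takes down the whole Python interpreter, it can help to run each prediction attempt in a child process and inspect how that process exited. The following is a minimal sketch, assuming a POSIX system; `run_isolated` is a hypothetical helper name and is not part of coremltools.

```python
import signal
import subprocess
import sys


def run_isolated(code: str) -> str:
    """Run a Python snippet in a child process and describe how it exited."""
    proc = subprocess.run([sys.executable, "-c", code])
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with status {proc.returncode}"


if __name__ == "__main__":
    # A child that exits normally; a crashing snippet would report a signal name instead.
    print(run_isolated("print('hello from the child process')"))
```

Passing the contents of the conversion script as `code` would let a loop over several layer configurations keep running even when one of them crashes.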

Aug 09 '22 00:08 paxbun

coremltools==6.0b2 does not work either:

(base) paxbun@PAXBUN-MAC conversion % conda create -n foo python=3.9
...

(base) paxbun@PAXBUN-MAC conversion % conda activate foo

(foo) paxbun@PAXBUN-MAC conversion % pip install tensorflow-macos==2.8.0 coremltools==6.0b2 "protobuf<=3.20" --pre
...
Installing collected packages: tf-estimator-nightly, termcolor, tensorboard-plugin-wit, mpmath, libclang, keras, flatbuffers, zipp, wrapt, urllib3, typing-extensions, tqdm, tensorboard-data-server, sympy, six, pyparsing, pyasn1, protobuf, oauthlib, numpy, MarkupSafe, idna, gast, charset-normalizer, cachetools, absl-py, werkzeug, rsa, requests, pyasn1-modules, packaging, opt-einsum, keras-preprocessing, importlib-metadata, h5py, grpcio, google-pasta, astunparse, requests-oauthlib, markdown, google-auth, coremltools, google-auth-oauthlib, tensorboard, tensorflow-macos
Successfully installed MarkupSafe-2.1.1 absl-py-1.2.0 astunparse-1.6.3 cachetools-5.2.0 charset-normalizer-2.1.0 coremltools-6.0b2 flatbuffers-2.0 gast-0.5.3 google-auth-2.10.0 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.48.0rc1 h5py-3.7.0 idna-3.3 importlib-metadata-4.12.0 keras-2.8.0 keras-preprocessing-1.1.2 libclang-14.0.6 markdown-3.4.1 mpmath-1.2.1 numpy-1.23.1 oauthlib-3.2.0 opt-einsum-3.3.0 packaging-21.3 protobuf-3.20.0 pyasn1-0.5.0rc1 pyasn1-modules-0.3.0rc1 pyparsing-3.0.9 requests-2.28.1 requests-oauthlib-1.3.1 rsa-4.9 six-1.16.0 sympy-1.10.1 tensorboard-2.8.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-macos-2.8.0 termcolor-1.1.0 tf-estimator-nightly-2.8.0.dev2021122109 tqdm-4.64.0 typing-extensions-4.3.0 urllib3-1.26.11 werkzeug-2.2.2 wrapt-1.14.1 zipp-3.8.1

(foo) paxbun@PAXBUN-MAC conversion % python main.py
...
predicting...
zsh: segmentation fault  python main.py
(foo) paxbun@PAXBUN-MAC conversion % /Users/paxbun/anaconda3/envs/foo/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

(foo) paxbun@PAXBUN-MAC conversion % echo $?
130

Aug 09 '22 00:08 paxbun

I still cannot reproduce this issue. On macOS 12.3, I can get predictions from your Core ML model just fine.

If this worked in macOS 12.3 but stopped working in 12.5, then it's an issue with the Core ML Framework.

Please report this issue here: https://developer.apple.com/bug-reporting/.

Aug 10 '22 17:08 TobyRoseman