Race condition writing and loading MLModel in Coremltools 8.1
With a Python debugger attached, I have started to see a race condition when loading the converted MLModel. I have not seen this happen without a debugger attached, and I did not hit it prior to the 8.1 release.
Specifically, it happens in the load_spec function in coremltools/models/utils.py (note that I modified the code to retry in a loop):
Looking in my file system, I indeed see that the file isn't there:
However, if I run one iteration of the loop (i.e. execute the loading code and release the Python GIL from the thread), the Manifest.json file gets created:
Notice the timestamp for Manifest.json is 5 minutes later. The next time the loop executes, the model is usually loaded.
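For reference, the retry workaround described above can be sketched roughly as follows. This is a minimal illustration and not the actual change made to load_spec: the helper name `wait_for_manifest`, the timeout, and the polling interval are all hypothetical, and it assumes Manifest.json sits at the top level of the saved model package.

```python
import os
import time


def wait_for_manifest(package_path, timeout=5.0, interval=0.05):
    """Poll until <package_path>/Manifest.json exists or the timeout expires.

    Hypothetical helper for illustration only; it is not part of coremltools.
    Returns True if the file appeared in time, False otherwise.
    """
    manifest = os.path.join(package_path, "Manifest.json")
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(manifest):
            return True
        if time.monotonic() >= deadline:
            return False
        # time.sleep releases the GIL, so other threads get a chance to run
        # between checks.
        time.sleep(interval)
```

A wait like this only hides the symptom. If the file really is written late, the underlying ordering problem between saving and loading would still need to be fixed.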
I also ran into a problem when stepping through the loop I added with the debugger; I get the following error:
an integer is required
  File "/Users/knielsen/Library/Application Support/hatch/env/virtual/stablehlo-coreml-experimental/Ux-esJH6/test.py3.12/lib/python3.12/site-packages/_pydevd_sys_monitoring\\_pydevd_sys_monitoring_cython.pyx", line 1367, in _pydevd_sys_monitoring_cython._jump_event
  File "<stringsource>", line 69, in cfunc.to_py.__Pyx_CFunc_7f6725__29_pydevd_sys_monitoring_cython_object__lParen__etc_to_py_4code_11from_offset_9to_offset.wrap
  File "/Users/knielsen/Library/Application Support/hatch/env/virtual/stablehlo-coreml-experimental/Ux-esJH6/test.py3.12/lib/python3.12/site-packages/coremltools/models/utils.py", line 256, in load_spec
    try:
  File "/Users/knielsen/Library/Application Support/hatch/env/virtual/stablehlo-coreml-experimental/Ux-esJH6/test.py3.12/lib/python3.12/site-packages/coremltools/models/model.py", line 531, in _get_proxy_and_spec
    specification = _load_spec(filename)
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/knielsen/Library/Application Support/hatch/env/virtual/stablehlo-coreml-experimental/Ux-esJH6/test.py3.12/lib/python3.12/site-packages/coremltools/models/model.py", line 469, in __init__
    self.__proxy__, self._spec, self._framework_error = self._get_proxy_and_spec(
Program to reproduce:
Run the following code with a Python debugger attached using coremltools 8.1:
import coremltools as ct
import numpy as np
from coremltools.converters.mil import Builder as mb


@mb.program(input_specs=[
    mb.TensorSpec(shape=(2, 3, 4, 5)),
    mb.TensorSpec(shape=(2, 4, 3, 5)),
])
def mil_program(arg0, arg1):
    arg0_reshaped = mb.reshape(x=arg0, shape=(1, 120))
    arg1_reshaped = mb.reshape(x=arg1, shape=(1, 120))
    result = mb.matmul(x=arg0_reshaped, y=arg1_reshaped, transpose_x=False, transpose_y=True)
    result = mb.reshape(x=result, shape=(1,))
    return result


cml_model = ct.convert(
    mil_program,
    source="milinternal",
    minimum_deployment_target=ct.target.iOS18,
)

inputs = {
    "arg0": np.random.normal(0.0, 1.0, (2, 3, 4, 5)),
    "arg1": np.random.normal(0.0, 1.0, (2, 4, 3, 5)),
}
predictions = cml_model.predict(inputs)
print(predictions)
I briefly attempted to find the bug myself, but the diff of the 8.1 release (https://github.com/apple/coremltools/pull/2394) is huge, so tracking it down would take more effort than I can spend.
@cymbalrush you might have more context on this issue ^
Investigating