
Returns NaN values when passing through a specific convolution layer

Open sangjaekang opened this issue 7 years ago • 22 comments

Using CoreML and coremltools, I want to implement a real-time segmentation deep learning model on the iPhone. When I converted the segmentation model trained in Keras to CoreML, I found a bug where NaN values occur.

Device :

  • MacBook Pro (15-inch, 2017), macOS High Sierra 10.13.6

Library :

  • keras==2.1.3
  • tensorflow==1.5.0

When passing from the MaxPooling layer to the Convolution layer in the segmentation model (U-Net), some of the output layers come out as NaN. Moreover, each time I repeat the conversion through coremltools, different layers come out as NaN. Please check the sample code and dataset at the link below. The Keras model never outputs NaN.

link : https://github.com/sangjaekang/bugReportForCoreML/blob/master/Debugging%20Coreml.ipynb

sangjaekang avatar Aug 17 '18 10:08 sangjaekang

Do you still get NaNs if you switch on the useCPUOnly flag? model.predict(input_dict, useCPUOnly=True)

aseemw avatar Aug 17 '18 17:08 aseemw

Yes, of course. But I still get NaNs.

coreml_inputs = {"image":Image.fromarray((img*255).astype(np.uint8))}
output = coreml_model.predict(coreml_inputs,useCPUOnly=True)['output']

sangjaekang avatar Aug 18 '18 14:08 sangjaekang

Since the input type of the CoreML model is "imageType" (print(coreml_model) will print the type), coremltools requires a PIL image as input instead of an np.uint8 array.

from PIL import Image
img = Image.open("data/sample.png")
coreml_inputs = {"image":img}

aseemw avatar Aug 19 '18 01:08 aseemw
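If only the preprocessed float array (rather than a file on disk) is at hand, the required PIL image can also be built from it directly. A minimal sketch, assuming `img` is a hypothetical float array in [0, 1] with shape (H, W, 3), standing in for the notebook's preprocessed sample:

```python
import numpy as np
from PIL import Image

# Hypothetical stand-in for the notebook's preprocessed image array.
img = np.random.rand(128, 128, 3)

# Image.fromarray expects uint8 data for an RGB image, so scale and cast;
# the result is a PIL image, which an "imageType" CoreML input accepts.
pil_img = Image.fromarray((img * 255).astype(np.uint8))
coreml_inputs = {"image": pil_img}
print(pil_img.mode, pil_img.size)  # RGB (128, 128)
```

The key point from the comment above still applies: the dictionary value must be the PIL image object itself, not the raw NumPy array.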

I have found that when I pass a PIL image to another Keras model, the correct result is obtained. But this case returns a very weird result. Please check the link below. ErrorReport

sangjaekang avatar Aug 19 '18 01:08 sangjaekang

I looked at the notebook; to me the CoreML output looks identical to Keras's:

img = Image.open("data/sample.png")
coreml_inputs = {"image":img}
cout = coreml_model.predict(coreml_inputs)['output']
print(cout.shape)
cout = np.transpose(cout, [1,2,0])
print(cout.shape, cout.dtype)
plt.imshow(cout[:,:,:3])
plt.show()

aseemw avatar Aug 19 '18 17:08 aseemw

The problem I'm referring to is that when converting the model, NaNs appear with a random probability.

The last cell result of the notebook is this:

0 : input, <keras.engine.topology.InputLayer object at 0xd1b9f0dd8>
1 : conv0-conv1, <keras.layers.convolutional.Conv2D object at 0xd1b9f0e48>
2 : activation_1, <keras.layers.core.Activation object at 0xd1b9f0f60>
3 : conv0-conv2, <keras.layers.convolutional.Conv2D object at 0xd1b9ff0f0>
4 : activation_2, <keras.layers.core.Activation object at 0xd1b9ff470>
5 : conv0-deep_concat, <keras.layers.merge.Concatenate object at 0xd1b9ff4e0>
6 : conv0-pool, <keras.layers.pooling.MaxPooling2D object at 0xd1b9ff518>
7 : conv1-conv1, <keras.layers.convolutional.Conv2D object at 0xd1b9ff5c0>

24th layer have 16384 nans
25th layer have 16384 nans
26th layer have 16384 nans
27th layer have 16384 nans
28th layer have 16384 nans
29th layer have 16384 nans
30th layer have 16384 nans
31th layer have 16384 nans

total NaN layer : 8

And on another conversion of the same model:

0 : input, <keras.engine.topology.InputLayer object at 0xd1b9f0dd8>
1 : conv0-conv1, <keras.layers.convolutional.Conv2D object at 0xd1b9f0e48>
2 : activation_1, <keras.layers.core.Activation object at 0xd1b9f0f60>
3 : conv0-conv2, <keras.layers.convolutional.Conv2D object at 0xd1b9ff0f0>
4 : activation_2, <keras.layers.core.Activation object at 0xd1b9ff470>
5 : conv0-deep_concat, <keras.layers.merge.Concatenate object at 0xd1b9ff4e0>
6 : conv0-pool, <keras.layers.pooling.MaxPooling2D object at 0xd1b9ff518>
7 : conv1-conv1, <keras.layers.convolutional.Conv2D object at 0xd1b9ff5c0>

17th layer have 16384 nans
20th layer have 16384 nans

This means that when I convert a model to CoreML, NaN results appear in particular layers with a random probability. This matches the NaN results I see when running on the iPhone.

sangjaekang avatar Aug 20 '18 01:08 sangjaekang

In the notebook that you have linked I still see the incorrect line coreml_inputs = {"image":Image.fromarray((img*255).astype(np.uint8))}. As pointed out above, this should be changed to coreml_inputs = {"image":Image.open("data/sample.png")}. I've run the notebook after this change and I do not see any NaNs (tried on macOS 10.14 beta).

And btw, the code does not actually loop over the layers of the model, since the converted model only exposes the output of the last layer of the Keras model. I don't see why the model conversion call is within the for loop; nothing changes across iterations. The loop seems to iterate over the first dimension of the CoreML output, which has shape (32, 128, 128). What is the point of that? You might as well do something much simpler without the loop:

coreml_model = coremltools.converters.keras.convert(
    model,
    input_names='image',
    image_input_names='image',
    output_names='output',
    image_scale=1/255.0,
)
coreml_inputs = {"image":Image.open("data/sample.png")}
output = coreml_model.predict(coreml_inputs,useCPUOnly=True)['output']
print(output.shape)
print("NAN Counts in coreml.predict :",np.isnan(output.flatten()).sum()) 

aseemw avatar Aug 20 '18 01:08 aseemw
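As a side note, the per-layer NaN counts printed in the notebook can be reproduced from the output array alone with NumPy, without re-converting the model. A sketch, using a synthetic array in place of the real (32, 128, 128) CoreML output, with NaNs injected into two channels to mimic the reported failure:

```python
import numpy as np

# Synthetic stand-in for the (32, 128, 128) CoreML output; two channels
# are set to NaN to imitate the behavior described in this thread.
output = np.zeros((32, 128, 128), dtype=np.float32)
output[24:26] = np.nan

nan_channels = 0
for ch in range(output.shape[0]):
    n = int(np.isnan(output[ch]).sum())
    if n:
        nan_channels += 1
        print(f"{ch}th channel has {n} nans")  # 16384 = 128 * 128
print("total NaN channels:", nan_channels)
```

This also shows where the 16384 figure in the notebook output comes from: a fully-NaN 128x128 plane.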

Thank you very much for your reply.

I don't see why the model conversion call is within the for loop, nothing changes across iterations. The loop seems to iterate the first dimension of the coreml output which has shape (32,128,128). What is the point of that?

On my MacBook (macOS High Sierra 10.13.6), every conversion gives different results, so I put it in a loop to show that.

When I run the above code, I get a different result. My result is this:

NAN Counts in coreml.predict : 131072

Since you got the right results, it was hard to tell whether this was a problem with my MacBook or with coremltools.

sangjaekang avatar Aug 20 '18 04:08 sangjaekang

I see, my bad for not grasping the issue earlier. The issue definitely seems related to the macOS version. I tried the script on macOS 10.13.2 and I am able to reproduce the inconsistent behavior of the CoreML predictions. Converting the model multiple times gives different predictions with NaNs. It is weird, since the hashes of the .mlmodel files are identical, yet they give different predictions. Will update the thread if we find the source of the problem. Hopefully the issue is not there on newer builds.

Btw, do you see similar behavior if you make an iOS app and run the model multiple times (in that case the input would be a CVPixelBuffer)?

aseemw avatar Aug 20 '18 17:08 aseemw
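The "identical hashes, different predictions" observation above can be checked with a few lines of standard-library Python. A sketch, where the `.mlmodel` paths are hypothetical placeholders for the repeatedly converted files (the demo writes two byte-identical dummy files in their place):

```python
import hashlib

def file_sha256(path):
    # Stream the file in chunks so large .mlmodel files need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo: two byte-identical files standing in for two conversion runs.
for name in ("model_run1.mlmodel", "model_run2.mlmodel"):
    with open(name, "wb") as f:
        f.write(b"identical serialized model bytes")

same = file_sha256("model_run1.mlmodel") == file_sha256("model_run2.mlmodel")
print("byte-identical:", same)
```

If the serialized models really are byte-identical, the nondeterminism must come from the runtime rather than the converter, which is what the comment above concludes.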

I've found the same repeated problems on a phone. The problem also occurred when simulating an iPhone 6 on iOS 11.4.1.

sangjaekang avatar Aug 22 '18 01:08 sangjaekang

When do you think this issue will be updated?

sangjaekang avatar Aug 22 '18 11:08 sangjaekang

Hi @sangjaekang, it would also be great if you could file a bug report at https://developer.apple.com/bug-reporting/ with steps to reproduce the random predictions in an iOS app. Ideally you should attach the model and the app code that demonstrates the inconsistent behavior. Thanks! Initially I thought the issue might be on the Python side with some memory leaks, but if it's happening in the app as well, this is something we need to look into in detail.

aseemw avatar Aug 22 '18 18:08 aseemw

When do you think this issue will be updated?

Was anyone able to resolve this? I am seeing NaN probabilities output by my CoreML activity classifier in the watchOS 5 simulator on Xcode 10.

swupnil avatar Sep 25 '18 07:09 swupnil

Hi @swupnil, did you get a chance to file a bug report with code to reproduce the NaN issue on iOS 12.0 or watchOS 5.0? If you have, please give us the bug report number and we'll track it internally.

aseemw avatar Sep 25 '18 18:09 aseemw

I've been trying to solve this issue as well. My generator model outputs correctly with coremltools (in Python). However, the model on iOS 13 (iPhone XS Max device) outputs a bunch of NaNs. My model has two inputs. When I set the first input to all-white or all-black pixels, the model gives a non-NaN result. However, most other inputs give NaN results. Even if I set usesCPUOnly to true, the same NaN issue occurs.

sebkim avatar Nov 15 '19 08:11 sebkim

I tested my model on the iPhone 11 Pro and it works!! I guess some neural layers only work on the Apple A13.

sebkim avatar Nov 18 '19 01:11 sebkim

@sebkim Hi, I'm facing a similar problem. The performance and output of the same model are weird on the XS and XS Max, while the model works fine on other devices like the iPhone 11 or iPhone 6. Did you find the reason for this problem?

zoooo0820 avatar Dec 06 '19 09:12 zoooo0820

@zoooo0820 I couldn't find the reason. I've reported this issue through Apple's feedback system, but they are quite slow to respond.

sebkim avatar Dec 06 '19 16:12 sebkim

@zoooo0820 Is it possible for you to share your model so that we can reproduce the issue at our end? Any reproducible toy model would also work.

DawerG avatar Dec 06 '19 17:12 DawerG

https://drive.google.com/open?id=11oY6uguu9uXO7SiUV3Zt62YUwcug0MxJ

Here is the model with which the issue can be reproduced.

sebkim avatar Dec 08 '19 10:12 sebkim

@zoooo0820 Is it possible for you to share your model so that we can reproduce the issue at our end? Any reproducible toy model would also work.

Sorry for the late reply. I found the difference is caused by the differing precision of the GPU and CPU. Now that I run inference on the CPU instead of the GPU, the result is fine. However, the GPU results on different devices, using the same model and the same input, are still different.

zoooo0820 avatar Dec 09 '19 08:12 zoooo0820
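The precision gap described above is consistent with the GPU path computing in reduced (float16-style) precision while the CPU path uses float32. A small illustration of that numeric effect (this is only a NumPy sketch of why the two paths can diverge on identical inputs, not the actual Metal kernels):

```python
import numpy as np

# Accumulate a small value many times in float32 vs float16. The float16
# sum stalls once the increment falls below half a unit in the last place,
# so the two precisions diverge even though the inputs are identical.
x = 1e-4
acc32 = np.float32(0.0)
acc16 = np.float16(0.0)
for _ in range(10_000):
    acc32 = np.float32(acc32 + np.float32(x))
    acc16 = np.float16(acc16 + np.float16(x))
print("float32 sum:", float(acc32))  # close to 1.0
print("float16 sum:", float(acc16))  # stalls well below 1.0
```

Deep networks chain thousands of such accumulations, so per-device differences in the reduced-precision path plausibly explain why the same model and input give different GPU results on different devices, while the float32 CPU path stays consistent.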

After updating to iOS 13.3, CoreML with usesCPUOnly set to true works on the iPhone XS Max!

sebkim avatar Dec 17 '19 07:12 sebkim

Multi-backend Keras support has been removed in coremltools 6.0.

TobyRoseman avatar Sep 20 '22 17:09 TobyRoseman