
Error while loading the prediction function in AnchorImage

Open aniketzz opened this issue 2 years ago • 33 comments

I have a TensorFlow-based ONNX classifier model. It's a binary classifier whose prediction function is called as "model.classify('image')". But I am getting the following errors while passing it to the Alibi AnchorImage explainer.


ERROR:root:Error reading [[[[0. 0. 0.]
   [0. 0. 0.]
   ...
   [0. 0. 0.]]]] OpenCV(4.5.3) /tmp/pip-req-build-fvfwe_ss/opencv/modules/imgproc/src/color.simd_helpers.hpp:94: error: (-2:Unspecified error) in function 'cv::impl::{anonymous}::CvtHelper<VScn, VDcn, VDepth, sizePolicy>::CvtHelper(cv::InputArray, cv::OutputArray, int) [with VScn = cv::impl::{anonymous}::Set<1>; VDcn = cv::impl::{anonymous}::Set<3, 4>; VDepth = cv::impl::{anonymous}::Set<0, 2, 5>; cv::impl::{anonymous}::SizePolicy sizePolicy = cv::impl::<unnamed>::NONE; cv::InputArray = const cv::_InputArray&; cv::OutputArray = const cv::_OutputArray&]'
> Unsupported depth of input image:
>     'VDepth::contains(depth)'
> where
>     'depth' is 6 (CV_64F)
Traceback (most recent call last):
  File "image_utils.py", line 135, in load_images
    image = load_img(img_path, target_size=image_size)
  File "image_utils.py", line 58, in load_img
    path = cv2.cvtColor(path, cv2.COLOR_BGR2RGB)
cv2.error: OpenCV(4.5.3) /tmp/pip-req-build-fvfwe_ss/opencv/modules/imgproc/src/color.simd_helpers.hpp:94: error: (-2:Unspecified error) in function 'cv::impl::{anonymous}::CvtHelper<VScn, VDcn, VDepth, sizePolicy>::CvtHelper(cv::InputArray, cv::OutputArray, int) [with VScn = cv::impl::{anonymous}::Set<1>; VDcn = cv::impl::{anonymous}::Set<3, 4>; VDepth = cv::impl::{anonymous}::Set<0, 2, 5>; cv::impl::{anonymous}::SizePolicy sizePolicy = cv::impl::<unnamed>::NONE; cv::InputArray = const cv::_InputArray&; cv::OutputArray = const cv::_OutputArray&]'
> Unsupported depth of input image:
>     'VDepth::contains(depth)'
> where
>     'depth' is 6 (CV_64F)

Traceback (most recent call last):
  File "classifier.py", line 178, in <module>
    explainer = AnchorImage(predict_fn, image_shape, segmentation_fn=segmentation_fn, segmentation_kwargs=kwargs, images_background=None)
  File "/home/tg/anaconda3/envs/alibi/lib/python3.7/site-packages/alibi/explainers/anchor_image.py", line 348, in __init__
    self.predictor = self._transform_predictor(predictor)
  File "/home/tg/anaconda3/envs/alibi/lib/python3.7/site-packages/alibi/explainers/anchor_image.py", line 605, in _transform_predictor
    if np.argmax(predictor(np.zeros((1,) + self.image_shape)).shape) == 0:
AttributeError: 'dict' object has no attribute 'shape'

However, I am able to get output from the prediction function directly: [[ 0.02799413353204727, 0.9720058441162109]]

aniketzz avatar Oct 27 '21 07:10 aniketzz

Hmm, the AttributeError suggests that the output of the prediction function is a dict instead of a numpy array, is that the case?
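A quick way to check is to call the prediction function exactly the way the explainer does at init time (a minimal diagnostic, using the predict_fn and image_shape you pass to AnchorImage):

import numpy as np

out = predict_fn(np.zeros((1,) + image_shape))
print(type(out))                    # should be <class 'numpy.ndarray'>
print(getattr(out, "shape", None))  # should be (1, n_classes), not None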

jklaise avatar Oct 27 '21 08:10 jklaise

The output of the prediction function is <class 'numpy.ndarray'>. I am not sure how the explainer is getting a dict or a list.

Here is the prediction function:

def classify( self, image_paths=[]):
        batch_size=4
        image_size=(256, 256)
        categories=["unsafe", "safe"]
       
        if not isinstance(image_paths, list):
            image_paths = [image_paths]

        loaded_images, loaded_image_paths = load_images(
            image_paths, image_size, image_names=image_paths
        )

        if not loaded_image_paths:
            return []

        preds = []
        model_preds = []
        while len(loaded_images):
            _model_preds = self.nsfw_model.run(
                [self.nsfw_model.get_outputs()[0].name],
                {self.nsfw_model.get_inputs()[0].name: loaded_images[:batch_size]},
            )[0]
            model_preds.append(_model_preds)
            preds += np.argsort(_model_preds, axis=1).tolist()
            loaded_images = loaded_images[batch_size:]

        probs = np.array([])
        for i, single_preds in enumerate(preds):
            single_probs = np.array([])
            for j, pred in enumerate(single_preds):
                single_probs= np.append(single_probs,
                    model_preds[int(i / batch_size)][int(i % batch_size)][pred]
                )
                preds[i][j] = categories[pred]
            # print(single_probs)
            probs= np.append(probs,single_probs)
        # print(probs)
        return probs

aniketzz avatar Oct 27 '21 08:10 aniketzz

I see, the input of the prediction function should also be a numpy array corresponding to a batch of images (see docs). Because we try to call the prediction function with an np.zeros instance, this is probably failing, as it currently expects a list of image paths instead. I would suggest refactoring so that the prediction function works on already loaded images as numpy arrays, for example along the lines of the sketch below.
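For illustration, something along these lines would match that contract (a rough sketch only; nsfw_model stands in for your loaded ONNX session and the exact preprocessing is up to you):

import numpy as np

def predict_fn(images: np.ndarray) -> np.ndarray:
    """Classify a batch of already-loaded images of shape (N, 256, 256, 3)."""
    input_name = nsfw_model.get_inputs()[0].name
    output_name = nsfw_model.get_outputs()[0].name
    # feed the numpy batch straight to the ONNX session
    preds = nsfw_model.run([output_name], {input_name: images})[0]
    return preds  # probabilities of shape (N, n_classes)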

jklaise avatar Oct 27 '21 08:10 jklaise

I changed the input image to a numpy array and changed the prediction function accordingly. Here is what I got after passing the prediction function to Alibi.

===========================================================
**Model input and output:**

INPUT IMAGE =  (1, 256, 256, 3)
OUTPUT:  <class 'numpy.ndarray'>
RESULT:  [[0.9703339  0.02966616]]
===========================================================
**Alibi Explainer:** 

Traceback (most recent call last):
  File "classifier.py", line 148, in <module>
    explainer = AnchorImage(predict_fn, image_shape, segmentation_fn=segmentation_fn, segmentation_kwargs=kwargs, images_background=None)
  File "/home/tg/anaconda3/envs/alibi/lib/python3.7/site-packages/alibi/explainers/anchor_image.py", line 348, in __init__
    self.predictor = self._transform_predictor(predictor)
  File "/home/tg/anaconda3/envs/alibi/lib/python3.7/site-packages/alibi/explainers/anchor_image.py", line 605, in _transform_predictor
    if np.argmax(predictor(np.zeros((1,) + self.image_shape)).shape) == 0:
  File "classifier.py", line 142, in <lambda>
    predict_fn = lambda x: m.classify(x)
  File "classifier.py", line 105, in classify
    {self.nsfw_model.get_inputs()[0].name: loaded_images[:batch_size]},
  File "/home/tg/anaconda3/envs/alibi/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(double)) , expected: (tensor(float))

aniketzz avatar Oct 27 '21 10:10 aniketzz

Ah, this is an issue we recently fixed in master (see #506). Your model expects float32 numbers whilst we try to pass in float64 (aka double) numbers. In the development version we expose another kwarg dtype which is set to np.float32 by default. You would have to install the development version of alibi and it should work without any changes:

pip install git+https://github.com/SeldonIO/alibi.git 
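If upgrading is not convenient, casting to float32 in your own wrapper should avoid the error as well (a sketch based on the lambda in your traceback):

import numpy as np

# cast the perturbed images to float32 before they reach the ONNX session
predict_fn = lambda x: m.classify(x.astype(np.float32))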

jklaise avatar Oct 27 '21 10:10 jklaise

Awesome! It worked, thanks.

But now, when calling explainer.explain(), I am getting the error: boolean index did not match indexed array along dimension 0; dimension is 100 but corresponding boolean dimension is 4

CODE:

images = load_img('a.jpeg', target_size=image_size)
images = img_to_array(images)
images /= 255
images = np.asarray(images)
print(images.shape)
print(type(images))
explanation = explainer.explain(images, threshold=.95, p_sample=.8, tau=0.50)
print(explanation.anchor)

OUTPUT:

AnchorImage(meta={
  'name': 'AnchorImage',
  'type': ['blackbox'],
  'explanations': ['local'],
  'params': {
              'custom_segmentation': False,
              'segmentation_kwargs': {
                                       'n_segments': 15,
                                       'compactness': 20,
                                       'sigma': 0.5}
                                     ,
              'p_sample': 0.5,
              'seed': None,
              'image_shape': (256, 256, 3),
              'images_background': None,
              'segmentation_fn': 'slic'}
            ,
  'version': '0.6.2dev'}
)
(256, 256, 3)
<class 'numpy.ndarray'>
skimage.measure.label's indexing starts from 0. In future version it will start from 1. To disable this warning, explicitely set the `start_label` parameter to 1.
Traceback (most recent call last):
  File "classifier.py", line 157, in <module>
    explanation = explainer.explain(images, threshold=.95, p_sample=.8, tau=0.50)
  File "/home/tg/anaconda3/envs/alibi/lib/python3.7/site-packages/alibi/explainers/anchor_image.py", line 527, in explain
    **kwargs,
  File "/home/tg/anaconda3/envs/alibi/lib/python3.7/site-packages/alibi/explainers/anchor_base.py", line 667, in anchor_beam
    (pos,), (total,) = self.draw_samples([()], min_samples_start)
  File "/home/tg/anaconda3/envs/alibi/lib/python3.7/site-packages/alibi/explainers/anchor_base.py", line 356, in draw_samples
    for i, anchor in enumerate(anchors)]
  File "/home/tg/anaconda3/envs/alibi/lib/python3.7/site-packages/alibi/explainers/anchor_base.py", line 356, in <listcomp>
    for i, anchor in enumerate(anchors)]
  File "/home/tg/anaconda3/envs/alibi/lib/python3.7/site-packages/alibi/explainers/anchor_image.py", line 129, in __call__
    covered_true = raw_data[labels][: self.n_covered_ex]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 100 but corresponding boolean dimension is 4

aniketzz avatar Oct 27 '21 11:10 aniketzz

Yikes, that looks like a possible bug in the implementation. I don't know exactly what, but something went wrong when computing labels, which is supposed to be the same size as raw_data (100 here) but apparently has length 4.

Just to clarify, is images here a single image (of shape (256, 256, 3)) ?

Could you please check the output of the following, where image is the image to be explained and images is a batch of some images (can be random)?

instance_label = explainer.predictor(image[np.newaxis, ...])[0]  # should be an integer denoting the predicted class
labels = explainer.predictor(images) == instance_label  # should be a boolean array of the same length as the batch size of `images`

jklaise avatar Oct 27 '21 11:10 jklaise

Yes, the image shape is (256,256,3) with batch size 1.

While running the instance_label = explainer.predictor(image[np.newaxis, ...])[0] command with input of shape (1,256,256,3), I got: onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Invalid rank for input: input_1:0 Got: 5 Expected: 4 Please fix either the inputs or the model.

And if I am sending (256,256,3) as image input then I got the previous error.

aniketzz avatar Oct 27 '21 11:10 aniketzz

@aniketzz the input to explainer.predictor should be the single image without the batch dimension (what you would pass to explain, i.e. of shape (256,256,3)). Which previous error do you get when you run this through explainer.predictor?

jklaise avatar Oct 27 '21 14:10 jklaise

For the below piece of code:

images = load_img('a.jpeg', target_size=image_size)
images = img_to_array(images)
images /= 255
images = np.asarray(images)
print("image: ", images.shape)
print("image type: ", type(images))
instance_label = explainer.predictor(images[np.newaxis, ...])[0]
print("instance_label: ", instance_label)
images = np.expand_dims(images, axis=0)
labels = explainer.predictor(images) == instance_label
print("labels: ", labels)

the output is:

image:  (256, 256, 3)
image type:  <class 'numpy.ndarray'>
instance_label:  0
labels:  [ True]

aniketzz avatar Oct 27 '21 15:10 aniketzz

That all seems correct, can you then index images as follows without errors?

images[labels]

jklaise avatar Oct 27 '21 15:10 jklaise

Yes, images[labels].shape shows shape (1, 256, 256, 3).

aniketzz avatar Oct 27 '21 15:10 aniketzz

Would it be possible for you to share a minimal, self-contained piece of code that re-creates the problem? I would need to run it through the debugger to see where the bug first appears. The model/data don't have to be real as long as they have the same input/output and image shapes.

jklaise avatar Oct 27 '21 15:10 jklaise

One more thing, does the output of your prediction function conform to the shape NxC where N is the number of images in the batch and C is the number of target classes?
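A quick sanity check (a small sketch, assuming a 2-class model and the image_shape passed to the explainer):

import numpy as np

batch = np.zeros((7,) + image_shape, dtype=np.float32)  # arbitrary batch size
preds = predict_fn(batch)
print(preds.shape)  # should be (7, 2) for this 2-class model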

jklaise avatar Oct 27 '21 15:10 jklaise

Yes, the output of the predict function is NxC. But I have modified it to take only a single image.

You can get the code from here.

Just place the attached code inside the NudeNet/nudenet directory.

The Onnx Model can be found here.

classifier.zip .

aniketzz avatar Oct 27 '21 15:10 aniketzz

Thanks for providing the above. I think there is something wrong with the prediction function: if you feed it a batch of images bigger than 4, it only returns predictions for the first (or last?) 4 images and not for the others. This then results in the error, as internally the explainer passes in 100 perturbed images but only receives 4 predictions back. The fix would be to modify the prediction function so that it returns all predictions for any batch size.

Edit: I see that classify has an internal variable batch_size = 4, which means that no matter what the batch size is, only the first 4 predictions will be returned. This limitation should be removed, and then the explainer should work (see the sketch below).
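For example, the batched loop could keep every chunk's predictions rather than only the first one (a rough sketch of the idea, not a drop-in replacement for your classify method):

import numpy as np

def predict_fn(images: np.ndarray) -> np.ndarray:
    """Return class probabilities for every image in the batch."""
    batch_size = 4
    input_name = nsfw_model.get_inputs()[0].name
    output_name = nsfw_model.get_outputs()[0].name
    outputs = []
    # process the whole batch in chunks and keep all predictions
    for start in range(0, len(images), batch_size):
        chunk = images[start:start + batch_size].astype(np.float32)
        outputs.append(nsfw_model.run([output_name], {input_name: chunk})[0])
    return np.concatenate(outputs, axis=0)  # shape (N, n_classes)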

jklaise avatar Oct 27 '21 16:10 jklaise

Thanks, I have removed the batch size limitations. But I am still getting the same error while running the explainer.explain() function.

onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Invalid rank for input: input_1:0 Got: 5 Expected: 4 Please fix either the inputs or the model.

Edit: Just to be sure, the input to the explainer.explain() function is an image of shape (1,256,256,3)?

aniketzz avatar Oct 28 '21 04:10 aniketzz

@aniketzz the input to explainer.explain() should be a single image without the batch dimension, so (256,256,3).
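For reference, a minimal call would look like this (reusing the load_img/img_to_array helpers from your earlier snippet):

image = load_img('a.jpeg', target_size=image_size)  # single image
image = img_to_array(image) / 255                   # shape (256, 256, 3), no batch dim
explanation = explainer.explain(image, threshold=.95, p_sample=.8, tau=0.50)
print(explanation.anchor)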

jklaise avatar Oct 28 '21 09:10 jklaise

Yes, changing the input shape to (256,256,3) seems to be working without any error, but the program is getting killed. Is it because the data (features) is too big for Alibi to process? I saw someone post on the Alibi Slack channel that Alibi worked well when the number of features was around 30 but took endless time with 300 or more.

(screenshot attached)

aniketzz avatar Oct 28 '21 09:10 aniketzz

That's strange, how long afterwards does it get killed? How big is the model (number of parameters)? Can you see what the memory usage is? The explainer should handle ImageNet-size use cases without any issues as evidenced in this example.

Another tip - for AnchorImage the number of pixels in the image is largely irrelevant, as the segmentation function divides the image into a small-ish number of patches which are then treated as "features". You could experiment with the settings and/or type of the segmentation function to check how many segments a typical image is split into and whether this is "small" (e.g. a dozen patches), for example as in the snippet below.
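For example, you can run the segmentation function on its own to see how many patches (and hence "features") the explainer will work with (a sketch using scikit-image's slic with the kwargs from the explainer params printed above):

import numpy as np
from skimage.segmentation import slic

segments = slic(image, n_segments=15, compactness=20, sigma=0.5)
print(np.unique(segments).size)  # number of patches treated as "features"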

jklaise avatar Oct 28 '21 09:10 jklaise

It gets killed after 10-20 sec. Number of parameters in the model: 20865582. The CPU memory usage is almost 90%.

EDIT: Also, I ran a similar TensorFlow-based cat classification model with input shape (299,299,3) and it's working. I am able to get the segmented images as output.

aniketzz avatar Oct 28 '21 09:10 aniketzz

I can run the example to explain a randomly generated image of the right shape, which takes ~18 seconds and reaches ~16% of memory (32GB machine, so ~5GB, which is high but not unexpected). We may need to do some work to see if this can be optimized.

Do you have an exit code for why the process was killed? It sounds like it could be an issue with not having enough RAM; can you check what your memory usage is when running the demo example with Inception and ImageNet?

jklaise avatar Oct 28 '21 12:10 jklaise

The ImageNet model takes almost 4GB of RAM out of 8GB. The current ONNX model reaches 100% RAM consumption before it gets killed. So I think it might be due to insufficient RAM. I will try it on a system with more RAM and will get back to you. Thanks.

aniketzz avatar Oct 28 '21 12:10 aniketzz

Hi @jklaise, I tried the same code on a 16GB RAM system and it worked; I got the explainer segmentation. The RAM consumption was ~9.62 GB and it took ~20 sec to display the final output. Thanks a lot.

Also, I have a question: I am also using a Kubeflow cluster where I have this same model running in a pod. I am able to curl and grpcurl the prediction endpoint, and I have also loaded explainer.dill for this model in the same pod. When I try to send the image using curl to the explainer, it gives an error: payload too large. However, how do I use grpcurl with the explainer?

Here is the curl command:

cmd = f"""curl -k -d @explain_nsfw.txt \
   -X POST https://ea.tensorgo.com/seldon/kubeflow-user-tensorgo-com/vision-models-explainer/v2/models/nsfw-classifier-1/explain \
   -H "Content-Type: application/json" \
   -H 'Cookie: authservice_session=MTYzNTQwOTMwNXxOd3dBTkVsTVVqVlhOemMwVTFFelNGUkpVVmhJUWtKT1Z6WlpOREpYV0VrMlYwTTJWRVJJTms5S1dVOVlVRmd5TmtSU1dWcEpWMUU9fB16lRDQxT61LLGi9eDX3l1s7EB_v58g4mRVb0e09K7B'
   """

Grpcurl command:

cmd1 = f"""/home/tg/Aniket/grpcurl -d @ \
-insecure -H="Cookie: authservice_session=MTYzNDcwOTAxN3xOd3dBTkVkWFRFZEJTRk5MUTFWWldra3pSRmszV2xZeVREWkdRbEUzTkVGSVVsQlRVMEV6VWsxR01scFNUbEZWUkZSV1NFMUxTMUU9fNldsPs3f_alE6RPYpvrr2TDAZPcllRvV8Y1exNBZMV6" -proto ./grpc_service.proto \
-rpc-header seldon:vision-models -rpc-header namespace:kubeflow-user-tensorgo-com \
ea.tensorgo.com:443  \
inference.GRPCInferenceService/ModelInfer < images-explain.txt"""

What am I supposed to use in place of inference.GRPCInferenceService/ModelInfer in the grpcurl command?

aniketzz avatar Oct 29 '21 06:10 aniketzz

Hi @aniketzz, how are you wrapping the Alibi explainer so it's exposed for REST? At present, in Seldon Core the built-in explainers for models only expose HTTP endpoints. However, maybe you are creating a custom server for your Alibi model?

ukclivecox avatar Nov 01 '21 11:11 ukclivecox

Hi @cliveseldon, I will use the HTTP endpoints, but how do I fix the payload too large error?

aniketzz avatar Nov 01 '21 12:11 aniketzz

What server are you using to expose an HTTP endpoint?

ukclivecox avatar Nov 01 '21 12:11 ukclivecox

I am using Triton and KFServing.

aniketzz avatar Nov 01 '21 12:11 aniketzz

This could be an Istio setting issue with Envoy: https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/

ukclivecox avatar Nov 01 '21 12:11 ukclivecox

I am using the default Istio settings, and I am also using Kubeflow for deployment. I am able to run the prediction function with the same Istio settings but get an error while running the explainer. Is there any specific setting that needs to be configured?

aniketzz avatar Nov 01 '21 12:11 aniketzz