Getting wrong Result with ClassifierDrift

Open karthik-v-b opened this issue 2 years ago • 7 comments

Hello,

I have used ClassifierDrift to detect the drift in the image with my own dataset.

The reference data is

[image: Ntest (2) - Copy - Copy]

and the data given for prediction is

[image: imageedit_pixelate - Copy (2)], which is a pixelated version of the reference image

I used cv2 to read the pixel values of the images and passed them in as the reference and corrupted data, with the same model:

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Dense, Flatten, Input

tf.random.set_seed(0)

model = tf.keras.Sequential(
    [
        Input(shape=(32, 32, 3)),
        Conv2D(8, 4, strides=2, padding='same', activation=tf.nn.relu),
        Conv2D(16, 4, strides=2, padding='same', activation=tf.nn.relu),
        Conv2D(32, 4, strides=2, padding='same', activation=tf.nn.relu),
        Flatten(),
        Dense(2, activation='softmax')
    ]
)

but the result I am getting is that there is no drift.

karthik-v-b avatar Jul 01 '22 03:07 karthik-v-b

Hi again @karthik-v-b, before delving more into the details, can I just check that you are in fact passing in a reference data set and a test data set? i.e. batches of instances and not single instances. Just wanted to check as you refer to "image" and not "images".

ascillitoe avatar Jul 01 '22 15:07 ascillitoe

Hi @ascillitoe, here is the code snippet for your reference

import numpy as np
import tensorflow as tf
from alibi_detect.cd import ClassifierDrift
import cv2
import glob

image_array = []

files = glob.glob("C:/Users/KART/OneDrive/Documents/AI/USDD/MNIST7/Train/*.jpg")
for myFile in files:
    image = cv2.imread(myFile)
    image_array.append(image)

image_array = np.array(image_array)
image_array = image_array.astype('float32')/255

image_array_noisy = []

file = glob.glob(r"C:\Users\KART\OneDrive\Documents\AI\USDD\MNIST7\Test*.jpg")
for myFiles in file:
    images = cv2.imread(myFiles)
    image_array_noisy.append(images)

image_array_noisy = np.array(image_array_noisy)
image_array_noisy = image_array.astype('float32')/255

image_array_noisy.shape[1:len(image_array_noisy)]

from tensorflow.keras.layers import Conv2D, Dense, Flatten, Input

#tf.random.set_seed(0)

model = tf.keras.Sequential(
    [
        Input(shape = image_array_noisy.shape[1:len(image_array_noisy)]),
        Conv2D(8, 4, strides=2, padding='same', activation=tf.nn.relu),
        Conv2D(16, 4, strides=2, padding='same', activation=tf.nn.relu),
        Conv2D(32, 4, strides=2, padding='same', activation=tf.nn.relu),
        Flatten(),
        Dense(2, activation='softmax')
    ]
)

cd = ClassifierDrift(image_array, model, p_val=0.05, epochs=5)

labels = ['No!', 'Yes!']
pred = cd.predict(image_array_noisy)
print('Drift? {}'.format(labels[pred['data']['is_drift']]))
print(f'distance: {pred["data"]["distance"]:.3f}')


The train folder consists of 5 .jpg files of the same image: [image: Ntest (2) - Copy - Copy]

and the test folder consists of 5 .jpg files of the same image, attached below: [image: imageedit_pixelate - Copy (2) - Copy]

karthik-v-b avatar Jul 04 '22 05:07 karthik-v-b

Hi @karthik-v-b, please see this gist for a version of your notebook that should now work as intended. I've made a number of changes here:

  • You had a typo, image_array_noisy = image_array.astype('float32')/255, so your test images were actually a scaled version of your reference images.
  • After fixing the above, I noticed your two images were of different resolutions. The resolutions are also prohibitively high, so I have rescaled both to 256x256 pixels.
  • Reference and test sets of only 5 instances are too small to realistically train the convolutional neural network. I've increased this to 100 instances, but even this is a little small really...
  • I've tweaked the dimensions of the CNN a bit, as the 256x256 images are higher resolution than the images in the CIFAR10 example I assume you got the original CNN model from.
  • Running this with verbose=1 you can see the CNN is now being successfully trained to classify between reference and test data (loss_ma is reducing).
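The first two problems in the list above (a test array accidentally derived from the reference array, and mismatched resolutions) can be caught before the detector is built. The following is a minimal sketch, assuming the data are NumPy arrays with the instance index on the first axis. The helper name check_drift_inputs is hypothetical and is not part of alibi-detect.

```python
import numpy as np


def check_drift_inputs(x_ref, x_test):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []

    # Reference and test instances must share height, width and channels.
    if x_ref.shape[1:] != x_test.shape[1:]:
        problems.append(
            f"instance shapes differ: {x_ref.shape[1:]} vs {x_test.shape[1:]}"
        )
        return problems

    # If both arrays have the same number of instances, check whether the test
    # array is just a scalar multiple of the reference array (this also covers
    # the case where the two arrays are identical).
    if x_ref.shape == x_test.shape:
        a = x_ref.ravel().astype(np.float64)
        b = x_test.ravel().astype(np.float64)
        denom = a @ a
        if denom > 0:
            k = (a @ b) / denom  # least-squares scale factor
            if np.allclose(b, k * a):
                problems.append(
                    f"test data equals the reference data scaled by {k:.6g}"
                )

    return problems
```

Calling check_drift_inputs(image_array, image_array_noisy) on the arrays from the original snippet would be expected to report the scaling problem, since the test array there was built from image_array.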

Note that the classifier model is trained on the combined reference and test data (X_ref and X_test in my notebook) when predict is called. You can investigate whether your model is successful at this by running cd._detector.model(X_ref) or cd._detector.model(X_test) and examining the results (this must be done after the detector's predict is called).

On a side note, will you have more images in practice? Or do you really only want to determine whether single images are similar? If you want to do the latter you might be better off using some sort of similarity metric rather than framing this as a drift detection problem. Drift detection aims to determine whether two datasets are sampled from the same statistical distribution, rather than determining whether two single instances are the same.
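As an illustration of the similarity-metric route for single images, here is a minimal sketch using mean squared error and peak signal-to-noise ratio. These two metrics are an assumption chosen for simplicity (no specific metric is named above), and both images are assumed to have the same shape with pixel values scaled to [0, 1].

```python
import numpy as np


def mse(img_a, img_b):
    """Mean squared error between two images of identical shape."""
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return float(np.mean((a - b) ** 2))


def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(img_a, img_b)
    if err == 0:
        return float("inf")
    return float(10 * np.log10(max_val ** 2 / err))
```

A lower MSE (or a higher PSNR) means the two images are closer pixel by pixel. Images of different resolutions would need to be resized to a common shape first.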

ascillitoe avatar Jul 06 '22 09:07 ascillitoe

I am unable to feed a large amount of data while training the model. Could you please help me? Thanks

sathyagorla avatar Jul 29 '22 18:07 sathyagorla

Hi @sathyagorla, to deal with large datasets you would ordinarily process your data in batches via the batch_size kwargs. Exactly where and how you do this depends on your dataset, model, and choice of detector. Can you give some more information about what you are trying to do, please? If it isn't related to @karthik-v-b's query above, please open a new issue for this.
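To show the general idea of batching, independent of any particular detector, here is a small generic sketch. The helper names iter_batches and batched_apply are hypothetical and not part of alibi-detect, which handles batching internally through its batch_size arguments.

```python
import numpy as np


def iter_batches(x, batch_size):
    """Yield consecutive slices of x with at most batch_size instances each."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size]


def batched_apply(fn, x, batch_size=32):
    """Apply fn to x one batch at a time and stitch the outputs back together."""
    return np.concatenate([fn(batch) for batch in iter_batches(x, batch_size)])
```

Only one batch is passed to fn at a time, so peak memory use inside fn depends on batch_size rather than on the size of the full dataset.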

ascillitoe avatar Aug 02 '22 14:08 ascillitoe

Hi @ascillitoe, thanks for your reply. I am now able to run the code on large datasets using batches. Here we are training the model, but we are not using the trained model in ClassifierDrift. Also, in this code, https://github.com/SeldonIO/alibi-detect/blob/master/alibi_detect/cd/tensorflow/classifier.py, we are checking for drift between the test data points only, not between the train data and the test data. Could you please help me with this? Thanks

sathyagorla avatar Aug 04 '22 07:08 sathyagorla

Hi @sathyagorla, glad to hear you are able to run on large datasets now 👍🏻

Regarding your follow-up question, I'm afraid I am not entirely clear about what you are asking. The ClassifierDrift detector combines the reference set given via ClassifierDrift(x_ref) with the test set given via .predict(x). This combined dataset is then used to train a classifier, with the labels denoting whether each instance belongs to the reference or test set. The end result is a classifier trained to predict the probability that a given instance belongs to the test set. If the probabilities it assigns to unseen test instances are significantly higher than those it assigns to unseen reference instances, then the test set must differ from the reference set and drift is flagged.
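The procedure described above can be sketched in plain NumPy. This is a simplified stand-in and not alibi-detect's implementation: it assumes a logistic-regression classifier instead of a CNN, a single held-out split, and a permutation test on the difference in mean predicted probability. The function name and all of its parameters are illustrative assumptions. Inputs are expected as 2-D arrays of shape (n_instances, n_features), so images would need to be flattened first.

```python
import numpy as np


def classifier_two_sample_test(x_ref, x_test, n_perm=500, seed=0):
    """Return (gap, p_value) for a simplified classifier-based drift test."""
    rng = np.random.default_rng(seed)

    # Label reference instances 0 and test instances 1, then shuffle together.
    x = np.concatenate([x_ref, x_test]).astype(np.float64)
    y = np.concatenate([np.zeros(len(x_ref)), np.ones(len(x_test))])
    order = rng.permutation(len(x))
    x, y = x[order], y[order]

    # Hold out a quarter of the combined data; the classifier never trains on it.
    n_held = len(x) // 4
    x_ho, y_ho = x[:n_held], y[:n_held]
    x_tr, y_tr = x[n_held:], y[n_held:]

    # Standardise features using training statistics only.
    mu, sd = x_tr.mean(axis=0), x_tr.std(axis=0) + 1e-8
    x_tr, x_ho = (x_tr - mu) / sd, (x_ho - mu) / sd

    # Fit logistic regression with plain gradient descent.
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(300):
        p = 1.0 / (1.0 + np.exp(-(x_tr @ w + b)))
        grad = p - y_tr
        w -= 0.1 * (x_tr.T @ grad) / len(y_tr)
        b -= 0.1 * grad.mean()

    # Predicted probability of "belongs to the test set" on held-out instances.
    probs = 1.0 / (1.0 + np.exp(-(x_ho @ w + b)))
    gap = probs[y_ho == 1].mean() - probs[y_ho == 0].mean()

    # Permutation test: how often does a random relabelling give a gap this large?
    count = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y_ho)
        stat = probs[y_perm == 1].mean() - probs[y_perm == 0].mean()
        count += int(stat >= gap)
    p_value = (count + 1) / (n_perm + 1)

    return float(gap), float(p_value)
```

A small p-value means the held-out test instances received noticeably higher probabilities than the held-out reference instances, which is the signal the detector flags as drift. With very few instances per set, the held-out split is too small for this comparison to be meaningful.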

Going back to your question, I am not sure what you mean by checking drift between test points only? What is the reference set that you wish to check for drift against? It sounds like you are more interested in measuring the similarity between test instances, or perhaps checking for outliers within your test set?

ascillitoe avatar Aug 05 '22 11:08 ascillitoe