
LIME for multiclass/multilabel explanations?

Open Hellisotherpeople opened this issue 4 years ago • 3 comments

I implemented a multilabel prediction algorithm for NLP text classification.

Basically, I use MultiLabelBinarizer, binary_crossentropy, and a final sigmoid activation. Since I'm not using a softmax on the output neurons, the model is allowed to predict any combination of classes.

This means that the probabilities returned by my "predict_proba" function do not sum to 1, and this causes issues with LIME.
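For concreteness, here is a toy illustration (made-up numbers, not from the model above) of why sigmoid outputs break that assumption: each label's probability is computed independently, so a row of `predict_proba` can sum to anything from 0 up to the number of labels.

```python
import numpy as np

# Hypothetical per-label sigmoid outputs for a single instance.
# Each entry is an independent P(label = 1), so the row is not a
# probability distribution over classes.
sigmoid_probs = np.array([0.9, 0.8, 0.1, 0.05])

row_sum = sigmoid_probs.sum()
print(row_sum)  # ~1.85, not 1.0 as LIME expects
```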

Is there a way around this? Can LIME work for Multilabel classification models?


I also tried a simpler multilabel version of my previous problem, where I guarantee that I have the same number of labels for each instance. My predict_proba function gives a list of arrays, where each array holds the probability values for picking a particular label (and these do sum to 1). Can ELI5 handle this kind of data? Shouldn't it be easy to write a wrapper to handle this?

Hellisotherpeople avatar Sep 06 '19 09:09 Hellisotherpeople

I AM A GOD (not really, but this crazy idea of mine just worked!!!!!)

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MultiLabelProbClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, clf):
        self.clf = clf

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self  # scikit-learn convention: fit returns self

    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        # Normalize each row of per-label probabilities so it sums to 1,
        # which is what LIME expects from a classifier.
        probas = np.asarray(self.clf.predict_proba(X))
        if len(X) == 1:  # single instance: return one normalized row
            row = probas[0]
            return row / row.sum()
        return probas / probas.sum(axis=1, keepdims=True)



the_model = MultiLabelProbClassifier(model)
pipe = Pipeline([('text2vec', Text2Vec()), ('model', the_model)])
pipe.fit(X_train, Y_train)

pred = pipe.predict(X_val)


from eli5.formatters import format_as_text, format_as_html

te = TextExplainer(random_state=42, n_samples=300, position_dependent=True)

def explain_pred(sentence):
    te.fit(sentence, pipe.predict_proba)
    t_pred = te.explain_prediction()
    #t_pred = te.explain_prediction(top = 20, target_names=["ANB", "CAP", "ECON", "EDU", "ENV", "EX", "FED", "HEG", "NAT", "POL", "TOP", "ORI", "QER","COL","MIL", "ARMS", "THE", "INTHEG", "ABL", "FEM", "POST", "PHIL", "ANAR", "OTHR"])
    txt = format_as_text(t_pred)
    html = format_as_html(t_pred)
    with open("latest_prediction.html", "a+") as html_file:
        html_file.write(html)
    print(te.metrics_)

Basic idea is to take a set of probabilities that don't sum to 1 and force them to sum to 1 by normalizing. For example, if my probabilities sum to 1.985, I divide each item in the probability list by 1.985.

Now, ELI5 / TextExplainer / LIME give me explanations for each label EVEN IN MULTILABEL output problems. All a user has to do is multiply the LIME-predicted output by the sum of the original probabilities to recover the real probabilities.
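A minimal sketch of that round trip with made-up numbers (the normalization the wrapper applies, and the multiplication a user would do afterwards):

```python
import numpy as np

# Hypothetical sigmoid outputs for one instance.
raw = np.array([0.9, 0.8, 0.2, 0.085])
sum_probabilities = raw.sum()  # ~1.985 here

# What the wrapper feeds to LIME: a proper distribution summing to 1.
normalized = raw / sum_probabilities

# What the user does afterwards to recover the real per-label probabilities.
recovered = normalized * sum_probabilities
```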

Maybe someone should add this as a tutorial or a PR into ELI5.

Hellisotherpeople avatar Sep 08 '19 17:09 Hellisotherpeople

One thing to do is to tell the user what the "sum_probabilities" number is somewhere within the HTML / text explain_pred output (or just do that multiplication for them afterwards).

Hellisotherpeople avatar Sep 08 '19 17:09 Hellisotherpeople

Hi @Hellisotherpeople, I have a very similar issue to yours. I have medical documents, each of which corresponds to several ICD codes (labels). There are more than 8000 ICD codes in total, and each document corresponds to about 5 to 10 of them, so I use multi-hot encoding with a sigmoid output. How is your project going now? Could you give me some suggestions on my work? Thanks : )

deweihu96 avatar Feb 24 '21 00:02 deweihu96