
Feature request: Classifier uncertainty

[Open] jo-mueller opened this issue 2 years ago • 14 comments

This idea is very much borrowed from Ilastik, which shows an uncertainty overlay on top of the output classification. I did a quick search, and it seems like uncertainty for random forest classifiers is simply based on the difference between the two highest class probabilities (see also here).

So for a background/foreground segmentation, this would mean the following.

If - for a single pixel - 90 trees vote for class 1 and 10 trees vote for class 2, the uncertainty would be 1 - (0.9 - 0.1) = 0.2
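For illustration, a minimal numpy sketch of that margin-based arithmetic (this is just the definition above spelled out, not code from ilastik or apoc):

import numpy as np

# per-class probabilities for the example above: 90 of 100 trees vote for
# class 1, 10 trees vote for class 2
probabilities = np.array([0.9, 0.1])

# margin-based uncertainty: one minus the gap between the two highest probabilities
sorted_p = np.sort(probabilities)[::-1]
uncertainty = 1 - (sorted_p[0] - sorted_p[1])
print(uncertainty)  # ~0.2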

LMKWYT :)

jo-mueller avatar Jul 20 '22 14:07 jo-mueller

It's a very cool idea @jo-mueller! It's just tricky to implement with apoc. In short: each OpenCL kernel (that is derived from a random forest) produces exactly one output: the ObjectSegmenter produces a label image, the ProbabilityMapper produces a probability map. Thus, we would need an UncertaintyMapper that produces an uncertainty map. And I'm not exactly sure how to make it accessible in a user-friendly way. Question: what would you do next with the uncertainty map?

haesleinhuepf avatar Jul 20 '22 17:07 haesleinhuepf

And I'm not exactly sure how to make it accessible in a user-friendly way.

As far as that is concerned, I would deliver it as an image layer on top of the resulting label layer. You could then toggle it on/off, but by default it would be produced on the fly.

Question: what would you do next with the uncertainty map?

I would see this only as a convenience feature for user-friendliness and not really as something to be used in downstream analysis. I find that it gives the user (well, me) a good idea of where the classifier is currently performing well and where it isn't - and thus also where additional annotations would be best placed. It would also give a more intuitive insight into the effects of the selected features.

jo-mueller avatar Jul 20 '22 20:07 jo-mueller

Ok, that sounds to me like we should build it into the PixelClassifier, similar to the ProbabilityMapper, so that we can use the same trained Random Forest but generate different OpenCL code from it - code that creates the uncertainty map. How about a Certainty map? Wouldn't that be more intuitive? Is it legit to compute certainty?

In napari-apoc, we could then build in that uncertainty map generation as a checkbox on the training widget. Does that make sense?
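As a stopgap before any UncertaintyMapper exists, the two-class case could already be handled on the Python side, because the two class probabilities sum to one. A rough sketch, assuming a classifier has already been trained and saved as classifier.cl, and that the ProbabilityMapper is constructed roughly like this (the file name and argument names are assumptions, check the apoc docs):

import numpy as np
import apoc

# load the already-trained forest as a probability mapper for class 1
# ("classifier.cl" and the constructor arguments are assumptions, not
# taken from this thread; `image` is the raw input image)
mapper = apoc.ProbabilityMapper(opencl_filename="classifier.cl",
                                output_probability_of_class=1)
p1 = np.asarray(mapper.predict(image=image))

# two classes: p2 = 1 - p1, so the margin is |p1 - p2| = |2 * p1 - 1|
uncertainty = 1.0 - np.abs(2.0 * p1 - 1.0)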

haesleinhuepf avatar Jul 20 '22 20:07 haesleinhuepf

so that we can use the same trained Random Forest

Exactly - this shouldn't be a separate feature/plugin in apoc, but rather appear along with the semantic segmentation.

How about a Certainty map?

Both Uncertainty and Certainty are legit; Certainty would just be 1-Uncertainty. The only reason I lean more towards an uncertainty map is that I would hope for regions of high certainty to be more prevalent than regions of high uncertainty, so using uncertainty would better highlight the problematic zones. But once we are able to generate these maps, we can play with them and see what we like more :)

In napari-apoc, we could then build in that uncertainty map generation as a checkbox on the training widget. Does that make sense?

That's how I was envisioning it :)

jo-mueller avatar Jul 21 '22 12:07 jo-mueller

Certainty would just be 1-Uncertainty

I'm not sure about this. If I'm certain, that's sufficient. Not being uncertain is necessary but might not be sufficient.

haesleinhuepf avatar Jul 22 '22 14:07 haesleinhuepf

I was just referring to the definition from above, which is probably not mathematically strict but gives reasonable information about whether the classifier is having trouble. For this, the following relationship would hold:

Uncertainty = 1 - (highest_probability - 2nd_highest_probability)

If a pixel had two similarly high probabilities for two possible classes, the uncertainty would be close to one (and the certainty, 1 - uncertainty, close to zero). I could imagine that this becomes more and more problematic with many classes, though. I would really only use it as a first-order approximation of local performance.
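To illustrate the multi-class concern, a tiny numpy sketch of how the margin-based definition behaves with three classes (just for intuition, not an implementation proposal):

import numpy as np

def margin_uncertainty(probabilities):
    # 1 - (highest - second highest probability)
    sorted_p = np.sort(probabilities)[::-1]
    return 1 - (sorted_p[0] - sorted_p[1])

# two confusable classes out of three: maximal uncertainty
print(margin_uncertainty(np.array([0.45, 0.45, 0.10])))  # 1.0
# one dominant class: low uncertainty; how the remaining probability is
# distributed over the other classes is ignored
print(margin_uncertainty(np.array([0.90, 0.05, 0.05])))  # ~0.15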

jo-mueller avatar Jul 24 '22 18:07 jo-mueller

Hey everyone! I would also find this useful. Is it on the roadmap? Anything I can do to help?

kevinyamauchi avatar Dec 17 '22 01:12 kevinyamauchi

Anything I can do to help?

I'd need a mathematically reliable definition of [un]certainty... The ilastik definition discussed above is not very convincing to me. But if someone comes up with a paper where it's explained/justified, I'd be open to implementing it.

haesleinhuepf avatar Dec 17 '22 05:12 haesleinhuepf

Fair enough. I looked around and could only find similar implementations (e.g., sklearn), but no real explanations 😬 . I'll circle back if I find anything.

kevinyamauchi avatar Dec 17 '22 07:12 kevinyamauchi

I'm also ok if we implement it the way ilastik and sklearn do. I would just like to avoid introducing new methods for this that are not appropriate...

haesleinhuepf avatar Dec 17 '22 07:12 haesleinhuepf

How about confidence intervals? https://contrib.scikit-learn.org/forest-confidence-interval/ https://jmlr.org/papers/v15/wager14a.html

haesleinhuepf avatar Dec 17 '22 08:12 haesleinhuepf

I believe this is the sklearn implementation:

On the RandomForestClassifier class (ensemble):

https://github.com/scikit-learn/scikit-learn/blob/9017c701833114a75903f580dd0772e1d8d7d125/sklearn/ensemble/_forest.py#L839-L885

The individual decision tree method:

https://github.com/scikit-learn/scikit-learn/blob/9017c701833114a75903f580dd0772e1d8d7d125/sklearn/tree/_classes.py#L595-L943
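For reference, a margin-based uncertainty map can be computed directly from that implementation's predict_proba, which averages the per-tree class probabilities. A minimal sketch (clf, x_test and image_shape are placeholders: a fitted RandomForestClassifier, one feature row per pixel, and the original image shape):

import numpy as np

# clf: fitted RandomForestClassifier; x_test: one feature vector per pixel
proba = clf.predict_proba(x_test)               # shape: (n_pixels, n_classes)

# one minus the gap between the two highest class probabilities, per pixel
sorted_proba = np.sort(proba, axis=1)[:, ::-1]  # descending order per pixel
uncertainty = 1 - (sorted_proba[:, 0] - sorted_proba[:, 1])
uncertainty_map = uncertainty.reshape(image_shape)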

kevinyamauchi avatar Dec 18 '22 20:12 kevinyamauchi

How about confidence intervals? https://contrib.scikit-learn.org/forest-confidence-interval/ https://jmlr.org/papers/v15/wager14a.html

I'll take a look!

kevinyamauchi avatar Dec 18 '22 20:12 kevinyamauchi

Hi @kevinyamauchi @haesleinhuepf ,

How about confidence intervals? https://contrib.scikit-learn.org/forest-confidence-interval/ https://jmlr.org/papers/v15/wager14a.html

very cool find! I wrote a quick notebook (rf classifier from sklearn; raw image, gaussian (sigma=2) & sobel as features) to see what it does to image segmentation and got this for blobs.tif:

[image: result]

Here's the code for that:

from sklearn.ensemble import RandomForestClassifier
from skimage import filters, io
import napari
import numpy as np
import forestci as fci

# load data
image = io.imread(r'C:\Users\johan\Desktop\blobs.tif')
labels = io.imread(r'C:\Users\johan\Desktop\labels.tif')

# create simple features: smoothed image and edge map
feature1 = filters.gaussian(image, sigma=2)
feature2 = filters.sobel(image)

# training data: feature vectors of all annotated pixels plus their labels
x_train = np.stack([image[labels != 0],
                    feature1[labels != 0],
                    feature2[labels != 0]]).T
y_train = labels[labels != 0]

# test data: feature vectors of every pixel in the image
x_test = np.stack([image.flatten(),
                   feature1.flatten(),
                   feature2.flatten()]).T

# train RF classifier (max_features must not exceed the number of features, 3)
n_trees = 500
classifier = RandomForestClassifier(max_features=3, n_estimators=n_trees,
                                    random_state=42)
classifier.fit(x_train, y_train)

# calculate per-pixel confidence-interval estimates and reshape to the image shape
error = fci.random_forest_error(classifier, x_train, x_test)
error = error.reshape(image.shape)

# visualize in napari
viewer = napari.Viewer()
viewer.add_image(image)
viewer.add_labels(labels)
viewer.add_image(error, blending='additive', colormap='inferno')

What mildly irritates me is that the error never seems to drop below a certain threshold, i.e. the confidence interval is 0.00879 everywhere, both inside and outside of the cells.

jo-mueller avatar Dec 19 '22 13:12 jo-mueller