apoc
Feature request: Classifier uncertainty
This idea is very much borrowed from Ilastik, which shows an uncertainty
overlay on top of the output classification. I did a quick search, and it seems like uncertainty for random forest classifiers is simply one minus the difference of the two highest class probabilities (see also here).
So for a background/foreground segmentation, this would mean the following: if, for a single pixel, 90 trees vote for class 1 and 10 trees vote for class 2, the uncertainty would be 1 - (0.9 - 0.1) = 0.2.
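As a minimal sketch of that definition (plain NumPy, independent of apoc; the probability values below are just made-up examples):

```python
import numpy as np

# per-pixel class probabilities, e.g. the fraction of trees voting for each class;
# shape (n_pixels, n_classes)
probabilities = np.array([[0.9, 0.1],     # the 90/10 example from above
                          [0.55, 0.45]])  # an ambiguous pixel

# sort per pixel and take one minus the margin between the two highest classes
sorted_p = np.sort(probabilities, axis=1)
uncertainty = 1 - (sorted_p[:, -1] - sorted_p[:, -2])
print(uncertainty)  # -> approx. [0.2 0.9]
```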
LMKWYT :)
It's a very cool idea @jo-mueller! It's just tricky to implement with apoc. In short: each OpenCL kernel (derived from a random forest) produces exactly one thing: the ObjectSegmenter produces a label image, the ProbabilityMapper produces a probability map image. Thus, we would need an UncertaintyMapper that produces an uncertainty map. And I'm not exactly sure how to make it accessible in a user-friendly way. Question: what would you do next with the uncertainty map?
And I'm not exactly sure how to make it accessible in a user-friendly way.
Where this is concerned, I would deliver it as an image layer on top of the resulting label layer. You could then toggle it on/off, but by default it would be produced on the fly.
Question: what would you do next with the uncertainty map?
I would see this only as a convenience feature for user-friendliness and not really as something to be used in downstream analysis. I find that it gives the user (well, me) a good idea of where the classifier is currently performing well and where it isn't, and thus also where more or fewer annotations would best be placed. It would also give a more intuitive insight into the effects of the selected features.
Ok, that sounds to me like we should build it into the PixelClassifier, similar to the ProbabilityMapper, so that we can use the same trained Random Forest but generate different OpenCL code from it: code that creates the uncertainty map. How about a Certainty map? Wouldn't that be more intuitive? Is it legit to compute certainty?
In napari-apoc, we could then build in that uncertainty map generation as a checkbox on the training widget. Does that make sense?
so that we can use the same trained Random Forest
Exactly - this shouldn't be a separate feature/plugin in apoc, but rather appear along with the semantic segmentation.
How about a Certainty map?
Both Uncertainty and Certainty are legit; Certainty would just be 1-Uncertainty. The only reason for me to lean more towards an uncertainty map is that I would hope for regions of high certainty to be more prevalent than regions of high uncertainty. Using uncertainty would maybe better highlight the problematic zones. But once we are able to generate these maps, we can play with them and see what we like more :)
In napari-apoc, we could then build in that uncertainty map generation as a checkbox on the training widget. Does that make sense?
That's how I was envisioning it :)
Certainty would just be 1-Uncertainty
I'm not sure about this. If I'm certain, that's sufficient. Not being uncertain is necessary but might not be sufficient.
I was just referring to the definition from above, which is probably not mathematically strict but gives reasonable information about whether the classifier is having trouble. For this, the following relationship would hold:
Uncertainty = 1 - (highest_probability - 2nd_highest_probability)
If a pixel had two similarly high probabilities for two possible classes, the uncertainty would be close to one, and 1 - uncertainty (i.e. the certainty) would be close to zero. I could imagine that this becomes more and more problematic for many classes, though. I would really only use it as a first-order approximation of local performance.
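To make that concern concrete, here is a small numeric sketch (plain NumPy, made-up probability vectors) of how the margin-based definition behaves with two vs. many classes:

```python
import numpy as np

def margin_uncertainty(probabilities):
    """1 - (highest - 2nd highest probability), as defined above."""
    p = np.sort(probabilities)
    return 1 - (p[-1] - p[-2])

# two classes, clear decision: low uncertainty
print(margin_uncertainty([0.9, 0.1]))                     # ~0.2

# two classes, ambiguous: uncertainty close to 1, certainty close to 0
print(margin_uncertainty([0.55, 0.45]))                   # ~0.9

# five classes, one clear winner: still reported as quite "uncertain",
# because only the two highest probabilities are compared
print(margin_uncertainty([0.4, 0.15, 0.15, 0.15, 0.15]))  # ~0.75
```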
Hey everyone! I would also find this useful. Is it on the roadmap? Anything I can do to help?
Anything I can do to help?
I'd need a mathematically reliable definition of [un]certainty... The ilastik definition discussed above is not very convincing to me. But if someone comes up with a paper where it's explained/justified, I'd be open to implementing it.
Fair enough. I looked around and could only find similar implementations (e.g., sklearn), but no real explanations 😬 . I'll circle back if I find anything.
I'm also ok if we implement it the way ilastik and sklearn do. I would just like to avoid new methods for this which are not appropriate...
How about confidence intervals? https://contrib.scikit-learn.org/forest-confidence-interval/ https://jmlr.org/papers/v15/wager14a.html
I believe this is the sklearn implementation:
On the RandomForestClassifier class (ensemble):
https://github.com/scikit-learn/scikit-learn/blob/9017c701833114a75903f580dd0772e1d8d7d125/sklearn/ensemble/_forest.py#L839-L885
The individual decision tree method:
https://github.com/scikit-learn/scikit-learn/blob/9017c701833114a75903f580dd0772e1d8d7d125/sklearn/tree/_classes.py#L595-L943
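If it helps, here is a small standalone sketch (toy data from sklearn, not the linked source itself) of what those two methods boil down to: the forest's predict_proba is the mean of the per-tree class probabilities, from which the margin-based uncertainty above could be computed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy data instead of image features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# average the per-tree class probabilities by hand ...
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_proba = per_tree.mean(axis=0)

# ... which matches the forest's own predict_proba
assert np.allclose(manual_proba, forest.predict_proba(X))

# margin-based uncertainty from the averaged probabilities
sorted_proba = np.sort(manual_proba, axis=1)
uncertainty = 1 - (sorted_proba[:, -1] - sorted_proba[:, -2])
```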
How about confidence intervals? https://contrib.scikit-learn.org/forest-confidence-interval/ https://jmlr.org/papers/v15/wager14a.html
I'll take a look!
Hi @kevinyamauchi @haesleinhuepf ,
How about confidence intervals? https://contrib.scikit-learn.org/forest-confidence-interval/ https://jmlr.org/papers/v15/wager14a.html
very cool find! I wrote a quick notebook (RF classifier from sklearn, raw image, gaussian (sigma=2) & sobel as features) to see what it does to image segmentation and got this for blobs.tif:
Here's the code for that:
from sklearn.ensemble import RandomForestClassifier
from skimage import filters, io
import napari
import numpy as np
import forestci as fci

# load data
image = io.imread(r'C:\Users\johan\Desktop\blobs.tif')
labels = io.imread(r'C:\Users\johan\Desktop\labels.tif')

# create simple features: gaussian blur and sobel edges (plus the raw image)
feature1 = filters.gaussian(image, sigma=2)
feature2 = filters.sobel(image)

# training data: only the annotated (non-zero label) pixels
x_train = np.stack([image[labels != 0],
                    feature1[labels != 0],
                    feature2[labels != 0]]).T
y_train = labels[labels != 0]

# test data: all pixels of the image
x_test = np.stack([image.flatten(),
                   feature1.flatten(),
                   feature2.flatten()]).T

# train RF classifier (max_features=3 uses all three provided features per split)
n_trees = 500
classifier = RandomForestClassifier(max_features=3, n_estimators=n_trees,
                                    random_state=42)
classifier.fit(x_train, y_train)

# calculate per-pixel variance estimates with forest-confidence-interval
error = fci.random_forest_error(classifier, x_train, x_test)
error = error.reshape(image.shape)

# visualize
viewer = napari.Viewer()
viewer.add_image(image)
viewer.add_labels(labels)
viewer.add_image(error, blending='additive', colormap='inferno')
What mildly irritates me is that the error never seems to drop below a certain threshold, i.e. the confidence interval is 0.00879 everywhere, both inside and outside the cells.
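If it's useful for narrowing that down, a quick diagnostic (assuming the `error` array from the snippet above is still in memory) would be to check whether 0.00879 is really a hard floor or just the most common value:

```python
import numpy as np

# `error` is the per-pixel variance array produced by forestci above
print("min:", error.min(), "max:", error.max())

# fraction of pixels sitting at (or numerically very close to) the minimum value;
# a large fraction would mean the estimate really does have a hard floor
at_floor = np.isclose(error, error.min()).mean()
print(f"{at_floor:.1%} of pixels are at the minimum")
```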