Add an active learning module for deepforest
Copying the thread comment from BOEM:
I used binary-focused uncertainty rather than multi-class methods (like entropy sampling) since:
- Our model outputs single confidence scores, not multi-class probabilities (although I am unsure whether I should also add multi-class functionality).
- Binary margin sampling is computationally efficient and a better fit for our use case.
- Entropy sampling would yield nearly identical results for binary classification while adding unnecessary complexity.

The most uncertain samples (scores closest to 0.5) get prioritized for annotation, which should help improve model performance around the decision boundary, where it's currently struggling.
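As a concrete illustration, here is a minimal sketch of what a binary margin selector does, assuming each pool image has already been reduced to a single confidence score (the per-box aggregation question comes up further down the thread). The names are illustrative, not the actual module API:

```python
# Minimal sketch of binary margin sampling over per-image confidence scores.

def binary_margin(score: float) -> float:
    """Distance from the 0.5 decision boundary; smaller means more uncertain."""
    return abs(score - 0.5)

def select_most_uncertain(image_scores: dict, k: int) -> list:
    """Return the k images whose scores are closest to 0.5."""
    ranked = sorted(image_scores, key=lambda img: binary_margin(image_scores[img]))
    return ranked[:k]

# Example: the images nearest the decision boundary are queued for annotation first.
pool = {"a.tif": 0.91, "b.tif": 0.52, "c.tif": 0.47, "d.tif": 0.10}
print(select_most_uncertain(pool, k=2))  # ['b.tif', 'c.tif']
```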
I am also currently exploring another sampling technique, MC Dropout.
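For reference, MC Dropout usually means keeping dropout layers stochastic at inference and using the spread of repeated forward passes as an uncertainty estimate. A rough PyTorch sketch, assuming a model that actually contains `nn.Dropout` layers (the stock DeepForest detector may not, so dropout might need to be added first) and that maps an image tensor to a scalar score:

```python
import torch

def enable_mc_dropout(model: torch.nn.Module) -> None:
    """Put the model in eval mode but keep Dropout layers stochastic."""
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

def mc_dropout_uncertainty(model, image_tensor, passes: int = 20) -> float:
    """Variance of a per-image score across repeated stochastic forward passes.
    Assumes the model returns a scalar per image; adapting this to per-box
    detection outputs is the aggregation question discussed below."""
    enable_mc_dropout(model)
    with torch.no_grad():
        scores = torch.stack([model(image_tensor) for _ in range(passes)])
    return scores.var().item()
```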
I think it would be useful to briefly summarize the anticipated workflow here. I'm trying to think through what our minimal example would be. I would suggest starting with a pool of existing images, training a model, running sample selection to obtain new images (from a second pool), re-running training, and comparing metrics.
IMO the first sample selector we test should be random to establish a baseline. If we use an existing pool of images to test this, be careful about leaking data, so you might want to do something like the following (sketched in code after the list):
- Split NeonTreeEvaluation into 40/40/20 by site
- Take the first 40% as train/val (call this A)
- Take the second 40% as the sample selection pool (call this B)
- Take the last 20% as the holdout test set (call this C)
- Train a model on A, save the checkpoint
- Sample from B using this model for selection
- Train a model on the enlarged dataset, save the checkpoint
- Evaluate both on C and compare
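A rough sketch of that pipeline, assuming the NeonTreeEvaluation annotations sit in a single DataFrame with a `site` column alongside the usual DeepForest columns; the `config`/`create_trainer` calls follow the DeepForest 1.x training API and may need adjusting for other versions, and the file names in the outline are placeholders:

```python
import random
import pandas as pd
from deepforest import main

def split_by_site(annotations: pd.DataFrame, seed: int = 0):
    """Split 40/40/20 by site into A (train/val), B (selection pool), C (holdout)."""
    sites = list(annotations["site"].drop_duplicates())
    random.Random(seed).shuffle(sites)
    n = len(sites)
    groups = (sites[: int(0.4 * n)], sites[int(0.4 * n): int(0.8 * n)], sites[int(0.8 * n):])
    return tuple(annotations[annotations["site"].isin(g)] for g in groups)

def train_model(csv_file: str, root_dir: str) -> main.deepforest:
    """Train a fresh model on one annotation file (DeepForest 1.x-style config)."""
    model = main.deepforest()
    model.config["train"]["csv_file"] = csv_file
    model.config["train"]["root_dir"] = root_dir
    model.create_trainer()
    model.trainer.fit(model)
    return model

def random_selector(pool_images, k: int, seed: int = 0) -> list:
    """Baseline selector: k images drawn uniformly at random from pool B."""
    return random.Random(seed).sample(list(pool_images), k)

# Outline of the experiment:
# 1. a, b, c = split_by_site(annotations)
# 2. baseline = train_model("a.csv", root_dir)              -> evaluate on C
# 3. picked   = random_selector(b["image_path"].unique(), k=50)
# 4. retrain on A + the annotations for `picked`            -> evaluate on C
# 5. compare box precision/recall between the two checkpoints
```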
Then once that works, we could swap out different sample selectors and see what works best. And then consider expanding to include other datasets (e.g. from MillionTrees).
One challenge with active learning for object detection is how you aggregate uncertainty over the predictions. AL methods often assume a simple output, like image classification (multi-class or multi-label), that can be summarized as a scalar. But when you have maybe 100 predictions per image to deal with, how do you assess that? I looked into the literature for this a couple of years ago and found it a bit lacking; it would be useful to revisit that. Averaging over objects seems to be as common as anything else.
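Averaging per-box margins is one simple reduction. A sketch, assuming predictions come back as a DataFrame with a `score` column (as DeepForest's `predict_image` returns), with the reduction left as a parameter so mean/max/sum can be compared:

```python
import pandas as pd

def image_uncertainty(predictions: pd.DataFrame, reduction: str = "mean") -> float:
    """Collapse per-box uncertainty into one scalar per image.
    Per-box uncertainty is 1 - 2*|score - 0.5|, so 1.0 means right on the boundary."""
    if predictions.empty:
        # Convention choice: no detections at all is treated as maximally
        # uncertain; skipping such images is an equally defensible alternative.
        return 1.0
    per_box = 1.0 - 2.0 * (predictions["score"] - 0.5).abs()
    reductions = {"mean": per_box.mean, "max": per_box.max, "sum": per_box.sum}
    return float(reductions[reduction]())
```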
You could also go with diversity sampling and try to pick the "most different" looking images from the pool, ignoring predictions/model confidence entirely. But it's worth thinking about the physical/ecological interpretation too: suppose a user wants to fine-tune a model for their environment, and they have a few sample images. Perhaps a useful tool would be to sample a larger corpus of images (labelled or not) for similar-looking scenes that could be used for training? This doesn't seem quite the same as picking images where the model is unsure, because if you want to train a model that works in a tropical region for the lowest training cost, you probably care less whether you include images from boreal pine forest, even if the model isn't very good there.
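If we go the diversity route, one simple option is greedy farthest-point (k-center) selection on image embeddings. A sketch, leaving open where the embeddings come from (e.g. a pretrained backbone); `embeddings` is an (n_images, d) array aligned with `image_paths`:

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, image_paths: list, k: int) -> list:
    """Greedily pick k images that are maximally spread out in embedding space."""
    selected = [0]  # seed with an arbitrary image
    dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())  # image farthest from everything picked so far
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return [image_paths[i] for i in selected]
```

For the "find scenes that look like my site" use case described above, the same embeddings could instead be queried with nearest-neighbour search against the user's sample images, which is closer to that goal than uncertainty sampling is.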