
[FEATURE]: Enable masking of unlabeled portions of imagery during model training

Open nrweir opened this issue 5 years ago • 5 comments

This was a topic raised by an attendee at a workshop at FOSS4G International 2019 and is something we should consider implementing.

At the moment, we assume that all pixels in the input imagery are labeled when creating pixel masks and training models. Any unlabeled portion is assumed to have no objects of interest in it. There are some use cases where only a portion of an input image is labeled, and the user only wants the model to learn based on the labeled portion. There are a few ways we could support this:

  1. Enable selective tiling where unlabeled portions are filled with nodata values;
  2. Enable selective masking where unlabeled portions of masks are filled with NaN or a similar sentinel value;
  3. Find a way for loss functions to accept NA values in specific regions of the images so those regions are ignored during model training.

Of these options:

  1. Is good, though if there's some bias in the labels near the edges of the labeled area, it could introduce bias in the model (it may learn that some targets are more or less likely to occur near nodata values);
  2. Is good but would be hard;
  3. Is good but would be really hard, particularly as we'd have to rewrite every loss function, including any new one we added. To some degree we'd have to do the same for 2., though only to handle NaN inputs.

I'm inclined to go with 1. or 2. if we implement this. We'd definitely need tests implemented to make sure it works.
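To make options 2 and 3 concrete, here is a minimal NumPy sketch of a loss that skips unlabeled pixels. It assumes unlabeled pixels are stored as NaN in the label mask; the function name `masked_bce` and the toy arrays are hypothetical and are not part of solaris.

```python
import numpy as np


def masked_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over labeled pixels only.

    Pixels whose target is NaN are treated as unlabeled and skipped
    (an assumption for this sketch, not existing solaris behavior).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    labeled = ~np.isnan(target)  # True where a label exists
    if not labeled.any():
        return 0.0  # tile has no labeled pixels, so it contributes nothing
    # Clip probabilities so log() never sees exactly 0 or 1.
    p = np.clip(pred[labeled], eps, 1.0 - eps)
    t = target[labeled]
    loss = -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    return float(loss.mean())


# Hypothetical 2x3 tile: the right-hand column was never labeled.
target = np.array([[1.0, 0.0, np.nan],
                   [0.0, 1.0, np.nan]])
pred = np.array([[0.9, 0.1, 0.5],
                 [0.2, 0.8, 0.99]])
loss = masked_bce(pred, target)
```

Because the NaN pixels are dropped before the loss is averaged, predictions in the unlabeled column have no effect on the result. A framework-specific version would need the same boolean mask applied inside each loss function, which is the maintenance cost described above.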

nrweir avatar Aug 28 '19 11:08 nrweir

Hi @nrweir !

I'm also wondering how best to handle NaN values more generally, since I'm currently tossing scenes that have even just a little bit of no data values at the edges using an arbitrary threshold. Could we go with approach 2 but then fill the NaNs with the mean of the training dataset? This might make maintenance and training easier, and I've seen imputing missing data with the mean as a solution in other ML contexts (though not a lot of info on what people do with NaNs for geospatial ML). I'm also unfamiliar with how the different frameworks handle NaN values so the mean filling step might be unnecessary...
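As a rough illustration of the mean-filling idea, the sketch below replaces NaN pixels in an image with per-band means computed over the training set. It assumes images are `(bands, H, W)` arrays with nodata stored as NaN; `fill_nodata_with_mean` and the toy tiles are hypothetical, not solaris functions or data.

```python
import numpy as np


def fill_nodata_with_mean(image, band_means):
    """Replace NaN pixels in a (bands, H, W) image with per-band means."""
    image = np.asarray(image, dtype=np.float64)
    band_means = np.asarray(band_means, dtype=np.float64)
    # Broadcast each band's mean across its spatial dimensions.
    fill = np.broadcast_to(band_means[:, None, None], image.shape)
    return np.where(np.isnan(image), fill, image)


# Hypothetical training set: two single-band 2x2 tiles, nodata stored as NaN.
tiles = np.array([
    [[[1.0, np.nan], [3.0, 5.0]]],
    [[[np.nan, 7.0], [9.0, 11.0]]],
])
# Per-band mean over every valid pixel in the dataset (NaNs are ignored).
band_means = np.nanmean(tiles, axis=(0, 2, 3))
filled = fill_nodata_with_mean(tiles[0], band_means)
```

This only touches the input imagery; it does not say anything about how the corresponding label pixels should be treated.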

If we did handle nodata like this, I think the data filling and label masking step would come after both vector tiling and then running a label mask function on the vector tiles. This would be a bit of a chore and would probably also require a function to go from label masks to coco format (rather than vector tiles to coco format as is currently supported), but it would preserve more training data (which there is always a lack of in geospatial ML).

I'm curious whether you have any thoughts on the right approach for solaris; I'd love to hear them when you get the chance.

rbavery avatar Jan 14 '20 03:01 rbavery

Hey @rbavery. I think mean filling for images themselves is a valid option, but less so for labels. The approach described in 2 above is for the training objective masks (the labels), not the input images. The underlying thinking here is that we need a way to ignore data where no labeling was done, which we're not really doing if we're mean-filling.

nrweir avatar Jan 14 '20 14:01 nrweir

Got it, should have made it clear that what I'm discussing is a separate problem from filling nodata areas outside of an unlabeled boundary. To be more clear, I was discussing how to handle the case where labels are present but no valid image values exist (either due to edge effects or cloud masking). I've made a separate issue here: https://github.com/CosmiQ/solaris/issues/328

rbavery avatar Jan 14 '20 19:01 rbavery

I think we can close this @nrweir, solved by https://github.com/CosmiQ/solaris/pull/331

rbavery avatar Apr 04 '20 19:04 rbavery

I don't think this actually fully solves the intention of this request - specifically, to find a way to mask out areas from contributing to the loss function. So I'll leave it open for now.

nrweir avatar Apr 05 '20 19:04 nrweir