
Input data validation: define a uniform procedure

Open remtav opened this issue 2 years ago • 0 comments

PR #309 introduces an AOI object that performs quality control on any input data passed through it. In the current version, invalid data raises an exception.

What counts as invalid data in the current version (including modifications from #309)?

  • A raster that can't be opened with rasterio
  • A raster that contains a different number of bands than expected (e.g. actual is RGB, expected is RGBN)
  • A vector ground truth file that can't be opened with geopandas
  • A vector file that contains a different number of classes (unique values of a particular attribute field) than expected*

  *This last validation fails if a vector file contains only a subset of num_classes (e.g. 3 of 4). Should we only check that the number of expected classes is not smaller than the actual number (not vice versa)?
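The subset question above could be handled by validating set membership rather than exact counts. A minimal sketch (the function name and return shape are hypothetical, not part of #309): actual class values found in the attribute field are valid as long as none fall outside the expected set, so a partial subset passes.

```python
def check_classes(actual_values, expected_classes):
    """Compare class values found in a vector attribute field
    against the expected class set.

    Returns (ok, missing, unexpected):
      - ok: True if no actual value falls outside the expected set
      - missing: expected classes absent from the data (tolerated)
      - unexpected: actual values not in the expected set (invalid)
    """
    actual = set(actual_values)
    expected = set(expected_classes)
    unexpected = actual - expected
    missing = expected - actual
    # A subset (e.g. 3 of 4 expected classes present) is tolerated;
    # only values outside the expected set make the data invalid.
    return not unexpected, missing, unexpected
```

With this rule, a file containing classes {1, 2, 3} against expected {1, 2, 3, 4} validates, while a stray class 5 does not.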

Should other criteria determine if data is valid? Some ideas:

  • Imagery and ground truth bounds don't overlap at all
  • Imagery and ground truth bounds overlap below a certain threshold (e.g. 50%). This covers cases where the provided ground truth or imagery is accidentally only a partial fit with its counterpart.
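Both overlap criteria reduce to an intersection-area ratio between two bounding boxes. A sketch of that computation, using plain (minx, miny, maxx, maxy) tuples as returned by rasterio's `dataset.bounds` or geopandas' `total_bounds` (the function name and the choice to normalize by the first box's area are assumptions):

```python
def overlap_ratio(a, b):
    """Fraction of bounding box `a` covered by bounding box `b`.
    Boxes are (minx, miny, maxx, maxy) tuples."""
    # Width and height of the intersection rectangle; non-positive
    # values mean the boxes are disjoint on that axis.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0  # bounds don't overlap at all
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return (w * h) / area_a
```

The first criterion is then `overlap_ratio(...) == 0` and the second is `overlap_ratio(...) < threshold`, with the 50% threshold presumably made configurable.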

Some "data validation" logic questions:

  • Should errors from invalid data be caught, ignored during execution and reported to the user (e.g. a csv with some detail)?
  • Should this logic be applied equally in sampling (tiling) and inference? A uniform approach is much simpler.
  • Should an existing report bypass future validation for the same dataset? Some validation steps can be costly (e.g. validating individual features of a vector file), so this kind of bypassing could be worth it.
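The catch-and-report option from the first question could look like the sketch below (all names are hypothetical, not from the codebase): each check raises on invalid data, failures are collected instead of aborting the run, and the rows are written to a csv report.

```python
import csv

def run_checks(aois, checks):
    """Run each validation check on each AOI, catching failures
    into report rows instead of raising.

    `checks` is a list of (name, callable) pairs; a callable
    raises an exception when the AOI's data is invalid.
    """
    rows = []
    for aoi in aois:
        for name, check in checks:
            try:
                check(aoi)
            except Exception as err:
                rows.append({"aoi": str(aoi), "check": name, "error": str(err)})
    return rows

def write_report(rows, path):
    """Write collected validation failures to a csv report."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["aoi", "check", "error"])
        writer.writeheader()
        writer.writerows(rows)
```

An existing report file at `path` could then serve as the bypass signal from the third question: if it is present and newer than the dataset, skip the costly checks.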

remtav, May 24 '22 18:05