label-maker
Brainstorming replacing QA-Tiles
I need to rework my https://github.com/jremillard/images-to-osm project to use Mapbox tiles. The problem that label-maker is attempting to solve is right at the center of the planned rework. I just wanted to communicate what label-maker would look like if it were a perfect fit for my needs.
The input (training) data to label-maker should be a set of GeoJSON files. There is a rich, mature existing infrastructure for generating them from OSM and other data sources. They are easy to write code against in any language. Let other tools deal with it.
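As a minimal sketch of why GeoJSON input is easy to consume, features can be read with nothing but the standard library (the file name below is just an example):

```python
import json

def load_features(path):
    """Read a GeoJSON FeatureCollection (e.g. exported from OSM via
    ogr2ogr or osmtogeojson) and return its feature list."""
    with open(path) as f:
        collection = json.load(f)
    return collection["features"]

# Each feature carries its geometry plus OSM tags as properties:
# features = load_features("baseball_fields.geojson")
# features[0]["geometry"]["type"]  -> e.g. "Polygon"
```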
Label maker config would be:
- output zoom level OR a metric output resolution (e.g., 0.5 m/pixel).
- output image size for the training network (say 800x800), not constrained to an even tile boundary.
- data augmentation options (center object; randomly slide the object around; up/down and left/right flips; % scale change; edge buffer zone; allow clipped features; etc.).
- How many sample images to make.
- training/validation split %.
- Sat image TMS URL (someday support Bing when they can change the license).
- Max sat image cache size and directory; also need a max age for the sat image cache (Mapbox is 30 days).
- % of images to create that are negative samples (no objects in them).
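To make the wishlist above concrete, here is a hypothetical sketch of what such a config could look like as a Python dict. Every key name here is invented for illustration; none of them are actual label-maker options:

```python
# Hypothetical config sketch mirroring the bullet list above.
# All key names are made up for illustration.
config = {
    "zoom": None,                  # either a tile zoom level...
    "resolution_m_per_px": 0.5,    # ...or a metric ground resolution
    "image_size": (800, 800),      # training chip size, not tile-aligned
    "augmentation": {
        "center_object": False,
        "random_translate": True,
        "flip_up_down": True,
        "flip_left_right": True,
        "scale_jitter_pct": 10,
        "edge_buffer_px": 32,
        "allow_clipped_features": True,
    },
    "num_samples": 20000,
    "validation_split_pct": 20,
    "imagery_tms_url": "https://example.com/tiles/{z}/{x}/{y}.png",
    "cache": {"dir": "./tile_cache", "max_size_gb": 50, "max_age_days": 30},
    "negative_sample_pct": 25,
}
```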
The final output would be intermediate files (training and validation), not the training images.
When the network is training, the intermediate files can be opened up, and single images can be generated on the fly from a Python module. The Python module would handle either fetching and forming the training images or getting them from the sat image cache. It would stitch the sat images together, crop them correctly, and output bounding boxes, segmentation masks, and instance masks. Generating one image at a time would allow data sets that don't fit into memory, keep performance good, and avoid violating sat image caching license restrictions.
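A sketch of the tile math behind that on-the-fly generation, using the standard slippy-map formulas. The function names are hypothetical; real code would also fetch the tiles (or pull them from the cache) and stitch the pixel data:

```python
import math

TILE_SIZE = 256  # standard web-mercator tile size in pixels

def deg2tile(lat, lon, zoom):
    """Convert a WGS84 point to fractional XYZ (slippy map) tile coordinates."""
    n = 2 ** zoom
    x = (lon + 180.0) / 360.0 * n
    y = (1.0 - math.log(math.tan(math.radians(lat)) +
                        1.0 / math.cos(math.radians(lat))) / math.pi) / 2.0 * n
    return x, y

def tiles_for_chip(lat, lon, zoom, chip_px):
    """Return the inclusive tile range needed to cover a chip_px-wide window
    centered on (lat, lon), plus the pixel offset of the chip's top-left
    corner inside the stitched mosaic."""
    cx, cy = deg2tile(lat, lon, zoom)
    half = chip_px / 2 / TILE_SIZE  # half the chip width, in tile units
    x0, y0 = math.floor(cx - half), math.floor(cy - half)
    x1, y1 = math.floor(cx + half), math.floor(cy + half)
    off_x = int((cx - half - x0) * TILE_SIZE)
    off_y = int((cy - half - y0) * TILE_SIZE)
    return (x0, y0, x1, y1), (off_x, off_y)
```

An 800x800 chip spans about 3.1 tiles in each direction, so the module would typically download a 4x4 block, stitch it, and crop at the returned offset.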
If you want to be really nice to people, have an option to write out MS COCO-format files, since basically everyone is using that format right now for benchmarking.
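A minimal sketch of such an export, following the COCO annotation layout (images / annotations / categories arrays); the sample dict shape here is an assumption, not label-maker's actual intermediate format:

```python
import json

def write_coco(samples, categories, out_path):
    """Write a minimal COCO-style annotation file.
    Each sample is assumed to be a dict with file_name, width, height,
    and a list of boxes, where bbox is [x, y, w, h] in pixels."""
    images, annotations = [], []
    ann_id = 1
    for img_id, sample in enumerate(samples, start=1):
        images.append({
            "id": img_id,
            "file_name": sample["file_name"],
            "width": sample["width"],
            "height": sample["height"],
        })
        for box in sample["boxes"]:
            annotations.append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": box["category_id"],
                "bbox": box["bbox"],
                "area": box["bbox"][2] * box["bbox"][3],
                "iscrowd": 0,
            })
            ann_id += 1
    coco = {"images": images, "annotations": annotations,
            "categories": [{"id": i + 1, "name": n}
                           for i, n in enumerate(categories)]}
    with open(out_path, "w") as f:
        json.dump(coco, f)
```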
> training/validation split %.
I think when working with satellite imagery, it's important to have separate training and validation regions. If you just make a bunch of chips from one region and then randomly partition them, the validation chips may spatially overlap with some of the training chips, which would be "cheating".
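One way to sketch a spatially disjoint split is to assign whole grid cells to one side, so every chip in a cell lands in the same partition and no validation chip overlaps a training chip. The cell size and hashing scheme here are illustrative choices, not anything label-maker does:

```python
import hashlib

def split_by_region(lat, lon, cell_deg=0.05, val_fraction=0.2):
    """Deterministically assign the grid cell containing (lat, lon) to
    'train' or 'val'. All chips whose centers fall in the same
    cell_deg x cell_deg cell land on the same side of the split."""
    cell = (int(lat // cell_deg), int(lon // cell_deg))
    # Hash the cell id so the assignment is stable across runs.
    bucket = hashlib.sha1(repr(cell).encode()).digest()[0] / 255.0
    return "val" if bucket < val_fraction else "train"
```

Because the decision depends only on the cell, two overlapping chips can never straddle the train/validation boundary; the trade-off is that the realized split % only approximates `val_fraction`.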