
Add support for bounding boxes and instance masks to VectorDataset

Open mayrajeo opened this issue 6 months ago • 12 comments

Related to #2505, a proposal to add bbox_xyxy, segmentation and label to the return values of VectorDataset.__getitem__. After this, VectorDataset.__getitem__ returns

        sample = {
            'mask': masks, # As before
            'bbox_xyxy': boxes_xyxy, # (N, 4) tensor of bounding boxes as (xmin, ymin, xmax, ymax)
            'segmentation': segmentations, # A list of N tensors containing the COCO-format segmentations
            'crs': self.crs,
            'label': labels, # (N,) tensor containing the integer label of each object
            'bounds': query,
        }

Major downside: since the object detection-related values are returned by default, VectorDataset can't be used with most common collate_fns, such as stack_samples, if the batch size is larger than 1.

Example gist: https://gist.github.com/mayrajeo/1de0497dd82d2c9a2b6381ef482face8
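
As a rough workaround sketch (not part of this proposal), a custom collate function could stack the fixed-size entries and keep the variable-length detection entries as per-sample lists:

from typing import Any

import torch


def collate_detection_samples(samples: list[dict[str, Any]]) -> dict[str, Any]:
    """Collate samples whose detection entries differ in length per sample."""
    variable_keys = {'bbox_xyxy', 'label', 'segmentation'}
    batch: dict[str, Any] = {}
    for key in samples[0]:
        values = [sample[key] for sample in samples]
        if key in variable_keys:
            batch[key] = values  # keep as a list of per-sample tensors/lists
        elif isinstance(values[0], torch.Tensor):
            batch[key] = torch.stack(values)  # e.g. 'mask'
        else:
            batch[key] = values  # e.g. 'crs', 'bounds'
    return batch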

mayrajeo avatar Jun 06 '25 08:06 mayrajeo

@microsoft-github-policy-service agree

mayrajeo avatar Jun 06 '25 08:06 mayrajeo

@adamjstewart Thoughts on adding an "output_type" arg to VectorDataset? My only concern is that computing the mask/instance masks/boxes for each polygon can add unwanted overhead when I just want boxes, or just the mask.

isaaccorley avatar Jun 06 '25 13:06 isaaccorley

I'm not completely opposed. In fact, this is one of my implementation questions in #2505.

I guess I would be curious to know exactly how much overhead we are talking about. As long as the overhead is noticeable (10% slower) and the change is minimal (a few lines of code, plus a new parameter in all subclasses), then let's do it. The only downside of a parameter is that the return type becomes even more dynamic, and it's difficult to statically type check.

Note that Mask R-CNN (our only instance segmentation model) requires both boxes and masks.
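
For reference, torchvision's Mask R-CNN training targets look roughly like this (shapes only, with placeholder values), which is why both boxes and masks are needed:

import torch

num_objects, h, w = 3, 256, 256
target = {
    'boxes': torch.zeros((num_objects, 4), dtype=torch.float32),   # xyxy per object
    'labels': torch.ones(num_objects, dtype=torch.int64),          # class per object
    'masks': torch.zeros((num_objects, h, w), dtype=torch.uint8),  # binary mask per object
}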

adamjstewart avatar Jun 06 '25 14:06 adamjstewart

Yeah I was thinking output types would be [semantic, instance, boxes].

My concern is that instance masks have the ability to take up a ton of memory.

  • Semantic returns a mask of shape (num_classes, h, w)

  • Boxes aren't memory intensive: (num_objects, 4)

  • For Instance we are returning boxes plus instance masks of shape (num_objects, h, w)

If num_objects is large (which is common for small object detection), this can take up quite a bit of memory in my experience, so I imagine we shouldn't return all output types, only the one the user wants.
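
A back-of-the-envelope illustration (the object count and chip size below are made-up numbers, not from any real dataset):

num_objects, h, w = 500, 1024, 1024  # hypothetical small-object scene
bytes_per_element = 1                # e.g. torch.bool or torch.uint8 masks

instance_masks_mib = num_objects * h * w * bytes_per_element / 1024**2
boxes_kib = num_objects * 4 * 4 / 1024  # float32 tensor of shape (num_objects, 4)

print(f'instance masks: ~{instance_masks_mib:.0f} MiB')  # ~500 MiB
print(f'boxes: ~{boxes_kib:.1f} KiB')                    # ~7.8 KiB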

isaaccorley avatar Jun 06 '25 14:06 isaaccorley

I wasn't thinking about the difference between semantic and instance masks, that's a good point. Yeah, we should probably have a knob to control this then.

adamjstewart avatar Jun 06 '25 14:06 adamjstewart

Should the sample keys be consistent, so that masks is an empty list or tensor when semantic segmentation masks are not desired, or should they depend on the flags specified? Maybe the latter, because it could later be possible to add oriented bounding boxes, either in polygon format ([x1, y1, x2, y2, x3, y3, x4, y4]) or in detectron2-style ([x_center, y_center, w, h, rotation], -180 <= rotation < 180) annotation format, or keypoints if needed.

Something like


class VectorDataset(GeoDataset):
    ...
    def __init__(
        self,
        paths: Path | Iterable[Path] = 'data',
        crs: CRS | None = None,
        res: float | tuple[float, float] = (0.0001, 0.0001),
        transforms: Callable[[dict[str, Any]], dict[str, Any]] | None = None,
        masks: bool = True,
        boxes: bool = False,
        instances: bool = False,
        label_name: str | None = None,
        gpkg_layer: str | int | None = None,
    ) -> None:
        ...
        self.masks = masks
        self.boxes = boxes
        self.instances = instances
        ...

    def __getitem__(self, query: BoundingBox) -> dict[str, Any]:
        # Create things depending on self.masks, self.boxes and self.instances
        ...
        sample = {
            'crs': self.crs,
            'bounds': query,
        }
        if self.masks:
            sample['mask'] = masks
        # etc...

maybe?

mayrajeo avatar Jun 09 '25 05:06 mayrajeo

I think it should depend on the flags specified, although I don't think we should have a flag; it should be a list of output types that a user can specify.

For OBB I prefer the polygon approach like (x1, y1, ..., x4, y4) instead of rotation angles, just because there are several definitions of the rotation angle, and keeping it in polygon format makes conversion to/from shapely/geopandas easy.

isaaccorley avatar Jun 09 '25 11:06 isaaccorley

The sample keys aren't up to us; they need to follow the format expected by Kornia. Note that Kornia does not yet support OBB, so we need to add this to Kornia first.

adamjstewart avatar Jun 09 '25 12:06 adamjstewart

For the output types, would output_type: List[str] = ['masks'], with possible keys ['masks', 'boxes', 'instances'], be good? Though masks and instances are mutually exclusive, as Kornia seems to expect the masks key for both semantic and instance masks, correct? Another option might be to specify a task, one of segmentation, object_detection, or instance_segmentation, since there aren't many cases where only instance masks are needed, right?

Note that Kornia does not yet support OBB, so we need to add this to Kornia first.

They can be thought of as polygons, so they only need to be converted into binary masks. Getting them from polygons with shapely is easy (oriented_envelope), but they can be omitted for now.
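
A small sketch (assuming Shapely >= 2.0; the example polygon is made up) of getting the oriented envelope of a polygon as flat (x1, y1, ..., x4, y4) coordinates:

import shapely
from shapely.geometry import Polygon

poly = Polygon([(0, 0), (4, 1), (3, 5), (-1, 4)])

# Smallest rotated rectangle containing the polygon.
obb = shapely.oriented_envelope(poly)

# The exterior ring is closed (first vertex repeated), so drop the last vertex
# to get the four corners as a flat coordinate list.
corners = list(obb.exterior.coords)[:-1]
flat_obb = [c for xy in corners for c in xy]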

mayrajeo avatar Jun 10 '25 07:06 mayrajeo

See https://github.com/kornia/kornia/blob/v0.8.1/kornia/constants.py#L142 for a list of valid keys. Our TorchGeo object detection trainer is set up to look for bbox_xyxy and label, and our instance segmentation trainer is set up to look for bbox_xyxy, label, and mask. So we should use these keys in the return dict. I'm leaning towards:

task: Literal['object_detection', 'semantic_segmentation', 'instance_segmentation'] = 'semantic_segmentation'

for the __init__ parameter and default, although those names are quite long.
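
As a sketch of the idea (the mapping below just restates which keys each trainer expects, it is not the actual implementation):

from typing import Literal

Task = Literal['object_detection', 'semantic_segmentation', 'instance_segmentation']

# Kornia-style sample keys per task, returned together with 'crs' and 'bounds'.
KEYS_PER_TASK: dict[Task, tuple[str, ...]] = {
    'semantic_segmentation': ('mask',),
    'object_detection': ('bbox_xyxy', 'label'),
    'instance_segmentation': ('bbox_xyxy', 'label', 'mask'),
}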

adamjstewart avatar Jun 15 '25 10:06 adamjstewart


They are long but it's likely better to be explicit here tbh. In the future we can add support for multiple output types but let's keep it simple for now. A user can always make separate datasets with different output types and join them.

isaaccorley avatar Jun 16 '25 17:06 isaaccorley

Added tasks, examples here: https://gist.github.com/mayrajeo/596c43c0a0c815e6fc09c1f8e20136cd

mayrajeo avatar Jun 17 '25 10:06 mayrajeo

Great work! Just a few minor comments but we are very close to being done here 🚀

isaaccorley avatar Jun 22 '25 18:06 isaaccorley

Good to see this. Probably covered elsewhere, but be sure to test images with no annotated boxes/masks - Kornia throws errors if empty masks are returned. In my own implementation I have an arg to control the % of empty images returned.
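
Something along these lines could cover it (a hedged sketch only; the fixture name and the exact empty-sample shapes are assumptions about the proposed API):

def test_query_without_features(dataset_without_features) -> None:
    """A query over an area with no annotations should return empty targets."""
    sample = dataset_without_features[dataset_without_features.bounds]
    assert sample['bbox_xyxy'].shape == (0, 4)
    assert sample['label'].numel() == 0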

Also, I can anticipate training an instance segmentation model with an object-detection-annotated dataset: I did this with Airbus ships and the model still learned to contour the ship masks quite well. It would be nice to have this option, perhaps off by default.

Another thing I learned: when randomly cropping an image, it is useful to use the segmentation to determine the object extent, as the box won't always capture the extent of the object in the crop. So an accurate segmentation mask is very desirable even for object detection.

robmarkcole avatar Jun 23 '25 05:06 robmarkcole

This looks fantastic now! I think we've settled on a good API. Just need to polish and add tests.

adamjstewart avatar Jun 24 '25 20:06 adamjstewart

Added some tests checking that VectorDataset works as it's supposed to. It's still open how much we need to test, e.g. Kornia and empty masks. Existing datasets should work as before, as the default task is semantic_segmentation, which works exactly as before.

As @robmarkcole suggested, it is possible to add a parameter determining how small a fragment to keep after sampling, such as a minimum pixel area or a minimum percentage of the original polygon or bounding box. The question is, which one? Also, using task='instance_segmentation' should work with object detection workflows, as it also returns bbox_xyxy. A separate object_detection task is pretty much only useful for not generating the masks when they are not needed. With the current implementation, if the input shapefile has only boxes and task='instance_segmentation', then the output masks are rasterized bounding boxes.
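
Whichever criterion is chosen, the filter itself would be simple; a hypothetical minimum-pixel-area version (the parameter name and shapes are assumptions, not part of the actual implementation):

import torch


def filter_small_fragments(
    masks: torch.Tensor,    # (num_objects, h, w) boolean instance masks
    boxes: torch.Tensor,    # (num_objects, 4) boxes in xyxy format
    labels: torch.Tensor,   # (num_objects,) integer labels
    min_pixel_area: int = 4,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Drop objects whose cropped mask covers fewer than min_pixel_area pixels."""
    keep = masks.flatten(1).sum(dim=1) >= min_pixel_area
    return masks[keep], boxes[keep], labels[keep]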

mayrajeo avatar Jul 01 '25 13:07 mayrajeo

Of those 2 options, min pixel count could be better; otherwise a 4-pixel annotation could end up cropped to a single pixel.

robmarkcole avatar Jul 01 '25 13:07 robmarkcole

Another possible challenge might be MultiPolygons. In the case of multiple non-overlapping polygons, they can be separated with something like

from copy import deepcopy


def separate_multipolygons(shapes_list: list) -> list:
    """Separate multipolygons in `shapes_list` into individual polygons
    and duplicate their labels.

    Args:
        shapes_list: list of tuples, where each tuple is (shape, label)

    Returns:
        A list of (shape, label) tuples, with multipolygons removed

    .. versionadded:: 0.8
    """
    out_shapes = []

    for s in shapes_list:
        feat, label = s
        if feat['geometry']['type'] == 'MultiPolygon':
            # Turn each part of the multipolygon into its own Polygon feature
            # that keeps the original label.
            for coords in feat['geometry']['coordinates']:
                new_feat = deepcopy(feat)
                new_feat['geometry'] = {'type': 'Polygon', 'coordinates': coords}
                out_shapes.append((new_feat, label))
        else:
            out_shapes.append(s)
    return out_shapes

and

...
case 'object_detection': # or case 'instance_segmentation'
    # Explode possible multipolygons to individual targets
    shapes = separate_multipolygons(shapes)

    # Get boxes for object detection or instance segmentation
    px_shapes = [
        convert_poly_coords(
            shapely.geometry.shape(s[0]), transform, inverse=True
        )
        for s in shapes
    ]
...

But polygons with holes are a completely different case.
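
For the rasterized-mask side, rasterio's rasterize does handle interior rings, as in this made-up example; the COCO-style polygon segmentation lists are the trickier part:

from rasterio.features import rasterize
from rasterio.transform import from_origin

# GeoJSON polygon with one interior ring (a hole); geometry and grid are made up.
polygon_with_hole = {
    'type': 'Polygon',
    'coordinates': [
        [(0, 0), (8, 0), (8, 8), (0, 8), (0, 0)],  # exterior ring
        [(3, 3), (5, 3), (5, 5), (3, 5), (3, 3)],  # interior ring (hole)
    ],
}

transform = from_origin(0, 8, 1, 1)  # 1 map unit per pixel, top-left corner at (0, 8)
mask = rasterize([(polygon_with_hole, 1)], out_shape=(8, 8), transform=transform)
print(mask)  # pixels inside the hole stay 0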

mayrajeo avatar Jul 03 '25 06:07 mayrajeo

My preference would be to keep things simple for now and add complexity only when we encounter a dataset that needs it. This also makes testing easier.

Just need to resolve remaining mypy errors and test coverage. Let us know if you need help with this.

adamjstewart avatar Jul 06 '25 16:07 adamjstewart

I'm on vacation and away from my computer for the next month or so, so it would be great if someone could resolve the mypy errors.

mayrajeo avatar Jul 07 '25 05:07 mayrajeo

Some changes I made:

  • updated tests and coverage
  • forced labels to be int32 and bbox_xyxy to be float32. I can already see a user getting a phantom error because they set the dataset dtype to uint8 and their boxes keep getting clipped to [0, 255], so this fixes that
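
A tiny illustration of the failure mode being avoided (made-up coordinates):

import torch

boxes = torch.tensor([[10.0, 20.0, 700.0, 900.0]])
print(boxes.to(torch.uint8))    # 700 and 900 don't fit in uint8 and get silently corrupted
print(boxes.to(torch.float32))  # keeps the true pixel coordinates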

isaaccorley avatar Jul 14 '25 21:07 isaaccorley

Closing/reopening to try to bump the CLA bot...

adamjstewart avatar Jul 15 '25 18:07 adamjstewart