RasterDataset intersection loads multiple files if overlapped
As NAIP tiles have an overlap, the intersection call in in __getitem__ returns multiple filepaths if a patch is loaded on the edge of a tile.
However, at least one (or multiple files) contain the query completely if e.g. RandomBatchSampler is used as it only samples inside the bounds of a hit.
In the worst case the patch is loaded four times instead of once. Filtering out filepaths if one of the tiles completely contains the patch yields about a 2x performance increase.
There may be good reasons to merge overlap, e.g. the overlap is needed for accuracy or something.
I am not sure this is true for NAIP (or other datasets), an option to only load it once would be good for performance. Maybe this could also be handled by filtering the bounds directly in the __init__.
My understanding with NAIP is that the tiles are arranged in a grid like:
+---+---+
| A | B |
+---+---+
| C | D |
+---+---+
So if the query patch is in the top right corner of C, index.intersection will return all 4 tiles (rtree considers two rectangles with a shared edge or corner to be overlapping even if the area of overlap is 0). However, this check in RandomBatchGeoSampler should make sure that only patches with area > 0 are considered. Are you using TorchGeo 0.2.1 or newer?
Yes it is a grid but the tiles overlap (I am not sure if there is a good reason for that).
At least they do when I checked on https://earthexplorer.usgs.gov/ for California.
I actually added a small check to __get_item__ that only uses a single filepath if the hit fully contains the query which leads to the performance improvement.
Gotcha. I think your solution may work for NAIP because we know that the image tiles are perpendicular to the axes of the CRS they are stored in. However, I don't think your solution would work in general. Imagine something like Landsat where each image tile is actually rotated with nodata values filling in the corners of the image:
We have no way to access the 4 corners of the image itself, only the 4 corners of the bounding box of that image. So even if 4 tiles all overlap each other and the query is fully included in all 4 tiles, only one of those tiles may actually have any data for that query. That's the real reason we are merging all hits together.
Thanks, that's exactly what I was wondering. So there is a good case for overlap of the images.
A possible solution may be to load the hit the query originates from first and if not all data is retrieved to merge it with the other intersection hits.
I am unsure what the performance would be of that approach but the performance hit in NAIP is quite significant so if there are other datasets with overlap this may be worth it in the end. (could also be a flag depending on dataset)
Another situation might be trying to composite images from different times. In that case, even if both patches lack nodata values, one might be preferred over the other if one is cloudy. Not sure if rasterio.merge has any heuristics that try to handle something like that.
I'll keep this issue in mind. I'm trying not to reinvent the wheel too much, but maybe a heuristic like what you're proposing could be added directly to rasterio.
Hmm, I see that does make it more difficult in the general case.
A good way would maybe be to add something like a collate_fn for selecting the filepaths (so in the NAIP case select one that contains the query fully) and a second function to merge them (in the case you described selecting the one with the least clouds or averaging over time etc., in the standard case just rasterio.merge).