stackstac
stackstac copied to clipboard
stackstac.stack doesn't read overviews
The documentation implies that setting resolution
uses overview, but as far as I can see, that it doesn't.
It seems that stackstac.stack
doesn't use or can read COG overviews. This prevents several use cases such a large composites. The parameter resolution
seems to be an output spec, not an input or read spec.
I can test this by the timings here (as a one notebook here) trying to get the same resolution from a large COG:
import stackstac
from pystac_client import Client
import planetary_computer as pc
catalog = Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1",
modifier=pc.sign_inplace,
)
eiffel={"coordinates": [2.3,48.85],"type": "Point"}
Jan2021={"interval": ["2021-01-01", "2021-01-31"]}
search = catalog.search(filter_lang="cql2-json", filter={
"op": "and",
"args": [
{"op": "s_intersects", "args": [{"property": "geometry"}, eiffel]},
{"op": "anyinteracts", "args": [{"property": "datetime"}, Jan2021]},
{"op": "=", "args": [{"property": "collection"}, "sentinel-1-rtc"]}
]})
items = search.item_collection()
and then use stackstac.stack
:
data = (
stackstac.stack(
items[0],
assets=["vv"],
chunksize=512,
resolution=640,
epsg=32643
))
data
%%time
data.squeeze().plot.imshow(vmin=0, vmax=0.4)
Yields
CPU times: user 43.8 s, sys: 10.3 s, total: 54.1 s
Wall time: 2min 2s
Versus
import rioxarray as rx
data2=rx.open_rasterio(items[0].assets['vv'].href,overview_level=5)
print(data2.rio.resolution())
data2.squeeze().plot.imshow(vmin=0, vmax=0.4)
yields
(640.8371040723982, -640.7645259938838)
CPU times: user 39.1 ms, sys: 695 µs, total: 39.8 ms
Wall time: 70.6 ms
cc @TomAugspurger
More specifically, my issue is to create a stackstac that reads overviews, so I can for example create a median value composite over a country-size roi reading overviews.
Slightly simpler example:
import pystac
import planetary_computer
import rioxarray
import stackstac
item = planetary_computer.sign(pystac.read_file("https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-1-rtc/items/S1A_IW_GRDH_1SDV_20210129T174036_20210129T174101_036356_044426_rtc"))
href = item.assets["vv"].href
ds = rioxarray.open_rasterio(href, overview_level=5)
ds2 = stackstac.stack([item], resolution=((640.8371040723982, 640.7645259938838)))
With rasterio reading the overview explicitly:
%time _ = ds.load()
CPU times: user 3.3 ms, sys: 7.8 ms, total: 11.1 ms
Wall time: 84.5 ms
Via stackstac
%time _ = ds2.load()
CPU times: user 1min 6s, sys: 11.1 s, total: 1min 17s
Wall time: 1min 30s
Next step would likely be to inspect the VRT that stackstac is building to figure out why GDAL isn't using the overviews when reading via it.
I'm not sure how stackstac is using VRT but if it is using rasterio.vrt.WarpedVRT
it needs be aligned with the output bounds bounds in order to fetch the overviews.
The key to triggering use of the overviews is this: the resolution of your output, defined by out_shape in your case, must be smaller than the VRT resolution. For example, if your VRT is 1024 x 1024 pixels, you could expect to use any 2x overviews when you call read(out_shape=(3, 512, 512)).
I have an example at https://github.com/mapbox/rasterio/blob/master/tests/test_warpedvrt.py#L329 of external overviews being used when decreasing the resolution of read data by a factor of two.
ref: https://github.com/rasterio/rasterio/issues/1373#issuecomment-398261502
Thanks Vincent! That's good to know.
A couple modifications to my example, which should get us closer (but won't necessarily work for the full example):
ds2 = stackstac.stack([item], resolution=((640.8371040723982, 640.7645259938838)), assets=["vv"], dtype="float32", bounds=ds1.rio.bounds(), snap_bounds=False).squeeze()
ds2
Still slow. This (maybe?) gets the VRT that stackstac is building. It's perhaps not identical, since I'm not getting it directly out of the task graph. Instead, I'm rebuilding it and might have messed something up:
import numpy as np
import rasterio.enums
dsk = ds2.__dask_graph__()
asset_table = dsk[list(dsk)[0]]
spec = ds2.attrs["spec"]
resampling = rasterio.enums.Resampling.nearest
dtype = "float32"
fill_value = np.nan
rescale = True
gdal_env = None
errors_as_nodata = ()
reader = stackstac.rio_reader.AutoParallelRioReader
x = stackstac.to_dask.asset_table_to_reader_and_window(
asset_table,
spec,
resampling,
dtype,
fill_value,
rescale,
gdal_env,
errors_as_nodata,
reader,
)
reader, window = x[0, 0]
It does look like the WarpedVRT construction behind the scenes results in many GET requests to the full-resolution data rather than using an overview, a simple test is to check out the rasterio logs trying to just extract a single pixel value:
import logging
logger = logging.getLogger('rasterio')
logger.level = logging.DEBUG
logger.addHandler(logging.StreamHandler())
item = planetary_computer.sign(pystac.read_file("https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-1-rtc/items/S1A_IW_GRDH_1SDV_20210129T174036_20210129T174101_036356_044426_rtc"))
href = item.assets["vv"].href
# Fixing the metadata (see linked issue above)
item.properties['proj:shape'] = [20953, 28325]
item.properties['proj:bbox'] = [325900.0, 5265850.0, 609150.0, 5475380.0]
da = stackstac.stack([item],
resolution=((640.8371040723982, 640.7645259938838)),
assets=["vv"],
dtype="float32",
xy_coords='center',
snap_bounds=False
).squeeze()
# 327 GET Requests! (should be 2-3)
with rasterio.Env(CPL_CURL_VERBOSE=True):
da.isel(x=0,y=0).load()
I wonder if setting resolution only should just avoid WarpedVRT altogether (with a check that all items have same EPSG)?
Or perhaps stackstac.stack()
could take keyword arguments like overview_level
that are passed to rasterio.open()
which is what rioxarray does?
Interesting, thanks for reporting this @brunosan.
Basically sounds like GDAL / rasterio only uses overviews when an out_shape
argument is passed to read
. I'll call this "defining the resolution at read
time".
stackstac is currently constructing a VRT up front which defines both the output CRS and the output resolution. So we're defining the resolution at dataset-open time, instead of at read time. The downscaling is defined in by the VRT, not by the read
call.
I'd assumed these would be have equivalently, but it sounds like we actually need to construct the VRT at full resolution (and only if we need to change CRSs, otherwise no need for a VRT at all), then call read
with an out_shape
for every chunk.
Thanks everyone for the digging. In the meantime, for reference of others who need a workaround. It seems that ODC-STAC can handle these cases reading overviews
import odc.stac,odc
import numpy as np
import planetary_computer
import pystac
item = planetary_computer.sign(pystac.read_file("https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-1-rtc/items/S1A_IW_GRDH_1SDV_20210129T174036_20210129T174101_036356_044426_rtc"))
ds = odc.stac.load(
[item],
chunks={},
bands="vv",
crs="EPSG:32643",
resolution=odc.geo.Resolution(640, -640)).where(lambda x: x > 0, other=np.nan)["vv"]
%time _ = ds.load()
CPU times: user 36.6 ms, sys: 27.7 ms, total: 64.3 ms
Wall time: 319 ms
Yeah, odc-stac is explicitly picking an overview level:
https://github.com/opendatacube/odc-stac/blob/da03bde0ffafb158e8f56429c9b5667ee38901e3/odc/stac/_reader.py#L229
Additionally, odc-stac is reading directly through a warp when reprojection is required, rather than a VRT like stackstac: https://github.com/opendatacube/odc-stac/blob/da03bde0ffafb158e8f56429c9b5667ee38901e3/odc/stac/_reader.py#L144-L152
@Kirill888 / @vincentsarago, do you know if there's an advantage to using warp directly for reprojection, instead of a VRT?
do you know if there's an advantage to using warp directly for reprojection, instead of a VRT?
I looked at it properly way before gdal 3, back then I could not convince VRT to use larger block size than 256x256, and at that size it was painfully slow.
As far as overviews go, I could not reliably trigger use of overviews when combining read with projection change, only reliable way to force use of overview was to read at the native resolution of the overview itself. ~Sentinel-1 uses GCPs instead of Affine mapping, I'm guessing that would the reason for GDAL to go to native unless explicitly using specific overview image.~
EDIT: I see this is related to S1 RTC which are not GCP
On my side, I know we spent quite some time making sure we were getting data from the overview (see Sean's comment https://github.com/rasterio/rasterio/issues/1373#issuecomment-398261502) (other interesting link https://github.com/cogeotiff/rio-tiler/pull/132)
Using WarpedVRT works to fetch the overview but you need to take extra care about how you construct the WarpedVRT (see Sean's comment and https://github.com/cogeotiff/rio-tiler/blob/25fe5b30e743388027f3ae1f0cc96177cd6996fe/rio_tiler/utils.py#L223-L298). Something I just changed in rio-tiler, is that when no reprojection is needed we don't use WarpedVRT anymore, but I'm not sure it changes a lot the performance 🤷
As for
do you know if there's an advantage to using warp directly for reprojection, instead of a VRT?
I don't. It would be great to see some benchmark (maybe using https://github.com/developmentseed/tilebench)
I just added some tests in rio-tiler to make sure we're fetching overview (even when using WarpedVRT or a WarpedVRT): https://github.com/cogeotiff/rio-tiler/blob/main/tests/test_overview.py
I've made 2 files where each IFD has it's own value, so it's pretty easy to know which IFD you are reading:
- COG with internal gcps: https://github.com/cogeotiff/rio-tiler/blob/main/tests/fixtures/cog_gcps_ovr.tif
- Simple COG: https://github.com/cogeotiff/rio-tiler/blob/main/tests/fixtures/cog_ovr.tif
Any updates on this? I think from a user's perspective being able to leverage overviews would be of great importance.