torchgeo
Retrieve Sentinel Imagery as corresponding input for Geospatial datasets
One thing that I find not so straightforward with TorchGeo is finding/using input imagery for the various geospatial datasets that only provide labels (this could also just be me, as someone who does not come from a geospatial data background).
In the Getting Started tutorial for NAIP data, the NAIP tiles seem to just conveniently work with the Chesapeake data. Thus, I was wondering how to retrieve Sentinel imagery to use with the many different geospatial datasets. My thought process is that one might approach this with the following steps:
- define a desired polygon area of lat/lon coordinates
- split the large polygon area into smaller, manageable polygons, so as not to retrieve one gigantic image array at once
- use STAC API with Planetary Computer to retrieve Sentinel images for the smaller polygons that approximate the large polygon
- save the images with all necessary metadata, and then
- use TorchGeo Sentinel2 class to define a dataset based on retrieved images
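The splitting step above can be sketched in plain Python: a hypothetical `split_bbox` helper (not TorchGeo API; the name and grid size are illustrative) that divides a large lat/lon bounding box into an n x n grid of smaller boxes, each of which could be passed to a STAC search individually:

```python
def split_bbox(minx, miny, maxx, maxy, n=4):
    """Split a lat/lon bounding box into an n x n grid of smaller boxes.

    Returns a list of (minx, miny, maxx, maxy) tuples that together
    cover the original box, so each one can be queried separately
    instead of requesting one gigantic array at once.
    """
    dx = (maxx - minx) / n
    dy = (maxy - miny) / n
    return [
        (minx + i * dx, miny + j * dy, minx + (i + 1) * dx, miny + (j + 1) * dy)
        for i in range(n)
        for j in range(n)
    ]

chips = split_bbox(10.0, 50.0, 11.0, 51.0, n=4)  # 16 chips of 0.25 x 0.25 degrees
```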
I made the following gist as an attempt to do the above steps 1-4, but am not sure about the following:
- Whether the approach to define a desired chip size is correct
- The resolution passed to stackstac should be in CRS units; I found this definition for 10 m resolution in the WGS84 projection, but the saved tif has a resolution of only 1 m
- How to find the proper image bounds that can be used to populate the R-tree index for the Sentinel2 raster dataset?
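As a rough sanity check on the resolution question: in EPSG:4326 the CRS unit is degrees, and one degree of latitude is about 111,320 m, so a 10 m pixel corresponds to roughly 9e-5 degrees (a ballpark figure that ignores longitude shrinking with latitude):

```python
# Approximate conversion of a 10 m pixel size into degrees (EPSG:4326).
# 1 degree of latitude ~ 111,320 m; this ignores the cos(latitude)
# factor for longitude, so it is only a rough estimate.
METERS_PER_DEGREE = 111_320
res_degrees = 10 / METERS_PER_DEGREE
print(res_degrees)  # ~8.98e-05
```

If the resolution handed to stackstac is off by this kind of factor, the written GeoTIFF will report a very different pixel size than intended.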
I thought about adding this as a tutorial, or as a script to use for Sentinel data, if this makes sense. However, for that the steps would of course have to be correct, so I would be thankful for feedback or other suggestions on how to approach this differently.
I think there are many approaches to do this, including the steps you listed above. @calebrob6 can confirm whether those steps/code look correct.
- split the large polygon area into smaller, manageable polygons, so as not to retrieve one gigantic image array at once
At first glance, I don't think this step is necessary. Most Sentinel/Landsat/NAIP/Maxar datasets are provided as a collection of tiles (10K x 10K px), not as a single map of the world. These tiles can be downloaded in bulk or one at a time.
Depending on which dataset you're interested in, data can be downloaded from many possible locations:
- Planetary Computer (requires you to apply for an account and get approval) - Several datasets
- USGS Earth Explorer - Many different datasets
- Copernicus Open Access Hub - Sentinel-only
- Google Earth Engine, AWS, GCP, Azure, and other cloud providers - Several datasets
- etc.
Many of these tools have their own built-in search tools for specifying location, time, and other metadata like cloud cover percentage. I think a better question is how much data you need and how much storage you have. For larger applications, you'll likely need to use something like AWS/GCP/Azure that already have the data downloaded.
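To illustrate such a search, here is a sketch using pystac_client against the Planetary Computer STAC endpoint, filtering by bounding box, time range, and cloud cover. The function is only defined, not called, since actually running it needs network access; the parameter names and defaults are illustrative:

```python
def find_s2_items(bbox, datetime_range, max_cloud_cover=20):
    """Search the Planetary Computer STAC API for Sentinel-2 L2A items.

    bbox: (minx, miny, maxx, maxy) in lat/lon (EPSG:4326)
    datetime_range: e.g. "2022-06-01/2022-06-30"
    """
    # Imports are deferred so the sketch can be read (and the function
    # defined) without these optional dependencies installed.
    import planetary_computer
    import pystac_client

    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1",
        modifier=planetary_computer.sign_inplace,  # sign asset URLs
    )
    search = catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=list(bbox),
        datetime=datetime_range,
        query={"eo:cloud_cover": {"lt": max_cloud_cover}},
    )
    return list(search.items())
```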
In the Getting Started Tutorial for NAIP data, it seems like the NAIP tiles just conveniently work with the Chesapeake data.
The reason we chose NAIP + Chesapeake is that they have the same spatial resolution (NAIP was actually used to create Chesapeake I believe). You can downsample/upsample other datasets to combine them, but you might lose information. So if you have a dataset like CDL (30 m/px), you're better off combining it with Landsat (30 m/px) than Sentinel.
Thanks for your fast reply.
- USGS Earth Explorer - Many different datasets
I first tried USGS Earth Explorer, but the GeoTiffs one can download there have 20 x 20 m resolution, not 10 m.
The reason we chose NAIP + Chesapeake is that they have the same spatial resolution (NAIP was actually used to create Chesapeake I believe). You can downsample/upsample other datasets to combine them, but you might lose information. So if you have a dataset like CDL (30 m/px), you're better off combining it with Landsat (30 m/px) than Sentinel.
Yes, the resolution part is clear. I meant more: how would one know to download those 4 specific tiles as corresponding inputs for Chesapeake, and how should one go about obtaining inputs in general?
I think a better question is how much data you need and how much storage you have.
Thanks, I think given the application of Deep Learning models to this kind of data, AWS/GCP/Azure is more what I should look into.
Why do you need to have the S2 tiles split up? (I think this will make the process much more complicated).
This may help -- I have the following method that takes a S2 STAC Item URL (from the Planetary Computer) and writes some of its bands to a COG:
import planetary_computer as pc
import pystac
import rasterio
import stackstac

def write_geotiff_from_s2_item_url(output_fn, item_url, bands=["B04", "B03", "B02"]):
    """Creates a COG from a S2 item URL (found on the PC explorer).

    Args:
        output_fn: path to output file
        item_url: STAC Item URL, can be found with the Planetary Computer explorer, e.g. 'https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2A_MSIL2A_20220108T083331_R021_T37TCN_20220110T143814'
        bands: a sequence of S2 bands, defaults to RGB bands, some of ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B11', 'B12', 'B8A']
    """
    item = pystac.Item.from_file(item_url)
    signed_item = pc.sign(item)  # sign the asset URLs so they are readable
    stack = stackstac.stack(
        signed_item,
        assets=bands,
        resolution=10,
    )
    stack = stack.compute()
    num_times, num_channels, height, width = stack.shape
    assert num_times == 1
    profile = {
        "driver": "COG",
        "height": height,
        "width": width,
        "transform": stack.transform,
        "dtype": "uint16",
        "nodata": None,
        "tiled": True,
        "compress": "lzw",
        "predictor": 2,
        "interleave": "pixel",
        "crs": stack.crs,
        "count": num_channels,
    }
    with rasterio.open(output_fn, "w", **profile) as f:
        f.write(stack.data[0])
Once you have local S2 files, you can use them in whichever GeoDataset you'd like.
Why do you need to have the S2 tiles split up? (I think this will make the process much more complicated).
I don't think I need the S2 tiles split up. However, when retrieving imagery for a larger area (as in the example image) and passing the lat/lon coordinate polygon to the intersects argument of the catalog search, it returns a huge array that my machine cannot compute(). Plus, since I would take the median over a specific time range to obtain a "cloud free" image, the array is not just large in X and Y. Thus, my idea was to split up the large area polygon seen in the image into smaller ones that get passed to intersects, decreasing the X and Y dimensions to something the machine can actually compute().
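A quick back-of-the-envelope calculation shows why the full array blows up. A 1 x 1 degree area at 10 m resolution is roughly 11,000 x 11,000 pixels, and holding several bands and timesteps in memory for the per-pixel median multiplies that further (the band/timestep counts below are illustrative assumptions, not numbers from the thread):

```python
# Rough size estimate for one uncompressed in-memory stack.
# Assumptions (illustrative): 1 degree ~ 111 km, 10 m/px, 4 bands,
# 24 timesteps (for a per-pixel median), 2 bytes/px (uint16).
pixels_per_side = 111_000 // 10          # ~11,100 px per degree
bands, timesteps, bytes_per_px = 4, 24, 2
total_bytes = pixels_per_side**2 * bands * timesteps * bytes_per_px
print(f"{total_bytes / 1e9:.1f} GB")     # ~23.7 GB
```

Splitting the area (or the time range) divides this figure accordingly, which is why the chunked approach fits in memory where the single large request does not.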
I recently had this need. I used eodag to find the corresponding S2 tile metadata, then downloaded the whole SAFE products from Google Storage. These SAFE products can then be read by the Sentinel2 GeoDataset.
pseudo-code:
import os

import geopandas
from eodag import EODataAccessGateway

dag = EODataAccessGateway()
cmds = []
for file in geospatial_files:
    if is_vector(file):
        gdf = geopandas.read_file(file)
        extent = gdf.to_crs(4326).unary_union.wkt
    elif is_raster(file):
        # get extent using rasterio and
        # use geopandas to convert to wgs84 (epsg:4326) and wkt
        ...
    # Optionally extract timestamp from filename or set time range yourself
    from_datetime = extract_timestamp_from_filename(file)
    to_datetime = from_datetime + "10 minutes"
    # other parameters to filter by
    cloud_cover = 40
    # Find product (ids)
    products = dag.search_all(
        productType="S2_MSI_L1C",
        start=from_datetime,
        end=to_datetime,
        geom=extent,
        cloudCover=cloud_cover,
    )
    # Download from google storage
    remote_files = [get_gs_path_from_product_id(p.id) for p in products]
    for remote_file in remote_files:
        local_file = map_to_local_directory(remote_file)
        cmds.append(f"gsutil -m cp -n {remote_file} {local_file}")

# Run all downloads in the background
download_all = " & ".join(cmds)
stream = os.popen(download_all)

def get_gs_path_from_product_id(product_id):
    tile = product_id.split("_")[5]  # e.g. "T37TCN"
    tile_part_1 = tile[1:3]          # "37"
    tile_part_2 = tile[3:4]          # "T"
    tile_part_3 = tile[4:]           # "CN"
    return "gs://gcp-public-data-sentinel-2/tiles/{}/{}/{}/{}.SAFE".format(
        tile_part_1, tile_part_2, tile_part_3, product_id
    )