rioxarray
rioxarray copied to clipboard
Delayed/chunked opening (sentinel) SAFE data with bands as variables fails
Code Sample
Here, a cropped S2B_MSIL2A.SAFE.zip, containing only two 60 m bands.
import rioxarray
path_to_safe_zip = `S2B_MSIL2A.SAFE.zip`
safe_name = path_to_safe_zip.replace('.zip', '')
sds_path = 'SENTINEL2_L2A:/vsizip//{path_to_safe_zip}/{safe_name}/MTD_MSIL2A.xml:60m:EPSG_32611'
rds = rioxarray.open_rasterio(path, chunks=True, band_as_variable=True)
diff = (rds['band_1'] - rds['band_2']).compute()
Less abstract:
import rasterio
import rioxarray
from xarray.backends.file_manager import CachingFileManager,
from xarray.backends.locks import SerializableLock
RASTERIO_LOCK = SerializableLock()
manager = CachingFileManager(rasterio.open, path, mode='r', lock=RASTERIO_LOCK)
riods = manager.acquire()
dsr1 = rioxarray._io.SingleBandDatasetReader(riods, 0)
b1 = rioxarray.open_rasterio(dsr1, chunks=True).squeeze().drop_vars('band')
dsr2 = rioxarray._io.SingleBandDatasetReader(riods, 1)
b2 = rioxarray.open_rasterio(dsr2, chunks=True).squeeze().drop_vars('band')
diff = (b1-b2).compute()
Problem description
rioxarray fails to load subdatasets of sentinel2 L2A granules correctly when opened delayed/chunked and with band_as_variable=True.
Sentinel2 L2A is distributed in the SAFE format, described here. These granules contain 4 subdatasets; one for each band (10 m, 20 m, 60 m) and a (virtual) 10 m true color image (TCI) dataset. When pointed at the zipped SAFE folder, rioxarray loads 3 subdatasets (the 10 m subdataset is overwritten by the TCI sudataset, but that is a separate issue) and ignores the band_as_variable argument.
We can also point rioxarray.open_rasterio() directly to a subdataset by prepending SENTINEL2_L2A:/ and appending the subdataset's name, e.g. /MTD_MSIL2A.xml:60m:EPSG_32611 to the SAFE path.
When loading the granules with chunks=True and band_as_variable=True and performing computations on the delayed variable arrays (before loading them into memory), the data gets corrupted in the sense that all variable data is pointing to the last referenced variable data. This seems to be caused by the fact that the SingleBandDatasetReader instances of each variable is sharing the same name property (i.e. the path of the dataset), leaving dask to be unable to distinguish between them when loading the data.
Providing a disambiguation e.g. by appending the band id (bidx) to the name property of the SingleBandDatasetReader in rioxarray._io solves the issue:
@property
def name(self):
"""
str: name of the dataset. Usually the path.
"""
if isinstance(self._riods, rasterio.vrt.WarpedVRT):
return self._riods.src_dataset.name + "-" + str(self._bidx)
return self._riods.name + "-" + str(self._bidx)
Environment Information
rioxarray (0.15.2.dev0) deps:
rasterio: 1.3.9
xarray: 2024.1.1
GDAL: 3.8.4
GEOS: 3.12.1
PROJ: 9.3.1
PROJ DATA: /Users/griessban/Library/Application Support/proj:/Users/griessban/miniconda3/envs/rioxarray/share/proj:/Users/griessban/miniconda3/envs/rioxarray/share/proj
GDAL DATA: /Users/griessban/miniconda3/envs/rioxarray/share/gdal
Other python deps:
scipy: 1.12.0
pyproj: 3.6.1
System:
python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:01:35) [Clang 16.0.6 ]
executable: /Users/griessban/miniconda3/envs/rioxarray/bin/python
machine: macOS-14.4-arm64-arm-64bit
Thanks for the analysis and a solution. A MR with the fix is welcome.