rioxarray Delayed/chunked opening (sentinel) SAFE data with bands as variables fails

Delayed/chunked opening (sentinel) SAFE data with bands as variables fails

Open NiklasPhabian opened this issue 1 year ago • 1 comments

Code Sample

Here, a cropped S2B_MSIL2A.SAFE.zip, containing only two 60 m bands.

import rioxarray

path_to_safe_zip = `S2B_MSIL2A.SAFE.zip`
safe_name = path_to_safe_zip.replace('.zip', '')
sds_path = 'SENTINEL2_L2A:/vsizip//{path_to_safe_zip}/{safe_name}/MTD_MSIL2A.xml:60m:EPSG_32611'

rds = rioxarray.open_rasterio(path, chunks=True, band_as_variable=True)
diff = (rds['band_1'] - rds['band_2']).compute()

Less abstract:

import rasterio
import rioxarray
from xarray.backends.file_manager import CachingFileManager,
from xarray.backends.locks import SerializableLock

RASTERIO_LOCK = SerializableLock()

manager = CachingFileManager(rasterio.open, path, mode='r', lock=RASTERIO_LOCK)
riods = manager.acquire()

dsr1 = rioxarray._io.SingleBandDatasetReader(riods, 0)
b1 = rioxarray.open_rasterio(dsr1, chunks=True).squeeze().drop_vars('band')

dsr2 = rioxarray._io.SingleBandDatasetReader(riods, 1)
b2 = rioxarray.open_rasterio(dsr2, chunks=True).squeeze().drop_vars('band')
diff = (b1-b2).compute()

Problem description

rioxarray fails to load subdatasets of sentinel2 L2A granules correctly when opened delayed/chunked and with band_as_variable=True.

Sentinel2 L2A is distributed in the SAFE format, described here. These granules contain 4 subdatasets; one for each band (10 m, 20 m, 60 m) and a (virtual) 10 m true color image (TCI) dataset. When pointed at the zipped SAFE folder, rioxarray loads 3 subdatasets (the 10 m subdataset is overwritten by the TCI sudataset, but that is a separate issue) and ignores the band_as_variable argument.

We can also point rioxarray.open_rasterio() directly to a subdataset by prepending SENTINEL2_L2A:/ and appending the subdataset's name, e.g. /MTD_MSIL2A.xml:60m:EPSG_32611 to the SAFE path.

When loading the granules with chunks=True and band_as_variable=True and performing computations on the delayed variable arrays (before loading them into memory), the data gets corrupted in the sense that all variable data is pointing to the last referenced variable data. This seems to be caused by the fact that the SingleBandDatasetReader instances of each variable is sharing the same name property (i.e. the path of the dataset), leaving dask to be unable to distinguish between them when loading the data.

Providing a disambiguation e.g. by appending the band id (bidx) to the name property of the SingleBandDatasetReader in rioxarray._io solves the issue:

    @property
    def name(self):
        """
        str: name of the dataset. Usually the path.
        """
        if isinstance(self._riods, rasterio.vrt.WarpedVRT):
            return self._riods.src_dataset.name + "-" + str(self._bidx)
        return self._riods.name + "-" + str(self._bidx)

Environment Information

rioxarray (0.15.2.dev0) deps:
  rasterio: 1.3.9
    xarray: 2024.1.1
      GDAL: 3.8.4
      GEOS: 3.12.1
      PROJ: 9.3.1
 PROJ DATA: /Users/griessban/Library/Application Support/proj:/Users/griessban/miniconda3/envs/rioxarray/share/proj:/Users/griessban/miniconda3/envs/rioxarray/share/proj
 GDAL DATA: /Users/griessban/miniconda3/envs/rioxarray/share/gdal

Other python deps:
     scipy: 1.12.0
    pyproj: 3.6.1
System:
    python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:01:35) [Clang 16.0.6 ]
executable: /Users/griessban/miniconda3/envs/rioxarray/bin/python
   machine: macOS-14.4-arm64-arm-64bit

Mar 25 '24 23:03 NiklasPhabian

Thanks for the analysis and a solution. A MR with the fix is welcome.

Mar 27 '24 20:03 snowman2

rioxarray rioxarray copied to clipboard

Delayed/chunked opening (sentinel) SAFE data with bands as variables fails

Code Sample

Problem description

Environment Information

rioxarray
rioxarray copied to clipboard