rioxarray icon indicating copy to clipboard operation
rioxarray copied to clipboard

open_rasterio fails when NetCDF subdataset contains more than three dimensions

Open atedstone opened this issue 3 years ago • 14 comments

Code Sample

Example file: ftp://ftp.climato.be/fettweis/MARv3.11/Greenland/ERA_1958-2019-10km/daily_10km/MARv3.11-20km-daily-ERA-10km-2016.nc

import rioxarray as rxr
ds = rxr.open_rasterio('MARv3.11.2-6km-daily-ERA5-2016.nc')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-a16497ea4e6b> in <module>
----> 1 mar = rxr.open_rasterio('MARv3.11.2-6km-ERA5/MARv3.11.2-6km-daily-ERA5-2016.nc', mask_and_scale=False, parse_coordinates=False)

~/miniconda3/envs/geospatial/lib/python3.7/site-packages/rioxarray/_io.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock, masked, mask_and_scale, variable, group, default_name, **open_kwargs)
    689             lock=lock,
    690             masked=masked,
--> 691             mask_and_scale=mask_and_scale,
    692         )
    693

~/miniconda3/envs/geospatial/lib/python3.7/site-packages/rioxarray/_io.py in _load_subdatasets(riods, group, variable, parse_coordinates, chunks, cache, lock, masked, mask_and_scale)
    507             masked=masked,
    508             mask_and_scale=mask_and_scale,
--> 509             default_name=subdataset.split(":")[-1].lstrip("/").replace("/", "_"),
    510         )
    511         if shape not in dim_groups:

~/miniconda3/envs/geospatial/lib/python3.7/site-packages/rioxarray/_io.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock, masked, mask_and_scale, variable, group, default_name, **open_kwargs)
    755         data = indexing.MemoryCachedArray(data)
    756
--> 757     result = DataArray(
    758         data=data, dims=(coord_name, "y", "x"), coords=coords, attrs=attrs, name=da_name
    759     )

~/miniconda3/envs/geospatial/lib/python3.7/site-packages/xarray/core/dataarray.py in __init__(self, data, coords, dims, name, attrs, indexes, fastpath)
    342             data = _check_data_shape(data, coords, dims)
    343             data = as_compatible_data(data)
--> 344             coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
    345             variable = Variable(dims, data, attrs, fastpath=True)
    346             indexes = dict(

~/miniconda3/envs/geospatial/lib/python3.7/site-packages/xarray/core/dataarray.py in _infer_coords_and_dims(shape, coords, dims)
    145                 "coordinate %s has dimensions %s, but these "
    146                 "are not a subset of the DataArray "
--> 147                 "dimensions %s" % (k, v.dims, dims)
    148             )
    149

ValueError: coordinate SECTOR1_1 has dimensions ('SECTOR1_1',), but these are not a subset of the DataArray dimensions ('TIME', 'y', 'x')

Problem description

open_rasterio fails to load subdatasets (DataArrays?) from NetCDFs with more than 1 additional dimension to X, Y. Expressed alternatively, the maximum number of dimensions is 3 (e.g. TIME, y, x). This failure halts all further loading of the NetCDF file, in effect rendering the NetCDF unopenable by open_rasterio.

Expected Output

With pure xarray:

data = xr.open_dataset('MARv3.11.2-6km-daily-ERA5-2016.nc')

<xarray.Dataset>
Dimensions:    (ATMLAY3_3: 1, SECTOR: 2, SECTOR1_1: 1, TIME: 366, X12_251: 240, Y20_465: 446)
Coordinates:
  * X12_251    (X12_251) float32 -756.00006 -750.00006 ... 672.00006 678.00006
  * Y20_465    (Y20_465) float32 -1188.0001 -1182.0001 ... 1476.0001 1482.0001
  * SECTOR     (SECTOR) float32 1.0 2.0
  * TIME       (TIME) datetime64[ns] 2016-01-01T12:00:00 ... 2016-12-31T12:00:00
  * SECTOR1_1  (SECTOR1_1) float32 1.0
  * ATMLAY3_3  (ATMLAY3_3) float32 0.99974793
Data variables:
    LON        (Y20_465, X12_251) float32 ...
    LAT        (Y20_465, X12_251) float32 ...
    SH         (Y20_465, X12_251) float32 ...
    SRF        (Y20_465, X12_251) float32 ...
    SLO        (Y20_465, X12_251) float32 ...
    CZ         (Y20_465, X12_251) float32 ...
    SAL        (Y20_465, X12_251) float32 ...
    VEG        (SECTOR, Y20_465, X12_251) float32 ...
    FRV        (SECTOR, Y20_465, X12_251) float32 ...
    AREA       (Y20_465, X12_251) float64 ...
    MSK        (Y20_465, X12_251) float32 ...
    DATE       (TIME) float32 ...
    cnst       float64 ...
    MM         (TIME) float32 ...
    DD         (TIME) float32 ...
    SMB        (TIME, SECTOR1_1, Y20_465, X12_251) float32 ...
    RU         (TIME, SECTOR1_1, Y20_465, X12_251) float32 ...
    SF         (TIME, Y20_465, X12_251) float32 ...
    RF         (TIME, Y20_465, X12_251) float32 ...
    ME         (TIME, SECTOR1_1, Y20_465, X12_251) float32 ...
    TT         (TIME, ATMLAY3_3, Y20_465, X12_251) float32 ...
    QQ         (TIME, ATMLAY3_3, Y20_465, X12_251) float32 ...
    UV         (TIME, ATMLAY3_3, Y20_465, X12_251) float32 ...
    SP         (TIME, Y20_465, X12_251) float32 ...
    SWD        (TIME, Y20_465, X12_251) float32 ...
    LWD        (TIME, Y20_465, X12_251) float32 ...
    SHF        (TIME, Y20_465, X12_251) float32 ...
    LHF        (TIME, Y20_465, X12_251) float32 ...
    AL2        (TIME, SECTOR1_1, Y20_465, X12_251) float32 ...
    ST2        (TIME, SECTOR1_1, Y20_465, X12_251) float32 ...
Attributes:
    history:      FERRET V7.43 (optimized)  9-Jun-20FERRET V7.43 (optimized) ...
    Conventions:  CF-1.6

Environment Information

rioxarray (0.1.0) deps:
  rasterio: 1.1.7
    xarray: 0.16.1
      GDAL: 3.1.3

Other python deps:
     scipy: 1.5.2
    pyproj: 2.6.1.post1

System:
    python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)  [GCC 7.3.0]
executable: /home/tedstona/miniconda3/envs/geospatial/bin/python
   machine: Linux-3.10.0-1127.el7.x86_64-x86_64-with-centos-7.8.2003-Core

Installation method

conda

Conda environment information (if you installed with conda):


Environment (conda list):
gdal                      3.1.3            py37h518339e_0    conda-forge
libgdal                   3.1.3                h670eac6_0    conda-forge
rasterio                  1.1.7            py37ha3d844c_0    conda-forge
rioxarray                 0.1.0                      py_0    conda-forge
xarray                    0.16.1                     py_0    conda-forge


Details about conda and system ( conda info ):
$ conda info

     active environment : geospatial
    active env location : /home/tedstona/miniconda3/envs/geospatial
            shell level : 2
       user config file : /home/tedstona/.condarc
 populated config files : /home/tedstona/.condarc
          conda version : 4.8.3
    conda-build version : not installed
         python version : 3.7.3.final.0
       virtual packages : __glibc=2.17
       base environment : /home/tedstona/miniconda3  (writable)
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /home/tedstona/miniconda3/pkgs
                          /home/tedstona/.conda/pkgs
       envs directories : /home/tedstona/miniconda3/envs
                          /home/tedstona/.conda/envs
               platform : linux-64
             user-agent : conda/4.8.3 requests/2.23.0 CPython/3.7.3 Linux/3.10.0-1127.el7.x86_64 centos/7.8.2003 glibc/2.17
                UID:GID : 23002:23000
             netrc file : None
           offline mode : False

atedstone avatar Oct 21 '20 15:10 atedstone

This is currently not supported, but not against rioxarray adding support for it. The reason it currently does not exist is that it adds a bit of complexity when you have more than 3 dimensions: https://gdal.org/drivers/raster/netcdf.html#dimension

The dimensions inside the NetCDF file use the following rules: (Z,Y,X). If there are more than 3 dimensions, the driver will merge them into bands. For example if you have an 4 dimension arrays of the type (P, T, Y, X). The driver will multiply the last 2 dimensions (P*T). The driver will display the bands in the following order. It will first increment T and then P. Metadata will be displayed on each band with its corresponding T and P values.

Until support is added, I recommend opening your file with xarray.open_dataset. If this is something you would like to tackle, feel free to take a stab at it.

snowman2 avatar Oct 21 '20 15:10 snowman2

I think this would be simplified if rasterio added support for multidimensional arrays: https://gdal.org/drivers/raster/netcdf.html#multidimensional-api-support, however support for this hasn't been added yet and it doesn't appear to be on the roadmap.

snowman2 avatar Oct 21 '20 15:10 snowman2

Thanks for your informative replies. I'll think about next steps - I'd forgotten the nuances of GDAL in only supporting a subset of netCDF functionality.

This driver is intended only for importing remote sensing and geospatial datasets in form of raster images. If you want explore all data contained in NetCDF file you should use another tools. https://gdal.org/drivers/raster/netcdf.html

The background here is that I want to use rioxarray's clip() functionality on this and other netCDF files. But perhaps rioxarray is not the best way to achieve this.

atedstone avatar Oct 22 '20 07:10 atedstone

The background here is that I want to use rioxarray's clip() functionality on this and other netCDF files. But perhaps rioxarray is not the best way to achieve this.

There is some good news and some bad news here.

  • The good news is that the clip() and other functionality of rioxarray should work regardless of how you open the file. You just have to import rioxarray for the extension to be loaded.
  • The bad news is that it appears that you have only 2D latitude and longitude (LAT & LON). There are solutions for this in #135 and #114. But, rioxarray doesn't do much for you in those scenarios.

snowman2 avatar Oct 22 '20 13:10 snowman2

Looks like your x and y coordinates are in some kind of projected coordinate system:

>>> xds.X10_153
<xarray.DataArray 'X10_153' (X10_153: 144)>
array([-760., -750., -740., -730., -720., -710., -700., -690., -680., -670.,
       -660., -650., -640., -630., -620., -610., -600., -590., -580., -570.,
       -560., -550., -540., -530., -520., -510., -500., -490., -480., -470.,
       -460., -450., -440., -430., -420., -410., -400., -390., -380., -370.,
       -360., -350., -340., -330., -320., -310., -300., -290., -280., -270.,
       -260., -250., -240., -230., -220., -210., -200., -190., -180., -170.,
       -160., -150., -140., -130., -120., -110., -100.,  -90.,  -80.,  -70.,
        -60.,  -50.,  -40.,  -30.,  -20.,  -10.,    0.,   10.,   20.,   30.,
         40.,   50.,   60.,   70.,   80.,   90.,  100.,  110.,  120.,  130.,
        140.,  150.,  160.,  170.,  180.,  190.,  200.,  210.,  220.,  230.,
        240.,  250.,  260.,  270.,  280.,  290.,  300.,  310.,  320.,  330.,
        340.,  350.,  360.,  370.,  380.,  390.,  400.,  410.,  420.,  430.,
        440.,  450.,  460.,  470.,  480.,  490.,  500.,  510.,  520.,  530.,
        540.,  550.,  560.,  570.,  580.,  590.,  600.,  610.,  620.,  630.,
        640.,  650.,  660.,  670.], dtype=float32)
Coordinates:
  * X10_153  (X10_153) float32 -760.0 -750.0 -740.0 -730.0 ... 650.0 660.0 670.0
Attributes:
    units:          km
    long_name:      x
    point_spacing:  even
    axis:           X

In this scenario, if you know what the CRS is of the projected data, you could use rio.clip or rio.clip_box if you assign the CRS to the dataset using rio.write_crs first.

snowman2 avatar Oct 22 '20 13:10 snowman2

Excellent. After some more digging, I've also seen your Stack Exchange answer which shows the use of write_crs in more detail. I can see about adding a page to the rasterio documentation which makes it clear that existing xarray Datasets can have a CRS added post-hoc.

atedstone avatar Oct 22 '20 15:10 atedstone

The model outputs that I've linked to in this issue do indeed have some peculiar properties, amongst which are insufficient attributes for most software to understand their georeferencing; units in kms; and X and Y names set based on clipping of the model domain in post-processing. So it's probably unreasonable to expect such files to ever be read in automatically by rioxarray.

atedstone avatar Oct 22 '20 16:10 atedstone

Related: https://github.com/rasterio/rasterio/issues/1759

snowman2 avatar Oct 27 '22 03:10 snowman2

I think this is related, but in this case these "4D arrays" are degenerate in z and t, they are really 2D grids with a time stamp. There's a different error for the remote file vs local:

## remote 
import rioxarray
dsn = "NETCDF:/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202202/oisst-avhrr-v02r01.20220218.nc:sst"

rioxarray.open_rasterio(dsn)
KeyError: 'NETCDF_DIM_time_VALUES'


## local
import rioxarray
dsn = "NETCDF:oisst-avhrr-v02r01.20220218.nc:sst"
rioxarray.open_rasterio(dsn)

ValueError: coordinate zlev has dimensions ('zlev',), but these are not a subset of the DataArray dimensions ('time', 'y', 'x')

I'm certainly interested to have this case work in rioxarray and will explore it. Would it be acceptable to allow these? I've been down this path before (in commercial and open software) where the GDAL band metadata is used to re-store the dimensionality, but it's a mess and I don't think it's worth it given the xarray facilities and GDAL multidim that exist now.

I'm not interested in the multidimensional possibilities, I just want that obvious case where sequential bands or sequential files are lined up as "time", or perhaps "z". (If rioxarray won't work with them I'll probably make a copy as COGs and use that).

To show that GDAL sees these degenerate dims:

dataset = "/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202202/oisst-avhrr-v02r01.20220218.nc"
from osgeo import gdal
gdal.Open(dataset).GetMetadata("SUBDATASETS")

gdal.Open(dataset).GetMetadata("SUBDATASETS")["SUBDATASET_1_DESC"]
'[1x1x720x1440] sst (16-bit integer)'

rasterio works with them fine


from rasterio.windows import Window
import rasterio
dsn = "NETCDF:oisst-avhrr-v02r01.20220218.nc:sst"
ds = rasterio.open(dsn)
ds.read(window = Window(0, 0, 2, 5))

dsn = "NETCDF:/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202202/oisst-avhrr-v02r01.20220218.nc:sst"
ds = rasterio.open(dsn)
ds.read(window = Window(0, 0, 2, 5))

mdsumner avatar Jan 07 '24 00:01 mdsumner

sure enough, so at line 1234 of rioxarray/rioxarray/_io.py I see

(1, 720, 1440)
dict_keys(['time', 'zlev', 'x', 'y'])
<xarray.IndexVariable 'zlev' (zlev: 1)>
array([0])
<xarray.IndexVariable 'time' (time: 1)>
array([16119])```

so, I think in this case it could unproblematically drop the zlev, but I can't see yet how to identify that it's the dim being left out. Will keep pursuing, I'm almost entirely unexperienced with python debugging but happy with progress so far 🙏

mdsumner avatar Jan 07 '24 02:01 mdsumner

candidate fix here, at least for my issue: https://github.com/mdsumner/rioxarray/tree/fix-degen-dims

it works for the file locally, but there's still that error for the remote dsn which I'll continue with.

if reasonable I'll PR it - I think the linting is right, it will take me a while longer to get tests going. Thanks!

note that since GDAL 3.7 we can add this config to get the crs assumed under reasonable situations:

import rioxarray
import rasterio
dsn = "NETCDF:oisst-avhrr-v02r01.20220218.nc:sst"
with rasterio.Env(GDAL_NETCDF_ASSUME_LONGLAT="YES"):
    x = rioxarray.open_rasterio(dsn)

x.spatial_ref
<xarray.DataArray 'spatial_ref' ()>
array(0)
Coordinates:
    spatial_ref  int64 0
Attributes:
    crs_wkt:                      GEOGCS["WGS 84 (CRS84)",DATUM["WGS_1984",SP...
...

@atedstone is it possible to get your file listed above still? I'd like to explore

mdsumner avatar Jan 07 '24 03:01 mdsumner

@mdsumner The file I linked to above is no longer available, but this one is broadly equivalent:

ftp://ftp.climato.be/fettweis/MARv3.14/Greenland/ERA5-15km/MARv3.14.0-15km-daily-ERA5-2023.nc

atedstone avatar Jan 08 '24 09:01 atedstone

I think this is related, but in this case these "4D arrays" are degenerate in z and t, they are really 2D grids with a time stamp.

This specific use case should be simple to support as it removed the complexity mentioned here: https://github.com/corteva/rioxarray/issues/174#issuecomment-713666696

However, it would need to throw errors if the extra dimensions have a size greater than 1 until the reading capability is updated to support that.

snowman2 avatar Jan 08 '24 17:01 snowman2

I also encountered this issue. Since I have a very small dataset with 1D coordinates (but 4 dimensions), this simple reproducible example might help.

import rioxarray
import xarray as xr

file_nc = "cmems_so_2022-11-02.nc"
file_nc_reduced = file_nc.replace(".nc","_reduced.nc")

# create a dataset with the depth dimension reduced (only time/latitude/longitude remain)
ds = xr.open_dataset(file_nc)
ds_reduced = ds.max(dim="depth")
ds_reduced.to_netcdf(file_nc_reduced)

# this reduced dataset works
raster_red = rioxarray.open_rasterio(file_nc_reduced)

# the original dataset raises "ValueError: conflicting sizes for dimension 'time': length 50 on the data but length 1 on coordinate 'time'"
raster_full = rioxarray.open_rasterio(file_nc)

What might at least be useful is to improve the ValueError, since to me it was not clear that there was one dimension too much. It looked more like a mismatch in dimension orders. After trying transposing and swap_dims, I got the idea to reduce the depth dimension (with max, which is arbitrary in this case).

The ds looks like this:

<xarray.Dataset>
Dimensions:    (depth: 50, latitude: 11, longitude: 10, time: 1)
Coordinates:
  * depth      (depth) float32 0.494 1.541 2.646 ... 5.275e+03 5.728e+03
  * latitude   (latitude) float32 -1.583 -1.5 -1.417 ... -0.9167 -0.8333 -0.75
  * longitude  (longitude) float32 116.4 116.5 116.6 116.7 ... 117.0 117.1 117.2
  * time       (time) datetime64[ns] 2022-11-02
Data variables:
    so         (time, depth, latitude, longitude) float32 nan nan ... nan nan

The ds_reduced looks like this:

<xarray.Dataset>
Dimensions:    (latitude: 11, longitude: 10, time: 1)
Coordinates:
  * latitude   (latitude) float32 -1.583 -1.5 -1.417 ... -0.9167 -0.8333 -0.75
  * longitude  (longitude) float32 116.4 116.5 116.6 116.7 ... 117.0 117.1 117.2
  * time       (time) datetime64[ns] 2022-11-02
Data variables:
    so         (time, latitude, longitude) float32 nan nan 32.86 ... nan nan nan

Example dataset (extension changed to enable uploading): cmems_so_2022-11-02.nc.txt

veenstrajelmer avatar Apr 03 '24 14:04 veenstrajelmer