
Dimension error when accessing OPeNDAP data

Open zxdawn opened this issue 9 months ago • 6 comments

What happened?

I was trying to access the VIIRS L1 data via xarray, but I got a dimension error.

What did you expect to happen?

I expected it to work, as it does with pydap directly:

from pydap.net import create_session
from pydap.client import open_url

my_session = create_session()
url_VNP02IMG = 'https://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/5200/VNP02IMG/2024/001/VNP02IMG.A2024001.0000.002.2024006095039.nc'

ds_VNP02IMG = open_url(url_VNP02IMG, session=my_session, protocol='dap4')

Minimal Complete Verifiable Example

import xarray as xr
from pydap.net import create_session

my_session = create_session()
url_VNP02IMG = 'https://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/5200/VNP02IMG/2024/001/VNP02IMG.A2024001.0000.002.2024006095039.nc'

ds_pydap = xr.open_dataset(url_VNP02IMG, session=my_session, engine="pydap", decode_times=False, decode_cf=False)

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 ds_pydap = xr.open_dataset(url_VNP02IMG, session=my_session, engine="pydap", decode_times=False, decode_cf=False)

File ~/opt/miniconda3/envs/viirs/lib/python3.12/site-packages/xarray/backends/api.py:687, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    675 decoders = _resolve_decoders_kwargs(
    676     decode_cf,
    677     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)    683     decode_coords=decode_coords,
    684 )
    686 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 687 backend_ds = backend.open_dataset(
    688     filename_or_obj,
    689     drop_variables=drop_variables,
    690     **decoders,
    691     **kwargs,
    692 )
    693 ds = _dataset_from_backend_dataset(
    694     backend_ds,
    695     filename_or_obj,
   (...)    705     **kwargs,
    706 )
    707 return ds
...
    511     )
    512 if len(set(dims)) < len(dims):
    513     repeated_dims = {d for d in dims if dims.count(d) > 1}

ValueError: dimensions () must have the same length as the number of data dimensions, ndim=1

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None python: 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:19:24) [Clang 18.1.8 ] python-bits: 64 OS: Darwin OS-release: 24.5.0 machine: arm64 processor: arm byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None

xarray: 2025.4.0 pandas: 2.2.3 numpy: 2.2.6 scipy: 1.15.2 netCDF4: None pydap: 3.5.5 h5netcdf: None h5py: None zarr: 2.18.7 cftime: None nc_time_axis: None iris: None bottleneck: None dask: 2025.5.1 distributed: 2025.5.1 matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: 2025.5.1 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 80.9.0 pip: 25.1.1 conda: None pytest: None mypy: None IPython: 9.3.0 sphinx: None

zxdawn avatar Jun 04 '25 13:06 zxdawn

cc @Mikejmnez

dcherian avatar Jun 04 '25 13:06 dcherian

Hi @zxdawn I took a look at that dataset. Indeed I can access it with pydap, and if you print the tree it shows that there are no variables at the root level, but there are 2 Groups. You can also see this in the browser by appending a .dmr to your URL (here - may require EDL)

py_ds.tree()
.VNP02IMG.A2024001.0000.002.2024006095039.nc
├──scan_line_attributes
│  ├──scan_start_time
│  ├──scan_end_time
│  ├──scan_state_flags
│  ├──ev_mid_time
│  └──scan_quality_flags
└──observation_data
   ├──I02_uncert_index
   ├──I03_quality_flags
   ├──I04
   ├──I04_quality_flags
   ├──I05_brightness_temperature_lut
   ├──I03_uncert_index
   ├──I02
   ├──I02_quality_flags
   ├──I04_brightness_temperature_lut
   ├──I03
   ├──I01_quality_flags
   ├──I05
   ├──I01
   ├──I04_uncert_index
   ├──I05_uncert_index
   ├──I05_quality_flags
   └──I01_uncert_index

You correctly set protocol='dap4' with pydap directly, which enables Groups. When using pydap behind xarray, the way to set the DAP4 protocol is to replace the scheme of the URL with dap4 (so "dap4://"); this approach is widely used across APIs. Without dap4 set as the protocol, pydap falls back to DAP2 for backwards compatibility with older OPeNDAP servers (anything DAP2 is translatable to DAP4, so a DAP4 server will produce a DAP2 response). DAP2 does not know about hierarchy (Groups) in files, and that is likely why you get that error.
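To illustrate, the scheme swap is a plain string substitution (a minimal sketch; the URL is the one from the original post):

```python
# Replace the https:// scheme with dap4:// so that pydap (behind xarray)
# negotiates the DAP4 protocol, which understands Groups.
url = ('https://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/'
       'laads/allData/5200/VNP02IMG/2024/001/'
       'VNP02IMG.A2024001.0000.002.2024006095039.nc')
dap4_url = 'dap4://' + url.removeprefix('https://')
```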

Finally, because there are groups, the correct approach is to use xarray's DataTree.

working example

url_VNP02IMG = 'dap4://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/5200/VNP02IMG/2024/001/VNP02IMG.A2024001.0000.002.2024006095039.nc'


ds = xr.open_datatree(url_VNP02IMG, session=my_session, engine="pydap", decode_times=False, decode_cf=False)

Alternatively, if you are only interested in a single group, you can modify your url (add a constraint expression), or define the Group of interest when creating the xarray dataset.

For example, say you are only interested in the group called "observation_data":

ds = xr.open_dataset(url_VNP02IMG, group='observation_data', session=my_session, engine="pydap", decode_times=False, decode_cf=False)

that should work

Mikejmnez avatar Jun 04 '25 15:06 Mikejmnez

Thanks a lot @Mikejmnez! The dap4 method works well! BTW, is it possible to download VIIRS L1 data for a subset region? Since the L1 data doesn't include lon and lat as dimensions, the usual subset method wouldn't work. Or maybe we can calculate the index subset ourselves and add it to the URL?

Method 1 (using GEO file)

I'm thinking about using the GEO file to mask the data like this:

url_VNP03IMG = 'dap4://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/5200/VNP03IMG/2024/001/VNP03IMG.A2024001.0000.002.2024006071344.nc'

ds_VNP03IMG = xr.open_dataset(url_VNP03IMG, group='geolocation_data', session=my_session, engine="pydap", decode_times=False, decode_cf=False)

lat_target = 14.803078
lon_target = -17.31260

mask = ((ds_VNP03IMG['longitude'] > lon_target - 0.5) & (ds_VNP03IMG['longitude'] < lon_target + 0.5)
        & (ds_VNP03IMG['latitude'] > lat_target - 0.5) & (ds_VNP03IMG['latitude'] < lat_target + 0.5))

However, creating the mask takes around 25 s, which is almost half the time it takes to access the whole VNP02IMG dataset.
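For what it's worth, once such a mask exists it can be applied with where(..., drop=True) to shrink the swath to the bounding box. A sketch on synthetic 2D lat/lon arrays (the real ones sit behind authentication); the names mirror the snippet above:

```python
import numpy as np
import xarray as xr

# Synthetic 2D swath geolocation standing in for VNP03IMG's geolocation_data
# (the real latitude/longitude are 2D arrays over scan lines and pixels).
lat = xr.DataArray(np.linspace(10, 20, 50).reshape(10, 5), dims=('line', 'pixel'))
lon = xr.DataArray(np.linspace(-20, -15, 50).reshape(10, 5), dims=('line', 'pixel'))

lat_target, lon_target = 14.803078, -17.31260
mask = ((lon > lon_target - 2) & (lon < lon_target + 2)
        & (lat > lat_target - 2) & (lat < lat_target + 2))

# Apply the mask to a stand-in radiance variable and drop the
# scan lines / pixel columns that are entirely outside the box.
data = xr.Dataset({'I01': (('line', 'pixel'), np.arange(50.0).reshape(10, 5))})
subset = data.where(mask, drop=True)
```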

Method 2 (accessing necessary variables)

url_VNP02IMG = 'dap4://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/5200/VNP02IMG/2024/001/VNP02IMG.A2024001.0000.002.2024006095039.nc'
CE = "?dap4.ce=/I01"
url_VNP02IMG_var = url_VNP02IMG + CE

ds_var = open_url(url_VNP02IMG_var, session=my_session, protocol='dap4')
ds_var = xr.open_dataset(url_VNP02IMG_var, group='observation_data', session=my_session, engine="pydap", decode_times=False, decode_cf=False)

It seems the url doesn't work ...

zxdawn avatar Jun 05 '25 02:06 zxdawn

@zxdawn glad it is working! Level 1 data can be tricky, and in this scenario I don't know what the best answer is without spending time playing with the dataset. I have a feeling that there will be some performance gain in masking (you'll probably have to do this for each swath, since lat/lon vary across granules), but the biggest gain will come from pre-selecting the variables of interest.

(Also, thanks for taking the time looking at the pydap docs. They need some updating :) !)

Method 1

I like your approach, and indeed things slow down when you mask different variables. The timing you are seeing is likely due to xarray+pydap downloading the 2 arrays (lat and lon) into memory (behind authentication), which can be slower than downloading the metadata alone (i.e. the time it takes to generate the xarray dataset in the first place). So there are two different costs.

Method 2

If you add the group into your URL, things should work. Groups act like folders, and a variable is like a file inside a folder (that is one reason behind the use of '/' in DAP4). So the CE in your URL should specify the variable's location relative to the root. That is:

CE = "?dap4.ce=/observation_data/I01" # I01 is inside observation_data

Then you use xarray and instruct it to look inside the group:

ds = xr.open_dataset(url_VNP02IMG_var, session=my_session, group='observation_data', engine='pydap', decode_times=False, decode_cf=False)

That should substantially cut the time to generate the metadata.

Method 3


url_geo = "dap4://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/5200/VNP03IMG/2024/001/VNP03IMG.A2024001.0000.002.2024006071344.nc"
geo_ce = "?dap4.ce=/geolocation_data/longitude;/geolocation_data/latitude"
url_geo_ce = url_geo + geo_ce

data = xr.open_dataset(url_VNP02IMG_var, session=my_session, group='observation_data', engine='pydap', decode_times=False, decode_cf=False)
grid = xr.open_dataset(url_geo_ce, group='geolocation_data', session=my_session, engine='pydap', decode_times=False, decode_cf=False)
ds = xr.merge([data, grid]).set_coords(['latitude', 'longitude'])

The resulting ds only has I01, latitude and longitude. You can then use xarray to mask the entire dataset based on the values of latitude and longitude. It will be faster than opening the entire dataset and then dropping variables, but masking still requires downloading 2 arrays into memory and then masking them. Since the download happens in parallel, and in this scenario there are only 3 arrays, it may take about the same amount of time as downloading the entire ds...
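On the earlier question about calculating the index subset yourself: once the mask is built, you can reduce it to integer bounds and splice those into the constraint expression as a hyperslab. A sketch with synthetic arrays; the [start:stride:stop] index syntax is my recollection of the DAP4 CE dialect, so verify it against the server (e.g. by inspecting the .dmr) before relying on it:

```python
import numpy as np

# Synthetic 2D geolocation standing in for the VNP03IMG latitude/longitude.
lat = np.linspace(10, 20, 50).reshape(10, 5)
lon = np.linspace(-20, -15, 50).reshape(10, 5)

lat_target, lon_target = 14.803078, -17.31260
mask = ((lon > lon_target - 2) & (lon < lon_target + 2)
        & (lat > lat_target - 2) & (lat < lat_target + 2))

# Bounding box of the True region, as inclusive index ranges per dimension.
lines, pixels = np.nonzero(mask)
i0, i1 = lines.min(), lines.max()
j0, j1 = pixels.min(), pixels.max()

# Assumed DAP4 hyperslab syntax: [start:stride:stop] with an inclusive stop.
CE = f"?dap4.ce=/observation_data/I01[{i0}:1:{i1}][{j0}:1:{j1}]"
```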

Mikejmnez avatar Jun 05 '25 04:06 Mikejmnez

Oh, and if you use

my_session=create_session(use_cache=True) 

then, if you decide to figure out a subset using xarray or pydap, the lat and lon values will be cached, which can in some cases avoid re-downloading the same data multiple times (and you can even use token auth, which can avoid some redirects as well).

Mikejmnez avatar Jun 05 '25 04:06 Mikejmnez

Thanks a lot @Mikejmnez ! use_cache indeed saves a lot of time when rerunning the code. The token auth doesn't improve the speed for the VIIRS case.

Regarding your Method 3, it's really fast (~4 s). However, loading the data with ds['longitude'].load() is still slow (~30 s), which explains why creating a mask is slow.

One interesting thing I found is that if I search and download both files using earthaccess, it takes only 20 s. Perhaps it applies parallel downloading in the background.


Oh, it seems the data sizes are different:

http_cache.sqlite is the pydap cache, while the two nc files were downloaded by earthaccess.

from pydap.net import create_session
from pydap.client import open_url
import xarray as xr
import earthaccess

# ---- earthaccess ---- #
short_name_L1 = ['VNP02IMG', 'VNP03IMG']
startDate = '2024-01-01 00:00:00'
endDate = '2024-01-01 10:59:59'
lat_target = 14.803078
lon_target = -17.31260

search_params = {
    "short_name": short_name_L1,
    "temporal": (startDate, endDate),
    "point": (lon_target, lat_target),
}

results = earthaccess.search_data(**search_params)

# download VNP02IMG and VNP03IMG for one datetime
data_comb = [results[0], results[1]]

download_files = earthaccess.download(data_comb, './')

# ---- pydap/xarray ---- #

my_session = create_session(use_cache=True, cache_kwargs={'cache_name': './http_cache'})

url_geo = "dap4://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/5200/VNP03IMG/2024/001/VNP03IMG.A2024001.0236.002.2024006071519.nc"
geo_ce = "?dap4.ce=/geolocation_data/longitude;/geolocation_data/latitude"
url_geo_ce = url_geo + geo_ce

url_VNP02IMG = 'dap4://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/5200/VNP02IMG/2024/001/VNP02IMG.A2024001.0236.002.2024006095448.nc'
CE = "?dap4.ce=/observation_data/I05"
url_VNP02IMG_var = url_VNP02IMG + CE

data = xr.open_dataset(url_VNP02IMG_var, session=my_session, group='observation_data', engine='pydap', decode_times=False, decode_cf=False)
grid = xr.open_dataset(url_geo_ce, group='geolocation_data', session=my_session, engine='pydap', decode_times=False, decode_cf=False)
ds = xr.merge([data, grid]).set_coords(['latitude', 'longitude'])

mask = ((ds['longitude'] > lon_target - 2) & (ds['longitude'] < lon_target + 2)
        & (ds['latitude'] > lat_target - 2) & (ds['latitude'] < lat_target + 2))

zxdawn avatar Jun 05 '25 12:06 zxdawn