
Unexpected error message when data source is unreachable

Open jonblower opened this issue 2 years ago • 3 comments

I successfully created a JSON index file for a NetCDF4 dataset using SingleHdf5ToZarr. I wanted to see what would happen if I intentionally changed the templates.u URI reference in the JSON file to a non-existent URI. This would simulate a situation where the dataset is moved after the index is created.

The index file is then read using this code:

import fsspec
import xarray as xr
import matplotlib.pyplot as plt

mapper = fsspec.get_mapper('reference://', fo="output2.json")

ds = xr.open_zarr(mapper, decode_times=False)

subset = ds['hus'].isel(plev=18).isel(time=1)
print(subset)

subset.plot()
plt.show()

I was expecting some kind of "file not found" type of error on line 7 (on open_zarr) but instead I got a ValueError when calling subset.plot() on line 12. The start of the error message is: "ValueError: The input coordinate is not sorted in increasing order along axis 0.".

Line 10 prints out the subset object and shows the following:

Coordinates:
  * lat      (lat) float64 nan nan nan nan nan nan ... nan nan nan nan nan nan
  * lon      (lon) float64 nan nan nan nan nan nan ... nan nan nan nan nan nan
    plev     float64 nan
    time     float64 45.0

So it seems that instead of failing on opening the (unreachable) data file, it's trying to create the subset object but assigning nans to the coordinate values. (Curiously, it's not assigning nan to the time coordinate.)

jonblower avatar Jul 19 '22 11:07 jonblower

This is expected behaviour coming from zarr. It assumes that, when fetching a chunk, if there is an error (FileNotFound, typically), then the chunk is missing, and the values should be set to the replacement fill value, nan here.

We do have the opportunity to maybe do better, since in ReferenceFileSystem, we can tell the difference between a missing reference (which is the equivalent of what zarr is after) and a reference that exists but fails to open. Perhaps the latter should raise an exception that is not caught by zarr.
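To make the distinction concrete, here is a minimal pure-Python sketch of the idea. It is not the fsspec or zarr implementation: ToyReferenceStore, UnreachableReferenceError, read_chunk and the strict flag are all hypothetical names invented for illustration, and the bucket URL is made up. The reader mimics the "missing key means fill value" convention described above, and the store either hides an unreachable target behind a KeyError (the behaviour reported here) or raises a separate exception that the reader does not catch.

```python
import math


class UnreachableReferenceError(RuntimeError):
    """Hypothetical error: the reference exists but its target cannot be read."""


class ToyReferenceStore:
    """Toy stand-in for a reference store; NOT the fsspec implementation."""

    def __init__(self, references, reachable_urls, strict):
        self.references = references          # key -> [url, offset, length]
        self.reachable_urls = reachable_urls  # URLs that can actually be opened
        self.strict = strict                  # True: surface unreachable targets

    def __getitem__(self, key):
        if key not in self.references:
            # Genuinely missing chunk: the case a zarr-style reader expects.
            raise KeyError(key)
        url, offset, length = self.references[key]
        if url not in self.reachable_urls:
            if self.strict:
                raise UnreachableReferenceError(f"{key}: cannot open {url}")
            # Lenient mode mimics the behaviour reported in this issue.
            raise KeyError(key)
        return b"\x00" * length  # placeholder bytes instead of real I/O


def read_chunk(store, key, fill_value=math.nan):
    """Mimic a reader that treats a missing key as 'use the fill value'."""
    try:
        return store[key]
    except KeyError:
        return fill_value


# Made-up reference whose target has been moved away.
refs = {"lat/0": ["s3://moved-bucket/data.nc", 1024, 16]}
lenient = ToyReferenceStore(refs, reachable_urls=set(), strict=False)
strict = ToyReferenceStore(refs, reachable_urls=set(), strict=True)

print(read_chunk(lenient, "lat/0"))  # prints nan: the failure is silent

try:
    read_chunk(strict, "lat/0")
except UnreachableReferenceError as err:
    print("raised:", err)
```

In the strict variant a key that is truly absent from the references still yields the fill value, so sparse arrays would keep working; only a reference that exists but cannot be opened becomes a visible error.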

> Curiously, it's not assigning nan to the time coordinate.

Presumably this data is embedded in the reference file; but strange that plev isn't (unless its value really is nan).

mapper.fs.references["time/0"]  # should be data, not a reference

martindurant avatar Jul 19 '22 13:07 martindurant

Thanks @martindurant. I haven't really got to grips with how this stuff all works under the hood, but could it be that time is embedded in the reference file because it's a shorter dimension than plev? I used inline_threshold=100.
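One way to check this guess is to look at what each entry in the reference set actually holds. The sketch below assumes the usual kerchunk JSON layout, where inlined data is stored as a string (optionally prefixed with "base64:") and a pointer is a list of [url] or [url, offset, length]. The describe_reference helper and the example entries are hypothetical; the real values would come from output2.json.

```python
import base64


def describe_reference(entry):
    """Classify one kerchunk-style reference entry (assumed layout, see above)."""
    if isinstance(entry, (str, bytes)):
        if isinstance(entry, str) and entry.startswith("base64:"):
            # Inline binary data is stored base64-encoded after the prefix.
            size = len(base64.b64decode(entry[len("base64:"):]))
        else:
            size = len(entry)
        return f"inline ({size} bytes)"
    if isinstance(entry, (list, tuple)):
        # Pointer into another file; entry[0] is the (possibly templated) URL.
        return f"pointer to {entry[0]}"
    return "unknown"


# Hypothetical entries for illustration only.
example_refs = {
    "time/0": "base64:" + base64.b64encode(bytes(16)).decode(),
    "plev/0": ["{{u}}", 20000, 152],
}
for key, entry in example_refs.items():
    print(key, "->", describe_reference(entry))
```

If time/0 turns out to be inline and plev/0 is a pointer, that would be consistent with the time chunk falling under the 100-byte threshold while the plev chunk does not.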

By the way, I have written up my adventures with kerchunk here, in case it helps anybody else: https://github.com/jonblower/jon-kerchunk-test.

jonblower avatar Jul 20 '22 08:07 jonblower

Just wanted to note that I ran into this issue recently as well

> This is expected behaviour coming from zarr. It assumes that, when fetching a chunk, if there is an error (FileNotFound, typically), then the chunk is missing, and the values should be set to the replacement fill value, nan here.

FWIW, in my particular case, I was getting an error because my cloud credentials were only valid in us-west-2, but I was running kerchunk outside this AWS region.

> We do have the opportunity to maybe do better, since in ReferenceFileSystem, we can tell the difference between a missing reference (which is the equivalent of what zarr is after) and a reference that exists but fails to open. Perhaps the latter should raise an exception that is not caught by zarr.

+1 for this idea -- my sense is that it would be a more expected user experience.

jrbourbeau avatar Aug 21 '23 16:08 jrbourbeau