kerchunk
Unexpected error message when data source is unreachable
I successfully created a JSON index file for a NetCDF4 dataset using SingleHdf5ToZarr. I wanted to see what would happen if I intentionally changed the templates.u URI reference in the JSON file to a non-existent URI. This would simulate a situation where the dataset is moved after the index is created.
The index file is then read using this code:
import fsspec
import xarray as xr
import matplotlib.pyplot as plt
mapper = fsspec.get_mapper('reference://', fo="output2.json")
ds = xr.open_zarr(mapper, decode_times=False)
subset = ds['hus'].isel(plev=18).isel(time=1)
print(subset)
subset.plot()
plt.show()
I was expecting some kind of "file not found" type of error on line 7 (on open_zarr) but instead I got a ValueError when calling subset.plot() on line 12. The start of the error message is: "ValueError: The input coordinate is not sorted in increasing order along axis 0."
Line 10 prints out the subset object and shows the following:
Coordinates:
* lat (lat) float64 nan nan nan nan nan nan ... nan nan nan nan nan nan
* lon (lon) float64 nan nan nan nan nan nan ... nan nan nan nan nan nan
plev float64 nan
time float64 45.0
So it seems that instead of failing on opening the (unreachable) data file, it's trying to create the subset object but assigning nans to the coordinate values. (Curiously, it's not assigning nan to the time coordinate.)
This is expected behaviour coming from zarr. It assumes that, when fetching a chunk, if there is an error (FileNotFound, typically), then the chunk is missing, and the values should be set to the replacement fill value, nan here.
We do have the opportunity to maybe do better, since in ReferenceFileSystem, we can tell the difference between a missing reference (which is the equivalent of what zarr is after) and a reference that exists but fails to open. Perhaps the latter should raise an exception that is not caught by zarr.
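To make that distinction concrete, here is a minimal stdlib-only sketch. It is not kerchunk's, fsspec's, or zarr's actual code: SketchReferenceStore, UnreachableTargetError, read_chunk and broken_fetch are hypothetical names invented for illustration. It only shows the idea that a key with no reference can become the fill value, while a reference whose target cannot be read raises something the chunk-reading layer does not swallow.

```python
import math


class UnreachableTargetError(RuntimeError):
    """Hypothetical error: the reference exists but its target could not be read."""


class SketchReferenceStore:
    """Toy stand-in for a reference-backed store (not a real kerchunk/fsspec class)."""

    def __init__(self, references, fetch):
        self.references = references  # key -> inline bytes, or (url, offset, length)
        self.fetch = fetch            # callable(url, offset, length) -> bytes

    def __getitem__(self, key):
        if key not in self.references:
            # No reference at all: a genuinely missing chunk.
            raise KeyError(key)
        ref = self.references[key]
        if isinstance(ref, bytes):
            return ref  # data embedded in the index itself
        url, offset, length = ref
        try:
            return self.fetch(url, offset, length)
        except OSError as exc:
            # The reference exists but the target is unreachable: do not
            # disguise this as a missing chunk.
            raise UnreachableTargetError(
                f"{key!r} points at {url!r}, which could not be read"
            ) from exc


def read_chunk(store, key, fill_value=math.nan):
    """Mimic the behaviour described above: only a missing key becomes the fill value."""
    try:
        return store[key]
    except KeyError:
        return fill_value


def broken_fetch(url, offset, length):
    # Simulates a dataset that was moved after the index was created.
    raise FileNotFoundError(url)


store = SketchReferenceStore(
    {"time/0": b"\x00", "lat/0": ("s3://moved/file.nc", 0, 8)},  # made-up entries
    broken_fetch,
)
```

With this toy store, read_chunk(store, "time/0") returns the embedded bytes, read_chunk(store, "plev/0") returns nan because no such reference exists, and read_chunk(store, "lat/0") raises UnreachableTargetError instead of silently producing nan.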
Curiously, it's not assigning nan to the time coordinate.
Presumably this data is embedded in the reference file; but strange that plev isn't (unless its value really is nan).
mapper.fs.references["time/0"] # should be data, not a reference
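The same check can be made on the JSON file directly, without opening the dataset. The sketch below assumes the version 1 reference layout, in which entries sit under a "refs" key and each value is either a string (embedded data) or a list of the form [url, offset, length]. classify_references is a hypothetical helper, and the example index contents are made up.

```python
import json


def classify_references(index):
    """Split reference keys into embedded data and pointers to a data file."""
    refs = index.get("refs", index)  # fall back to a flat mapping if "refs" is absent
    inlined, remote = [], []
    for key, value in refs.items():
        if isinstance(value, (list, tuple)):
            remote.append(key)   # [url, offset, length]: needs the data file
        else:
            inlined.append(key)  # string: stored in the index itself
    return sorted(inlined), sorted(remote)


# Made-up index for illustration; a real one would come from
# json.load(open("output2.json")).
example = json.loads("""
{
  "version": 1,
  "templates": {"u": "https://example.invalid/data.nc"},
  "refs": {
    "time/0": "base64:AAAAAAAAAAA=",
    "lat/0": ["{{u}}", 4096, 1024]
  }
}
""")

inlined, remote = classify_references(example)
print("embedded in index:", inlined)
print("read from data file:", remote)
```

Anything listed as embedded would still be readable after the data file has moved, which would explain a coordinate that keeps its real value.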
Thanks @martindurant. I haven't really got to grips with how this stuff all works under the hood, but could it be that time is embedded in the reference file because it's a shorter dimension than plev? I used inline_threshold=100.
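As a rough sanity check of that guess, assuming the rule is that a chunk is embedded when it is smaller than inline_threshold bytes, and assuming each coordinate is stored as a single uncompressed float64 chunk (I have not verified either for this file), the arithmetic looks like this. The lengths 2 and 19 are hypothetical, chosen only as the smallest sizes consistent with isel(time=1) and isel(plev=18).

```python
def would_inline(n_values, itemsize, inline_threshold):
    """Return True if a chunk of this size falls under the threshold (assumed rule)."""
    chunk_bytes = n_values * itemsize
    return chunk_bytes < inline_threshold


# Hypothetical coordinate lengths; float64 values are 8 bytes each.
time_inlined = would_inline(n_values=2, itemsize=8, inline_threshold=100)   # 16 bytes
plev_inlined = would_inline(n_values=19, itemsize=8, inline_threshold=100)  # 152 bytes

print("time embedded:", time_inlined)
print("plev embedded:", plev_inlined)
```

Under those assumptions time would fall below the threshold and plev would not, which matches what I saw.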
By the way, I have written up my adventures with kerchunk here, in case it helps anybody else: https://github.com/jonblower/jon-kerchunk-test.
Just wanted to note that I ran into this issue recently as well.
This is expected behaviour coming from zarr. It assumes that, when fetching a chunk, if there is an error (FileNotFound, typically), then the chunk is missing, and the values should be set to the replacement fill value, nan here.
FWIW, in my particular case, the error came from my cloud credentials only being valid in us-west-2 while I was running kerchunk outside that AWS region.
We do have the opportunity to maybe do better, since in ReferenceFileSystem, we can tell the difference between a missing reference (which is the equivalent of what zarr is after) and a reference that exists but fails to open. Perhaps the latter should raise an exception that is not caught by zarr.
+1 for this idea. My sense is that raising an error here would be closer to what users expect.