Coordinates lost with GPM-IMERG file
I'm working with the GPM-IMERG files from NASA. Here's an example:

```python
import xarray as xr
import fsspec

url = "https://earthmover-sample-data.s3.us-east-1.amazonaws.com/hdf5/3B-HHR.MS.MRG.3IMERG.19980101-S000000-E002959.0000.V07B.HDF5"
ds = xr.open_dataset(fsspec.open(url).open(), engine="h5netcdf", group="Grid", decode_coords="all")
print(ds)
```
```
<xarray.Dataset> Size: 104MB
Dimensions:                         (time: 1, lon: 3600, lat: 1800, nv: 2,
                                     lonv: 2, latv: 2)
Coordinates:
  * time                            (time) object 8B 1998-01-01 00:00:00
  * lon                             (lon) float32 14kB -179.9 -179.9 ... 179.9
  * lat                             (lat) float32 7kB -89.95 -89.85 ... 89.95
    time_bnds                       (time, nv) object 16B ...
    lon_bnds                        (lon, lonv) float32 29kB ...
    lat_bnds                        (lat, latv) float32 14kB ...
Dimensions without coordinates: nv, lonv, latv
Data variables:
    precipitation                   (time, lon, lat) float32 26MB ...
    randomError                     (time, lon, lat) float32 26MB ...
    probabilityLiquidPrecipitation  (time, lon, lat) float32 26MB ...
    precipitationQualityIndex       (time, lon, lat) float32 26MB ...
Attributes:
    GridHeader:  BinMethod=ARITHMETIC_MEAN;\nRegistration=CENTER;\nLatitudeRe...
```
When opening this dataset virtually, many of the coordinates are lost.

```python
from virtualizarr import open_virtual_dataset

dsv = open_virtual_dataset(url, indexes={}, group="Grid")
print(dsv)
```
This spits out the following warning:

```
/srv/conda/envs/notebook/lib/python3.12/site-packages/kerchunk/hdf.py:547: UserWarning: The following excepion was caught and quashed while traversing HDF5
Can't get fill value (fill value is undefined)
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.12/site-packages/kerchunk/hdf.py", line 363, in _translator
    fill = h5obj.fillvalue
           ^^^^^^^^^^^^^^^
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/srv/conda/envs/notebook/lib/python3.12/site-packages/h5py/_hl/dataset.py", line 622, in fillvalue
    self._dcpl.get_fill_value(arr)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
RuntimeError: Can't get fill value (fill value is undefined)
  warnings.warn(msg)
```
and returns a dataset with most of the coordinates missing:

```
<xarray.Dataset> Size: 91MB
Dimensions:                         (time: 1, lon: 3600, lat: 1800, nv: 2)
Coordinates:
    time                            (time) int32 4B ManifestArray<shape=(1,),...
Dimensions without coordinates: lon, lat, nv
Data variables:
    precipitation                   (time, lon, lat) float32 26MB ManifestArr...
    precipitationQualityIndex       (time, lon, lat) float32 26MB ManifestArr...
    probabilityLiquidPrecipitation  (time, lon, lat) int16 13MB ManifestArray...
    randomError                     (time, lon, lat) float32 26MB ManifestArr...
    time_bnds                       (time, nv) int32 8B ManifestArray<shape=(...
Attributes:
    GridHeader:   BinMethod=ARITHMETIC_MEAN;\nRegistration=CENTER;\nLatitudeR...
    coordinates:  time
```
I also tried with the new kerchunk-free backend and got an error:

```python
from virtualizarr.readers.hdf import HDFVirtualBackend

open_virtual_dataset(url, indexes={}, group="Grid", drop_variables=["Intermediate"], backend=HDFVirtualBackend)
```

```
File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/readers/hdf/hdf.py:129, in HDFVirtualBackend._dataset_chunk_manifest(path, dataset)
    127 num_chunks = dsid.get_num_chunks()
    128 if num_chunks == 0:
--> 129     raise ValueError("The dataset is chunked but contains no chunks")
    130 shape = tuple(
    131     math.ceil(a / b) for a, b in zip(dataset.shape, dataset.chunks)
    132 )
    133 paths = np.empty(shape, dtype=np.dtypes.StringDType)  # type: ignore

ValueError: The dataset is chunked but contains no chunks
```
> RuntimeError: Can't get fill value (fill value is undefined)

That's weird, but it's an upstream issue in kerchunk. EDIT: Possibly related to https://github.com/fsspec/kerchunk/issues/177
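For anyone reproducing this: kerchunk catches and quashes the exception, which is h5py raising `RuntimeError` when the HDF5 fill value is undefined. A minimal sketch of the same defensive read (the `safe_fillvalue` helper is hypothetical, not kerchunk's API):

```python
import h5py
import numpy as np

def safe_fillvalue(dset: h5py.Dataset):
    """Read a dataset's fill value, returning None when HDF5 reports it undefined."""
    try:
        return dset.fillvalue
    except RuntimeError:
        # h5py raises "Can't get fill value (fill value is undefined)" here
        return None

# Demo on an in-memory file with a defined fill value
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    d = f.create_dataset("precip", shape=(4,), dtype="f4", fillvalue=np.float32(-9999.0))
    fv = safe_fillvalue(d)
```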
> returns a dataset with most of the coordinates missing

This is almost certainly another symptom of the fact that VirtualiZarr does not yet understand how to decode according to CF conventions in the same way that xarray does. Some of those variables are present; they are just set as data variables instead of coordinates - that's the exact same bug as in issue #189. Other coordinates (lon_bnds & lat_bnds) are presumably created by xarray from metadata, rather than actually existing in the file.
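For context, this is what xarray's CF decoding does with a `coordinates` attribute. A minimal sketch on a toy dataset (the variable names are illustrative, not taken from the IMERG file):

```python
import numpy as np
import xarray as xr

# On disk, "longitude" is a plain data variable, and "precipitation" points
# at it via the CF "coordinates" attribute.
raw = xr.Dataset(
    {
        "precipitation": (("x",), np.zeros(4, dtype="f4"), {"coordinates": "longitude"}),
        "longitude": (("x",), np.linspace(-179.95, 179.95, 4).astype("f4")),
    }
)
assert "longitude" not in raw.coords  # still just a data variable

# xarray's CF decoding consumes the attribute and promotes it to a coordinate
decoded = xr.decode_cf(raw)
```

VirtualiZarr currently skips this decoding step, so such variables stay in `data_vars`.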
I have an in-progress PR to actually call xarray's internal CF decoding logic from within VirtualiZarr: https://github.com/zarr-developers/VirtualiZarr/pull/224.
> I also tried with the new kerchunk-free backend and got an error

This rings a bell, but I defer to @sharkinsspatial
Also, we should probably have some more tests that explicitly check that all the same variables and coordinates are present for a set of different real-world example files, whether opened with virtualizarr or xarray.
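Such a test could be as simple as comparing name sets between the two open paths. A sketch of a hypothetical helper (`assert_same_structure` is not an existing API):

```python
import numpy as np
import xarray as xr

def assert_same_structure(expected: xr.Dataset, actual: xr.Dataset) -> None:
    """Assert that two datasets expose the same data variable and coordinate names."""
    assert set(actual.data_vars) == set(expected.data_vars), (
        f"data_vars differ: {set(expected.data_vars) ^ set(actual.data_vars)}"
    )
    assert set(actual.coords) == set(expected.coords), (
        f"coords differ: {set(expected.coords) ^ set(actual.coords)}"
    )

# In a real test, `expected` would come from xr.open_dataset(path) and
# `actual` from open_virtual_dataset(path); here we just exercise the helper.
ds = xr.Dataset(
    {"precipitation": (("lat",), np.zeros(3, dtype="f4"))},
    coords={"lat": [0.0, 1.0, 2.0]},
)
assert_same_structure(ds, ds)
```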
> This is almost certainly another symptom of the fact that VirtualiZarr does not yet understand how to decode according to CF conventions in the same way that xarray does.... Some of the those variables are present, they are just set as data variables instead of coordinates
I don't think that's the case here. The following variables are present in the underlying HDF5 dataset but don't appear anywhere in the virtualized dataset, either as coordinates or data variables: nv, lonv, latv, lon, lat, time_bnds, lon_bnds, lat_bnds. Here's a dump of the HDF5 variables, their chunk shapes, and the number of initialized chunks:

```
nv                              (2,)           0
lonv                            (2,)           0
latv                            (2,)           0
time                            (32,)          1
lon                             (3600,)        1
lat                             (1800,)        1
time_bnds                       (32, 2)        1
lon_bnds                        (3600, 2)      1
lat_bnds                        (1800, 2)      1
precipitation                   (1, 145, 1800) 25
randomError                     (1, 145, 1800) 25
probabilityLiquidPrecipitation  (1, 291, 1800) 13
precipitationQualityIndex       (1, 145, 1800) 25
```
The variables are just being skipped completely.
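For reference, a dump like this can be produced with h5py's low-level `get_num_chunks()`. A sketch on a synthetic file that also reproduces the zero-chunk case (the file contents here are illustrative, not the real IMERG layout):

```python
import h5py
import numpy as np

def dump_chunk_info(group: h5py.Group) -> dict:
    """Map each chunked dataset name to (chunk shape, number of initialized chunks)."""
    info = {}
    for name, obj in group.items():
        if isinstance(obj, h5py.Dataset) and obj.chunks is not None:
            info[name] = (obj.chunks, obj.id.get_num_chunks())
    return info

with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    # Written at creation, so its single chunk is allocated
    f.create_dataset("lon", data=np.arange(3600, dtype="f4"), chunks=(3600,))
    # Chunked but never written: HDF5 allocates chunks lazily, so it has
    # 0 initialized chunks - the case the HDFVirtualBackend error trips over
    f.create_dataset("nv", shape=(2,), dtype="f4", chunks=(2,))
    info = dump_chunk_info(f)
```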
This actually seems to be an error in kerchunk? I made a kerchunk JSON of the file and the missing variables (lon, lat, lon_bnds, lat_bnds) do not even have a corresponding .zarray in the JSON.
```python
from kerchunk.hdf import SingleHdf5ToZarr
import ujson
import xarray as xr

m = SingleHdf5ToZarr("3B-HHR.MS.MRG.3IMERG.19980101-S000000-E002959.0000.V07B.HDF5")
m2 = m.translate()
with open("3B-HHR.MS.MRG.3IMERG.19980101-S000000-E002959.0000.V07B.HDF5.json", "w") as f:
    f.write(ujson.dumps(m2))

print(xr.open_dataset("3B-HHR.MS.MRG.3IMERG.19980101-S000000-E002959.0000.V07B.HDF5.json", engine="kerchunk", group="Grid"))
```
```
<xarray.Dataset> Size: 104MB
Dimensions:                         (Grid/time: 1, Grid/lon: 3600,
                                     Grid/lat: 1800, Grid/nv: 2)
Coordinates:
    time                            (Grid/time) object 8B ...
Dimensions without coordinates: Grid/time, Grid/lon, Grid/lat, Grid/nv
Data variables:
    precipitation                   (Grid/time, Grid/lon, Grid/lat) float32 26MB ...
    precipitationQualityIndex       (Grid/time, Grid/lon, Grid/lat) float32 26MB ...
    probabilityLiquidPrecipitation  (Grid/time, Grid/lon, Grid/lat) float32 26MB ...
    randomError                     (Grid/time, Grid/lon, Grid/lat) float32 26MB ...
    time_bnds                       (Grid/time, Grid/nv) object 16B ...
Attributes:
    GridHeader:  BinMethod=ARITHMETIC_MEAN;\nRegistration=CENTER;\nLatitudeRe...
```
> I made a kerchunk JSON of the file and the missing variables (lon, lat, lon_bnds, lat_bnds) do not even have a corresponding .zarray in the JSON.

That definitely sounds wrong, and would throw virtualizarr's kerchunk translator off completely.

But this `Grid/time` should not happen either; it should just be `time`. The group name should never appear in the dimension name like that. So even fsspec isn't reading these kerchunk references back correctly.
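One quick way to confirm which variables made it into a reference set is to scan the refs for `.zarray` keys. A sketch (the toy refs dict below is illustrative, not the actual kerchunk output for this file):

```python
def zarray_vars(refs: dict) -> set:
    """Names of variables that have a .zarray entry in a kerchunk reference set."""
    return {
        key[: -len("/.zarray")]
        for key in refs["refs"]
        if key.endswith("/.zarray")
    }

# Toy reference set mimicking the layout kerchunk's translate() returns;
# a real one would be loaded from the JSON file written above.
refs = {
    "version": 1,
    "refs": {
        "Grid/time/.zarray": "{}",
        "Grid/precipitation/.zarray": "{}",
        "Grid/precipitation/0.0.0": ["file.h5", 0, 100],
    },
}
present = zarray_vars(refs)
```

Any variable missing from this set (like lon, lat, and the bounds variables here) will be invisible to anything consuming the references.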
FWIW we got this file working fairly well with @sharkinsspatial's HDF-native reader: https://github.com/earth-mover/icechunk-nasa/blob/main/notebooks/write_virtual.ipynb