netcdf4-python
support for HDF5 dimension scales with null dataspace
I would like to use netCDF4-python (as a backend for Xarray) to read some HDF5 files, but I am unable to do so: attempting to read the files actually crashes Python. I've traced the problem to a dimension scale with a null dataspace in the HDF5 files. I understand that not all HDF5 files are netCDF4 files, but I don't think they should crash Python.
And in this particular case, the HDF5 file seems perfectly interpretable. As an enhancement to netCDF4-python, you could interpret a dimension scale with a null dataspace as its netCDF4 equivalent, which is "a netCDF dimension but not a netCDF variable."
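For reference, this is what that equivalent looks like on the netCDF4 side; a minimal sketch of my own (the file name equivalent.nc is arbitrary):
# Minimal sketch of the netCDF4 equivalent: a dimension that is not also a variable.
from netCDF4 import Dataset

with Dataset('equivalent.nc', 'w') as group:
    group.createDimension('x', 3)             # netCDF dimension 'x', no coordinate variable
    group.createVariable('y', 'f8', ('x',))   # data variable 'y' along that dimension

with Dataset('equivalent.nc') as group:
    print(group.dimensions)   # dimension 'x' of size 3
    print(group.variables)    # only 'y'; 'x' exists as a dimension but not as a variable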
Here is a reproducible example of code that crashes Python. I'm not entirely sure the problem isn't just a mismatch between the HDF5 libraries in use, since netCDF4-python and h5py each bundle their own copy of HDF5; my installs built nothing from source.
% cat danger.py
from h5py import File
from netCDF4 import Dataset
with File('danger.h5', 'w') as group:
    dataset = group.create_dataset('y', shape=(3,), dtype=float)
    dimension = group.create_dataset('x', shape=None, dtype=int)  # will crash Python when read below
    # dimension = group.create_dataset('x', shape=(3,), dtype=int)  # creates a misleading dataset instead
    dimension.make_scale('x')
    dataset.dims[0].attach_scale(dimension)

with Dataset('danger.h5') as group:
    print(group)
% python danger.py
Assertion failed: (ndims), function get_scale_info, file hdf5open.c, line 1396.
zsh: abort python danger.py
Here is the complete h5dump of the danger.h5 created by h5py. While it is not a netCDF4 file, I can't think of any reason netCDF4-python shouldn't interpret it correctly (as it already does when the commented-out line in the code above is used instead). The scale is simply a dimension that has no coordinates, which is valid in the netCDF4 data model.
HDF5 "danger.h5" {
GROUP "/" {
DATASET "x" {
DATATYPE H5T_STD_I64LE
DATASPACE NULL
DATA {
}
ATTRIBUTE "CLASS" {
DATATYPE H5T_STRING {
STRSIZE 16;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "DIMENSION_SCALE"
}
}
ATTRIBUTE "NAME" {
DATATYPE H5T_STRING {
STRSIZE 2;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "x"
}
}
ATTRIBUTE "REFERENCE_LIST" {
DATATYPE H5T_COMPOUND {
H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
H5T_STD_U32LE "dimension";
}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): {
DATASET 0 "/y",
0
}
}
}
}
DATASET "y" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): 0, 0, 0
}
ATTRIBUTE "DIMENSION_LIST" {
DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): (DATASET 0 "/x")
}
}
}
}
}
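Reading the same file back with h5py alone works fine, which is part of why I think the file itself is interpretable. A quick check (my own sketch, using only the public h5py API):
# Inspect danger.h5 with h5py only; no crash, and the scale is fully visible.
from h5py import File

with File('danger.h5') as group:
    scale = group['x']
    print(scale.shape)                # None -> null (empty) dataspace
    print(list(scale.attrs))          # the CLASS, NAME, REFERENCE_LIST attributes shown above
    print(group['y'].dims[0].keys())  # the scale named 'x' attached to y's first dimension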
Thank you for considering this! Here are my versions ...
% pip list
Package Version
---------- -------
cftime 1.6.2
h5py 3.7.0
netCDF4 1.6.2
numpy 1.23.5
pip 22.1.2
setuptools 62.3.3
wheel 0.37.1
% python --version
Python 3.10.8
% sw_vers
ProductName: macOS
ProductVersion: 12.6.1
BuildVersion: 21G217
If there is a workaround for this, it has to happen in the netcdf-c library. Can you file this as an issue at https://github.com/Unidata/netcdf-c?
Thanks, @jswhit. Filed as requested, as Unidata/netcdf-c#2571. Or do I need to repeat/update the description there? I hesitate to do that without knowing C.
@jswhit Any idea why there has been no comment from the Unidata team on Unidata/netcdf-c#2571?
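For anyone else hitting this before an upstream fix lands: a pre-check with h5py can at least turn the hard crash into a catchable Python error. This is only a guard, not a fix, and the helper name check_dimension_scales is my own:
# Rough sketch of a user-side guard: scan an HDF5 file with h5py and raise a
# readable error for null-dataspace dimension scales before handing the file
# to netCDF4 / netcdf-c.
import h5py

def check_dimension_scales(path):
    """Raise ValueError if `path` contains a dimension scale with a null dataspace."""
    def visit(name, obj):
        if (isinstance(obj, h5py.Dataset)
                and obj.attrs.get('CLASS') in (b'DIMENSION_SCALE', 'DIMENSION_SCALE')
                and obj.shape is None):
            raise ValueError(
                f"{path}: dimension scale '{name}' has a null dataspace; "
                "netCDF4 currently aborts on such files (Unidata/netcdf-c#2571)")

    with h5py.File(path) as group:
        group.visititems(visit)

check_dimension_scales('danger.h5')  # raises ValueError instead of crashing the interpreter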