netcdf4-python
netCDF4-python writes string (unicode) attributes as 1-d arrays, not scalars
This code writes a single string attribute to an HDF5 file using netCDF4:
# Python 3.4.3
In [1]: import netCDF4
In [3]: ds = netCDF4.Dataset('/Users/shoyer/Downloads/global-attr.nc', 'w')
In [4]: ds.units = 'days since 1900'
In [5]: ds.close()
In [7]: !h5dump /Users/shoyer/Downloads/global-attr.nc
HDF5 "/Users/shoyer/Downloads/global-attr.nc" {
GROUP "/" {
ATTRIBUTE "units" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): "days since 1900"
}
}
}
}
Here's code to do the same thing with h5py:
In [8]: import h5py
In [9]: f = h5py.File('/Users/shoyer/Downloads/global-attr-h5py.nc')
In [10]: f.attrs['units'] = 'days since 1900'
In [11]: f.close()
In [12]: !h5dump /Users/shoyer/Downloads/global-attr-h5py.nc
HDF5 "/Users/shoyer/Downloads/global-attr-h5py.nc" {
GROUP "/" {
ATTRIBUTE "units" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "days since 1900"
}
}
}
}
As you can see from the h5dump output, netCDF4-python is writing the attribute with a "simple dataspace", which corresponds to a one-dimensional array containing a single element:
https://www.hdfgroup.org/HDF5/doc/UG/UG_frame12Dataspaces.html
In fact, this is exactly what you get if you view the file created with netCDF4-python using h5py (to netCDF4-python and ncdump, they appear identical):
In [13]: f = h5py.File('/Users/shoyer/Downloads/global-attr.nc')
In [14]: f.attrs['units']
Out[14]: array([b'days since 1900'], dtype=object)
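Until this is fixed, downstream readers may want to normalize attribute values so the two storage forms look the same. Here is a minimal sketch of such a helper (`attr_as_scalar` is a hypothetical name, not part of any library); it unwraps a 1-element array and decodes bytes to str:

```python
import numpy as np

def attr_as_scalar(value):
    """Normalize an HDF5/netCDF attribute value that may have been
    stored either as a scalar or as a 1-element array (the behaviour
    described in this issue)."""
    arr = np.asarray(value)
    if arr.ndim == 1 and arr.size == 1:
        # Unwrap the single element of a SIMPLE { (1) } dataspace.
        value = arr[0]
    if isinstance(value, bytes):
        # h5py returns fixed-length string attributes as bytes.
        value = value.decode('utf-8')
    return value
```

With this, both the scalar written by h5py and the 1-d array written by netCDF4-python come back as the same Python string.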
I believe netCDF4-python should be writing the attribute as a scalar, similar to what it does if you write bytes (or a string on Python 2):
# python 2.7
In [11]: ds = netCDF4.Dataset('/Users/shoyer/Downloads/global-attr-py27.nc', 'w')
In [12]: ds.bytes_str = 'days since 1900'
In [13]: ds.unicode_str = u'days since 1900'
In [14]: ds.close()
In [15]: !h5dump /Users/shoyer/Downloads/global-attr-py27.nc
HDF5 "/Users/shoyer/Downloads/global-attr-py27.nc" {
GROUP "/" {
ATTRIBUTE "bytes_str" {
DATATYPE H5T_STRING {
STRSIZE 15;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "days since 1900"
}
}
ATTRIBUTE "unicode_str" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): "days since 1900"
}
}
}
}
Given that netCDF4-python is simply using the netCDF-C library's nc_put_att_string function, this may very well be a bug upstream in the netCDF-C library.
It seems that when nc_put_att_text is used, the result is stored as a scalar in the HDF5 file, whereas nc_put_att_string (used when the string is unicode) creates a simple dataspace. Here's the relevant code snippet in _netCDF4.pyx:
if value_arr.dtype.char == 'U' and not is_netcdf3:
# a unicode string, use put_att_string (if NETCDF4 file).
ierr = nc_put_att_string(grp._grpid, varid, attname, 1, &datstring)
else:
ierr = nc_put_att_text(grp._grpid, varid, attname, lenarr, datstring)
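The same two dataspaces can be reproduced at the HDF5 level with h5py alone, which may help when reporting this upstream. A minimal sketch (the file name is a throwaway, not from the transcripts above): assigning a plain str yields a SCALAR dataspace, while creating the attribute from a 1-element list yields a SIMPLE { ( 1 ) } dataspace, analogous to nc_put_att_string with a length of 1:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')

with h5py.File(path, 'w') as f:
    # Assigning a plain str produces a SCALAR dataspace.
    f.attrs['scalar_attr'] = 'days since 1900'
    # Creating the attribute from a 1-element list produces a
    # SIMPLE { ( 1 ) } dataspace, like nc_put_att_string with len 1.
    f.attrs.create('array_attr', ['days since 1900'],
                   dtype=h5py.string_dtype())

with h5py.File(path, 'r') as f:
    scalar = f.attrs['scalar_attr']  # comes back 0-dimensional
    array = f.attrs['array_attr']    # comes back as a (1,) array
```

Dumping `demo.h5` with h5dump shows DATASPACE SCALAR for the first attribute and DATASPACE SIMPLE { ( 1 ) / ( 1 ) } for the second, matching the difference between the two netCDF-C code paths.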
I think you are right that this is due to how nc_put_att_string is implemented in the C library. It seems to be designed to write arrays of variable length strings.
Should I open a bug report for the C library, then?
Sure, wouldn't hurt. At the very least maybe we will find out why they chose to do it that way.