Iris cannot read NetCDF4 strings stored in NetCDF variables
On behalf of an Iris User:
I'm having some trouble reading/writing NetCDF files with Iris 2.4 for datasets with string type variables. I thought I should bring what I've found to your attention and also in the hope that you might have solutions (the solution that I've found would require a small tweak to Iris' NetCDF reading - so not a viable solution for me currently). I’ve added code to demonstrate at the end of this email.
We need to read/write NetCDF files for meteorological station data. These include a list of station names, stored in a NetCDF variable of length equal to the number of stations. This corresponds to a data dimension for observations, structured in a 2D array, station index by time (not included in the example code). There are two possible approaches: using old-style NetCDF character/byte arrays (undesirable, as this does not support special characters in our international station database) or NetCDF4-style unicode strings.
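As a small illustration of why one-byte-per-character arrays are awkward for names like Düsseldorf, the sketch below uses only Python's built-in string encoding. It does not touch NetCDF at all, and it assumes a character array stores one byte per element.

```python
# "Düsseldorf", written with an escape so the snippet does not depend on
# the encoding of the source file.
name = "D\u00fcsseldorf"

# One byte per character only works for ASCII text.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent the name, but the byte count no longer matches
# the character count.
utf8_bytes = name.encode("utf-8")

print(len(name))        # number of characters
print(len(utf8_bytes))  # number of bytes once encoded as UTF-8
print(ascii_ok)         # whether a plain ASCII encoding succeeded
```

The name cannot be encoded as ASCII, and its UTF-8 form is one byte longer than its character count, so a fixed-width byte layout needs extra bookkeeping that unicode string variables avoid.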
We can create an example cube as follows (you might need Python 3.7 for the ü in Düsseldorf):

    station_names = np.array([u'Exeter', u'London', u'Düsseldorf'])
    station_cube = iris.cube.Cube(station_names, long_name='station_names')
To save this cube I need to give iris.save a fill_value to use. The save also doesn’t work if the string data are stored in a masked array.
The fill value is again a problem when loading from the saved NetCDF file (see traceback at the end of this email). The failure occurs in iris/fileformats/netcdf.py.
The failing line in iris.fileformats.netcdf._get_cf_var_data reads:

    fill_value = getattr(cf_var.cf_data, '_FillValue',
                         netCDF4.default_fillvals[cf_var.dtype.str[1:]])
Problems arise in two places:
- netCDF4.default_fillvals contains no default entry for string types other than the S1 (non-unicode) dtype. This is the same problem that we had when saving without a fill_value argument set.
- cf_var.dtype.str[1:] fails because the cf_var.dtype for the loaded string data is of type str, which does not have an str attribute.
I tried a nasty hack to stop Iris from looking for a default fill_value at the failing line. This works around the problem and the cube loads without issue. This clearly isn't a viable solution for me to implement and I’m sure that I’m missing other complexities.
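For illustration only, a guarded lookup along these lines might look like the sketch below. It is neither the hack used in the email nor Iris's actual code. The helper name default_fill_or_none and the STAND_IN_FILLVALS table are hypothetical, the table is only an illustrative subset standing in for netCDF4.default_fillvals, and SimpleNamespace objects stand in for numpy dtypes so that the snippet runs with the standard library alone.

```python
from types import SimpleNamespace

# Illustrative stand-in for netCDF4.default_fillvals: keys are numpy-style
# type codes without the byte-order character. Only a subset is shown and
# the values are for demonstration.
STAND_IN_FILLVALS = {"S1": "\x00", "i4": -2147483647, "f8": 9.969209968386869e36}


def default_fill_or_none(dtype, fillvals=STAND_IN_FILLVALS):
    """Hypothetical helper: return a default fill value, or None if there is none."""
    # numpy dtypes expose a type code such as '<f8' via .str; the built-in
    # str type (what cf_var.dtype is for NetCDF4 strings) has no such attribute.
    type_code = getattr(dtype, "str", None)
    if type_code is None:
        return None
    # Drop the byte-order character and return None for unknown codes
    # (for example 'U10') instead of raising KeyError.
    return fillvals.get(type_code[1:])


# Stand-ins for numpy dtypes, so the sketch runs without numpy installed.
float64_like = SimpleNamespace(str="<f8")
unicode_like = SimpleNamespace(str="<U10")

print(default_fill_or_none(float64_like))  # value from the stand-in table
print(default_fill_or_none(unicode_like))  # None: no entry for unicode strings
print(default_fill_or_none(str))           # None: str has no .str attribute
```

Returning None rather than raising leaves the caller to decide what "no fill value" means for string data, which is where the other complexities mentioned above would need to be handled.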
I hope that this makes sense and is of some use to you. Our current solution involves over 10,000 individual NetCDF files, one for each station, as we can store Unicode strings in NetCDF attributes with no problem. The large overhead for I/O of lots of small NetCDF files is rather cumbersome in our application and for end users of the dataset.
Example code for station name I/O:

    import numpy as np
    import iris

    filename = 'string_test.nc'

    # Set up a numpy array of station names to be saved. Umlaut may not work prior to Python 3.7.
    #station_names = np.array([u'Exeter', u'London', u'Düsseldorf'])
    station_names = np.array([u'Exeter', u'London', u'Dusseldorf'])

    # Make our cube to save - station_names cannot be a masked array or iris.save falls over
    station_cube = iris.cube.Cube(station_names, long_name='station_names')

    # Save and load to test. fill_value must be set or iris.save will fall over
    # (no corresponding data type in netCDF4.default_fillvals).
    iris.save(station_cube, filename, fill_value='N/A')

    # Reload data. This fails
    loaded_station_cube = iris.load_cube(filename)
This returns the following traceback:

    Traceback (most recent call last):
      File "
This happens because doing cf_var.dtype gives us str rather than a numpy datatype. This causes issues in the save code, when it's determining the default fill value and when it's checking against itemsize of the dtype. It also causes issues in the load code once you've fixed the save code, because the lookup fails similarly, and if a naïve fix is applied it gives the dtype of the loaded cube as object rather than <U9 or whatever the original dtype was.
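The object-versus-unicode dtype difference can be reproduced with numpy alone. The sketch below does not use Iris or netCDF4, and the object array is only a stand-in for what a naïve load might return. It shows one possible way to get back to a fixed-width unicode dtype, where the width is inferred from the longest string present rather than read from the file.

```python
import numpy as np

# Fixed-width unicode array, as created in the example script.
original = np.array(["Exeter", "London", "Dusseldorf"])

# Stand-in for a naively loaded variable: same values, dtype object.
loaded = np.array(["Exeter", "London", "Dusseldorf"], dtype=object)

# Casting to str lets numpy pick a width that fits the longest element.
restored = loaded.astype(str)

print(original.dtype.kind, loaded.dtype.kind, restored.dtype.kind)
print(restored.dtype == original.dtype)
```

Because the width comes from the data, this only matches the original dtype when the original array was no wider than its longest string.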
This stale issue has been automatically closed due to a lack of community activity.