iris
Cannot save non-ASCII characters to NetCDF
🐛 Bug Report
From @gavinevans
Attempting to save a Cube including a string AuxCoord with non-ASCII characters (i.e. Unicode characters) raises the following exception:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe8' in position 0: ordinal not in range(128)
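The same error message can be reproduced without Iris, which suggests the failure comes from an ASCII encoding step rather than from the Cube itself. The sketch below is plain Python and makes no claim about what the saver does internally; `try_ascii` is a hypothetical helper written only for this illustration.

```python
# Minimal illustration (not Iris code): encoding a non-ASCII character
# with the ASCII codec fails with the same kind of error as in the report.
def try_ascii(text):
    """Return the ASCII bytes for text, or the error message if encoding fails."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as err:
        return str(err)


plain = try_ascii("e")          # pure ASCII, encodes fine
accented = try_ascii("\u00e8")  # "è", cannot be represented in ASCII

print(plain)
print(accented)
```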
How To Reproduce
Steps to reproduce the behaviour:
import iris
from iris.coords import AuxCoord, DimCoord
from iris.cube import Cube
spot_index = DimCoord([0, 1], long_name='site_index', units=1)
station_name = AuxCoord(["Robièi", "Mühleberg"], long_name="station_name")
# This one works:
# station_name = AuxCoord(["Robiei", "Muhleberg"], long_name="station_name")
cube = Cube(
    [3, 4],
    dim_coords_and_dims=[(spot_index, 0)],
    aux_coords_and_dims=[(station_name, 0)]
)
iris.save(cube, "tmp.nc")
Expected behaviour
Should save with no exception (as happens when using the commented line above).
Environment
- OS & Version: RHEL7
- Iris Version: tested with v3.2.1.post0 and v3.4.0
Additional context
Related:
- #4101
- #4412
I think the fix will hinge on allowing for the extra bytes needed to store encoded Unicode characters. We currently divide the length by 4, which I think means we always assume a Unicode string can be converted to an ASCII one:
https://github.com/SciTools/iris/blob/fc302c9c08c292cb2075d2dd249bcbdfacf08da8/lib/iris/fileformats/netcdf/saver.py#L1881-L1883
Changing this could have loading consequences too?
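To make the byte-count concern concrete: NumPy stores Unicode string arrays with 4 bytes per character, so dividing the item size by 4 gives a character count, not the number of bytes an encoded string needs. The sketch below uses only plain Python to compare the three widths for the station names from the report. It assumes UTF-8 as the target encoding, which is an illustrative choice and not something Iris prescribes, and `width_report` is a hypothetical helper.

```python
# Compare character count, 4-bytes-per-character storage size, and UTF-8 size.
# Escapes are used so the strings are unambiguous: "Robièi" and "Mühleberg".
names = ["Robi\u00e8i", "M\u00fchleberg"]


def width_report(text):
    """Return (characters, UCS-4 byte count, UTF-8 byte count) for text."""
    chars = len(text)
    ucs4_bytes = len(text.encode("utf-32-le"))  # 4 bytes per character, no BOM
    utf8_bytes = len(text.encode("utf-8"))      # 1 to 4 bytes per character
    return chars, ucs4_bytes, utf8_bytes


reports = {name: width_report(name) for name in names}
for name, (chars, ucs4_bytes, utf8_bytes) in reports.items():
    print(name, chars, ucs4_bytes, utf8_bytes)
```

In both cases the UTF-8 length is one byte larger than the character count, so a string dimension sized by character count alone would be too short for the encoded data.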
Traceback with Iris v3.4:
Traceback (most recent call last):
File ".../iris/lib/2023-01-03_gavin.py", line 17, in <module>
iris.save(cube, "tmp.nc")
File ".../iris/lib/iris/io/__init__.py", line 457, in save
saver(source, target, **kwargs)
File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 2754, in save
sman.write(
File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 755, in write
self._add_aux_coords(cube, cf_var_cube, cube_dimensions)
File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 1088, in _add_aux_coords
return self._add_inner_related_vars(
File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 1053, in _add_inner_related_vars
cf_name = self._create_generic_cf_array_var(
File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 1917, in _create_generic_cf_array_var
new_data[index_slice] = list(
UnicodeEncodeError: 'ascii' codec can't encode character '\xe8' in position 0: ordinal not in range(128)
Hey @gavinevans, we're currently a bit low on resources, is this something you'd be interested in working on?
In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.
If this issue is still important to you, then please comment on this issue and the stale label will be removed.
Otherwise this issue will be automatically closed in 28 days' time.
This issue hasn't yet been resolved.
This new activity has prompted a very useful discussion in @SciTools/peloton:
NetCDF only supports ASCII (i.e. every character must be 1 byte). Iris could do something with non-ASCII characters, but it would be Iris specific - no other library would know how to interpret it.
We're quite uncomfortable making an explicit decision here, since the Iris devs are not exposed to all the possible use cases. As there is no official convention, we would prefer individual users/teams to define their own encode/decode rules, because they alone know the specifics (e.g. how many bytes are needed). This would probably take the form of a bytes array (rather than a character array), with user-written functions to write and read correctly. @gavinevans @brhooper how does this sound?
If anyone is aware of an 'official' convention that Iris should follow, please speak up 😊
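As a rough sketch of what user-written encode/decode functions for a bytes array might look like: the example below packs strings into fixed-width rows of byte values and reads them back. It is not an Iris or NetCDF API. The function names, the UTF-8 default, and the NUL padding are all assumptions made for illustration, and a team would choose its own rules.

```python
# Hypothetical user-side helpers: store strings as fixed-width rows of bytes.
def encode_fixed_width(strings, encoding="utf-8"):
    """Encode strings and pad each one with NUL bytes to a common width.

    Returns (rows, width), where each row is a list of integers in 0..255.
    """
    encoded = [s.encode(encoding) for s in strings]
    width = max((len(b) for b in encoded), default=0)
    rows = [list(b.ljust(width, b"\x00")) for b in encoded]
    return rows, width


def decode_fixed_width(rows, encoding="utf-8"):
    """Reverse encode_fixed_width: strip the NUL padding and decode each row."""
    return [bytes(row).rstrip(b"\x00").decode(encoding) for row in rows]


names = ["Robi\u00e8i", "M\u00fchleberg"]  # "Robièi", "Mühleberg"
rows, width = encode_fixed_width(names)
round_tripped = decode_fixed_width(rows)
print(width, round_tripped)
```

The width is measured in encoded bytes, not characters, which is the detail only the user can decide in advance. One limitation of this padding scheme is that a string that genuinely ends in NUL characters would lose them on decoding.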
I am not sure I understand what you mean with "NetCDF only supports ASCII (i.e. every character must be 1 byte)". Is the problem specific to string/char auxiliary coordinate values?
We believe you can use Unicode in NetCDF names and in string attributes, but NOT in any data arrays.
Will consider this for 3.15
... by which I mean we should make it easy - and documented - for users to define their own encode/decode routine.