iris Cannot save non-ASCII characters to NetCDF

🐛 Bug Report

From @gavinevans

Attempting to save a Cube including a string AuxCoord with non-ASCII characters (i.e. Unicode characters) raises the following exception:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe8' in position 0: ordinal not in range(128)

How To Reproduce

Steps to reproduce the behaviour:

import iris
from iris.coords import AuxCoord, DimCoord
from iris.cube import Cube

spot_index = DimCoord([0, 1], long_name='site_index', units=1)

station_name = AuxCoord(["Robièi", "Mühleberg"], long_name="station_name")
# This one works:
# station_name = AuxCoord(["Robiei", "Muhleberg"], long_name="station_name")

cube = Cube(
    [3, 4],
    dim_coords_and_dims=[(spot_index, 0)],
    aux_coords_and_dims=[(station_name, 0)]
)

iris.save(cube, "tmp.nc")

Expected behaviour

Should save with no exception (as happens when using the commented line above).

Environment

OS & Version: RHEL7
Iris Version: tested with v3.2.1.post0 and v3.4.0

Additional context

#4101
#4412

I think the fix will hinge on allowing for the extra bytes needed to store encoded Unicode characters. We currently divide the length by 4, which I think means we are always assuming a Unicode string can be converted to an ASCII one:

https://github.com/SciTools/iris/blob/fc302c9c08c292cb2075d2dd249bcbdfacf08da8/lib/iris/fileformats/netcdf/saver.py#L1881-L1883

Changing this could have loading consequences too?
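To illustrate the size mismatch described above: a NumPy Unicode array stores 4 bytes per character, so dividing the item size by 4 gives the character count, which is only enough bytes when every character is ASCII. A minimal sketch, assuming UTF-8 as the target encoding (the issue itself does not settle on one); the variable names are illustrative and not taken from the Iris source:

```python
import numpy as np

# The station names from the reproducer above.
names = np.array(["Robièi", "Mühleberg"])

# NumPy "U" dtypes use 4 bytes per character, so itemsize // 4
# is the length (in characters) of the longest string.
n_chars = names.dtype.itemsize // 4

# Bytes actually needed per string once encoded as UTF-8 (an assumption:
# other encodings would give different sizes).
utf8_lengths = [len(s.encode("utf-8")) for s in names.tolist()]

print(n_chars)       # character slots available under the divide-by-4 rule
print(utf8_lengths)  # bytes required per name in UTF-8
```

Here the longest name has 9 characters but needs 10 bytes in UTF-8, so a character dimension sized from the character count would be one byte too short.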

Traceback with Iris v3.4:

Traceback (most recent call last):
  File ".../iris/lib/2023-01-03_gavin.py", line 17, in <module>
    iris.save(cube, "tmp.nc")
  File ".../iris/lib/iris/io/__init__.py", line 457, in save
    saver(source, target, **kwargs)
  File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 2754, in save
    sman.write(
  File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 755, in write
    self._add_aux_coords(cube, cf_var_cube, cube_dimensions)
  File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 1088, in _add_aux_coords
    return self._add_inner_related_vars(
  File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 1053, in _add_inner_related_vars
    cf_name = self._create_generic_cf_array_var(
  File ".../iris/lib/iris/fileformats/netcdf/saver.py", line 1917, in _create_generic_cf_array_var
    new_data[index_slice] = list(
UnicodeEncodeError: 'ascii' codec can't encode character '\xe8' in position 0: ordinal not in range(128)
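The error reports "position 0" even though the è in "Robièi" is the fifth character, which is consistent with the string being split into single characters before each one is converted to a byte. A minimal sketch that reproduces the same kind of failure with NumPy alone, assuming (from the traceback, not from reading the saver code) that the characters are placed into a 1-byte string array:

```python
import numpy as np

name = "Robièi"

try:
    # Converting single characters to a 1-byte string dtype makes NumPy
    # encode each one as ASCII, which is expected to fail on 'è'.
    char_array = np.array(list(name), dtype="S1")
    error_message = None
except UnicodeEncodeError as err:
    error_message = str(err)

print(error_message)
```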

Jan 04 '23 11:01 trexfeathers

Hey @gavinevans, we're currently a bit low on resources, is this something you'd be interested in working on?

Jan 11 '23 10:01 ESadek-MO

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days time.

May 26 '24 00:05 github-actions[bot]

This issue hasn't yet been resolved.

May 28 '24 08:05 gavinevans

This new activity has prompted a very useful discussion in @SciTools/peloton:

NetCDF only supports ASCII (i.e. every character must be 1 byte). Iris could do something with non-ASCII characters, but it would be Iris specific - no other library would know how to interpret it.

We're quite uncomfortable making an explicit decision here, since the Iris devs are not exposed to all the possible use cases. Because there is no official convention, we would prefer individual users/teams to define their own encode/decode rules, as they alone know the specifics (e.g. how many bytes are needed). This would probably take the form of a bytes array (rather than a character array), with user-written functions to write and read correctly. @gavinevans @brhooper how does this sound?
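As a rough idea of what such user-written functions might look like, here is a minimal sketch using only NumPy. The function names `encode_strings` and `decode_strings` are hypothetical, UTF-8 with NUL padding is an assumed rule rather than an agreed convention, and nothing here is an Iris API:

```python
import numpy as np


def encode_strings(strings, encoding="utf-8"):
    """Encode strings into a fixed-width 2D uint8 array, padded with zeros."""
    encoded = [s.encode(encoding) for s in strings]
    # Width is set by the longest *encoded* string, not the character count.
    width = max((len(b) for b in encoded), default=0)
    out = np.zeros((len(encoded), width), dtype=np.uint8)
    for i, b in enumerate(encoded):
        out[i, : len(b)] = np.frombuffer(b, dtype=np.uint8)
    return out


def decode_strings(byte_array, encoding="utf-8"):
    """Invert encode_strings: strip the zero padding and decode each row."""
    return [
        row.tobytes().rstrip(b"\x00").decode(encoding) for row in byte_array
    ]


names = ["Robièi", "Mühleberg"]
as_bytes = encode_strings(names)
round_tripped = decode_strings(as_bytes)
print(as_bytes.shape, round_tripped)
```

The resulting integer array could be stored like any other numeric data, with the reader applying the matching decode function. A scheme like this is only meaningful to code that knows the chosen rule, which is the concern raised above.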

If anyone is aware of an 'official' convention that Iris should follow, please speak up 😊

May 29 '24 09:05 trexfeathers

I am not sure I understand what you mean with "NetCDF only supports ASCII (i.e. every character must be 1 byte)". Is the problem specific to string/char auxiliary coordinate values?

May 29 '24 09:05 larsbarring

> I am not sure I understand what you mean with "NetCDF only supports ASCII (i.e. every character must be 1 byte)". Is the problem specific to string/char auxiliary coordinate values?

We believe you can use Unicode in NetCDF names and in string attributes, but NOT in any data arrays.

May 29 '24 09:05 trexfeathers

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days time.

Oct 12 '25 00:10 github-actions[bot]

Will consider this for 3.15

Oct 13 '25 08:10 trexfeathers

> Will consider this for 3.15

... by which I mean we should make it easy - and documented - for users to define their own encode/decode routine.

Oct 15 '25 09:10 trexfeathers
