netcdf4-python

NFC normalize strings?

Open ChrisBarker-NOAA opened this issue 1 year ago • 4 comments

The NUG indicates that strings (dimension and variable names, anyway) should be NFC normalized.

""" ... names are normalized according to Unicode NFC normalization rules during encoding as UTF-8 for storing in the file header. This is necessary to ensure that gratuitous differences in the representation of Unicode names do not cause anomalies in comparing files and querying data objects by name. """

(And the next CF release will specify NFC normalization for all text.)
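To make the "gratuitous differences" in that quote concrete, here is a minimal sketch using only the standard library. The example strings are made up for illustration; they are not from the NUG.

```python
import unicodedata

# The same visible text, "café", in two different Unicode representations.
composed = "caf\u00e9"      # "é" as a single precomposed code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# A plain comparison treats them as different names.
equal_raw = composed == decomposed

# After NFC normalization both collapse to the precomposed form.
equal_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(equal_raw, equal_nfc)
```

This is why a lookup by name can fail when the stored name and the query use different representations.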

But as far as I can tell, netCDF4 isn't doing that. It probably should.

I think it may be as easy as adding:

import unicodedata
pystr = unicodedata.normalize('NFC', pystr)

to _strencode()

Granted -- this does mean that users may get something slightly different back when they round-trip a name through netCDF.

If that's a concern, then you could call unicodedata.is_normalized and raise an error instead.
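A rough sketch of the two options above, for illustration only. `encode_name` and its `strict` flag are hypothetical stand-ins; this is not the actual `_strencode()` from netCDF4, whose signature may differ. Note that `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def encode_name(pystr, encoding="utf-8", strict=False):
    """Hypothetical helper: NFC-normalize a name, then encode it.

    With strict=True, a non-normalized name raises instead of being
    silently changed, so the caller always gets back what was passed in.
    """
    if strict and not unicodedata.is_normalized("NFC", pystr):
        raise ValueError(f"name {pystr!r} is not NFC normalized")
    return unicodedata.normalize("NFC", pystr).encode(encoding)


# Option 1: normalize silently (the name may change on a round trip).
normalized_bytes = encode_name("separate\u0065\u0301")

# Option 2: refuse non-normalized input.
try:
    encode_name("separate\u0065\u0301", strict=True)
    strict_rejected = False
except ValueError:
    strict_rejected = True
```

The silent variant favors interoperability; the strict variant favors exact round-tripping of names.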

ChrisBarker-NOAA, Oct 28 '24 06:10

That section of the NUG only applies to netcdf classic, not HDF5. Plus, I read that as meaning that the library does that for you (so the python layer doesn't need to).

jswhit, Oct 29 '24 00:10

Hmm -- I'm pretty sure that all variable and dimension names are supposed to be NFC normalized. That section of the NUG does talk about the header, so yes, it's probably only vital for netCDF classic. But it's still a good idea, and CF will be requiring it anyway.

The search on the NUG is broken, so I'm having a hard time finding what I'm looking for :-(

"The library does that for you"

I doubt it -- but worth a look. It would be great if it did.

I'll try to poke into it.

ChrisBarker-NOAA, Oct 29 '24 00:10

OK -- I've poked into it, and you are completely correct -- the netCDF C lib is NFC normalizing variable names. Here's an experiment with netCDF4:

import unicodedata

import netCDF4

# "é" as a single precomposed code point (already NFC)
normal_name = "composed\u00E9"

# "e" followed by a combining acute accent (decomposed, not NFC)
non_normal_name = "separate\u0065\u0301"

with netCDF4.Dataset("nfc-norm.nc", 'w') as ds:
    dim = ds.createDimension("a_dim", 10)
    var1 = ds.createVariable(normal_name, float, ("a_dim",))
    var2 = ds.createVariable(non_normal_name, float, ("a_dim",))
    var1[:] = range(10)
    var2[:] = range(10)


with netCDF4.Dataset("nfc-norm.nc", 'r') as ds:
    # get the vars from their original names
    try:
        norm = ds[normal_name]
        print(f"{normal_name} worked")
    except IndexError:
        print(f"{normal_name} didn't work")

    try:
        non_norm = ds[non_normal_name]
        print(f"{non_normal_name} worked")
    except IndexError:
        print(f"{non_normal_name} didn't work")
        # retry the lookup with the NFC-normalized form of the name
        non_norm = ds[unicodedata.normalize('NFC', non_normal_name)]
        print("But it  did once normalized!")

    for name in ds.variables.keys():
        assert unicodedata.is_normalized('NFC', name)
    print("All variable names are normalized")

And when run:

In [54]: run nfc_norm.py
composedé worked
separateé didn't work
But it  did once normalized!
All variable names are normalized

So indeed, the C lib is doing it for you -- nothing to be done here.

Except maybe a note in the docs ...

ChrisBarker-NOAA, Oct 29 '24 01:10

Another potential issue -- not sure if this is something that should be built in to the lib:

The next version of CF will specify that attributes should be NFC normalized. This is because a number of CF attributes reference variable names, so they really need to be exact / compare equally.

I just tested, and string attributes are not being normalized.

So the netCDF4 lib could normalize attributes too.

(so could the C lib, but I'm guessing they won't want to go there -- it's not critical to netcdf itself)
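Until something like that is built in, attribute values could be normalized on the user side before they are written. A minimal sketch, assuming a plain dict of attributes; `nfc_attrs` is a hypothetical helper, not part of netCDF4, and the attribute values are made up.

```python
import unicodedata


def nfc_attrs(attrs):
    """Return a copy of attrs with every str value NFC-normalized.

    Non-string values (numbers, arrays, ...) are passed through unchanged.
    """
    return {
        name: unicodedata.normalize("NFC", value) if isinstance(value, str) else value
        for name, value in attrs.items()
    }


# A CF-style attribute that references a variable by name, written with a
# decomposed "é" (e + combining acute accent), plus a numeric attribute.
raw = {"coordinates": "lat separate\u0065\u0301", "scale_factor": 0.5}
clean = nfc_attrs(raw)
```

The cleaned dict could then be handed to `setncatts()` on a Dataset or Variable, so that a name referenced in an attribute compares equal to the NFC-normalized variable name stored by the C library.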

ChrisBarker-NOAA, Oct 29 '24 18:10