xarray icon indicating copy to clipboard operation
xarray copied to clipboard

`numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends

Open keewis opened this issue 1 year ago • 13 comments

netcdf4 didn't have numpy 2 compatible builds before the last release, so we couldn't test that yet. This may reveal a couple of issues.

  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst

keewis avatar Jun 19 '24 17:06 keewis

@kmuehlbauer, would you have time to look into the two remaining failing tests? From what I can tell, this has been emitting a DeprecationWarning for quite a while and raises an OverflowError in numpy>=2. I fear I don't know enough about our encoding operations to debug this.

(in case you know what the issue is, feel free to push a fix to this PR)

keewis avatar Jun 19 '24 19:06 keewis

@keewis Yes, I'll have a closer look tomorrow.

kmuehlbauer avatar Jun 19 '24 19:06 kmuehlbauer

Adding error log for verbosity:

FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype0-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8
FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype1-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8

kmuehlbauer avatar Jun 20 '24 10:06 kmuehlbauer

OK, that looks like some error using a combination of packed data and _Unsigned-attribute. I'm currently lacking a upstream-dev environment on my laptop and WiFi is bad while traveling.

Xref: https://docs.unidata.ucar.edu/nug/current/best_practices.html#bp_Unsigned-Data plus the following section on packed data values.

kmuehlbauer avatar Jun 20 '24 10:06 kmuehlbauer

@keewis I have a local reproducer now, looking into this later today.

kmuehlbauer avatar Jun 21 '24 06:06 kmuehlbauer

Citing from above link: NetCDF-3 does not have unsigned integer primitive types.

The _Unsigned attribute was introduced, to notify any reading application, that the signed data on disk should be treated as unsigned in memory. Nevertheless the _FillValue need to be of on-disk type (signed).

The implementation is somewhat hacky, as we have to consider also scale/offset. My current proposal includes using the correct _FillValue within the test and use view instead of cast for transformation from uint -> int.

kmuehlbauer avatar Jun 21 '24 11:06 kmuehlbauer

thanks for looking into this and fixing it, @kmuehlbauer (and so quickly, too)!

Just so I understand the implications of using view: we will only ever encounter numpy (or cupy, which aims to be a drop-in replacement) arrays here, right?

Edit: the failing upstream-dev CI is unrelated, that's a scipy deprecation.

keewis avatar Jun 22 '24 21:06 keewis

Now that you ask that, I'm a bit unsure.

_FillValue is an attribute here, a scalar, so no arrays involved. To transform from uint to int, I've wrapped it into numpy scalar to use view and assigned it back to the attribute. Should work, but there might be another (better) solution I'm missing.

kmuehlbauer avatar Jun 23 '24 06:06 kmuehlbauer

revisiting this, I believe this is fine, though this does fail if for example data is backed by a cupy array. I couldn't get that to work even with main and numpy<2, so maybe this doesn't matter. @dcherian, any opinions? Otherwise this should be ready to merge (edit: and really, this is blocking a bunch of PRs, so merging sooner than later would be great).

keewis avatar Jun 27 '24 17:06 keewis

@keewis I'm not sure, if my fix is the ultimate solution. I'm on it with a better fitting solution today.

kmuehlbauer avatar Jun 28 '24 08:06 kmuehlbauer

well, feel free to improve/modify as you see fit!

keewis avatar Jun 28 '24 09:06 keewis

UnsignedCoder again:

That's a really nasty thing. This is what netcdf4-python tells us about it:

In addition, if maskandscale is set to True, and if the variable has an attribute _Unsigned set to "true", and the variable has a signed integer data type, a view to the data is returned with the corresponding unsigned integer data type. This convention is used by the netcdf-java library to save unsigned integer data in NETCDF3 or NETCDF4_CLASSIC files (since the NETCDF3 data model does not have unsigned integer data types).

So I think, that a view is probably the right thing to do at that location (can't think of anything better). I've had to tweak also the encoding side of things to correctly roundtrip. I've added another test to be sure everything works as expected with plain _Unsigned="true".

kmuehlbauer avatar Jun 28 '24 10:06 kmuehlbauer

the failing doctests CI will be fixed by #9177 (no idea about mypy). So if the implementation of the decoder fix is fine, this should finally be ready for merging.

keewis avatar Jul 02 '24 10:07 keewis

Presumably the numpy version in the mypy env changed? I'm OK with addressing that later. Shall we merge?

dcherian avatar Jul 11 '24 02:07 dcherian

the mypy environment uses the standard environment, which means that unpinning here surfaces the issues in that CI as well. Merging sounds good to me (I just didn't want to merge my own PR), I just didn't want to merge my own PR.

keewis avatar Jul 11 '24 08:07 keewis

I just didn't want to merge my own PR

I think we should all get a little more comfortable doing this. It's immeasurably worse, to our user base, to leave finished PRs hanging around and unreleased.

dcherian avatar Jul 11 '24 08:07 dcherian

FYI this PR seems to break the case where _FillValue is not the same type as the array. I haven't completely tracked it down, but I'm getting failures in my unstable CI with:

            if "_FillValue" in attrs:
                new_fill = np.array(attrs["_FillValue"])
                # use view here to prevent OverflowError
>               attrs["_FillValue"] = new_fill.view(signed_dtype).item()
E               ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

djhoese avatar Jul 16 '24 18:07 djhoese

Here's a reproducer:

import xarray as xr
import numpy as np

v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": -1}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})

uic = xr.coding.variables.UnsignedIntegerCoder()

uic.encode(v, "test")
# ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

Basically I'm passing 32-bit float data so signed_dtype becomes int32 in the encode method. So np.array(-1).view(np.int32) is not allowed as np.array(-1) is a 64-bit integer. Even if I pass fill value as a int16 (matching my output encoding) it still fails because it is trying to use a signed_dtype of int32 to match the 32-bit float I'm passing in.

Edit: If I pass fill value as np.array(-1, dtype=np.int32) then it seems to work.

djhoese avatar Jul 16 '24 18:07 djhoese

This works for me in latest main branch.

import xarray as xr
import numpy as np
fillvalue = np.int16(-1)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")

@djhoese Are you writing to NETCDF3 or NETCDF4_CLASSIC? You might also just skip the "_Unsigned" attribute and use uint16 as is when writing to NETCDF4.

kmuehlbauer avatar Jul 17 '24 13:07 kmuehlbauer

@kmuehlbauer The netcdf files being produced are NetCDF4 and are being ingested by a third-party Java application so using unsigned types directly isn't allowed (Java doesn't have unsigned types).

djhoese avatar Jul 17 '24 14:07 djhoese

It seems I can do fill value as np.array(65535, dtype=np.uint16) but then it does show up in the resulting Variable's attrs as 65535. Oh I can do plain 65535 too. I haven't tested what this shows up as in a final NetCDF file.

Edit: As far as I can find the _FillValue is always supposed to match the dtype of the packed on-disk variable. Specifying 65535 makes this not true.

djhoese avatar Jul 17 '24 14:07 djhoese

Ok I've made a PR for Satpy that I think fixes my use case. I think the main misunderstanding here is that someone used to writing NetCDF files outside of xarray and possibly familiar with CF standards will think they need _FillValue to match the dtype of the on-disk/in-file variable. Xarray is assuming here that the _FillValue is in the type of the in-memory data or at least that's what I'm telling myself. My workaround is to pass the -1 as the unsigned equivalent (65535) and then xarray does the conversion to the on-disk type (signed int) "for me".

In my opinion the fix in this PR is incorrect, but I feel like I'm missing some use case for why it needed to be changed in the first place so I'm not sure I can suggest something better.

You mentioned above that there were overflow cases being errored on in numpy 2 with the old code, but if the user is specifying _FillValue as an on-disk-dtype compatible value then that never should have happened, right?

djhoese avatar Jul 17 '24 15:07 djhoese

Providing the signed int16 equivalent (-1) as _FillValue should work, as in my example above. Does this not work?

kmuehlbauer avatar Jul 17 '24 15:07 kmuehlbauer

This seems to work perfectly fine:

import xarray as xr
import numpy as np
fillvalue = np.uint16(65535)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")
<xarray.Variable (y: 10, x: 10)> Size: 200B
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int16)
Attributes:
    _FillValue:  -1
    _Unsigned:   true

You can specify _FillValue either as uint or int, it will be converted to packed type dtype (int) using .view (this was introduced in this PR, since numpy2 errors now when just trying to cast).

kmuehlbauer avatar Jul 17 '24 15:07 kmuehlbauer

Yes, you're right. This works. I missed that that's what you were doing in your code. However, I'd say this isn't expected from a user point of view. I don't think I should have to specify a numpy scalar for _FillValue. Regular python integers should work too.

djhoese avatar Jul 17 '24 15:07 djhoese

@kmuehlbauer Further up you had said that the change to the new view logic was to avoid overflow complaints from numpy 2. In those cases, what was being provided as the _FillValue?

djhoese avatar Jul 17 '24 16:07 djhoese

@kmuehlbauer I have another case I just discovered after working around the -1 case as discussed. I have a dataset coming from somewhere else that is uint8 and a value of 1 is the fill value. The variable needs to be stored as a signed int8 with _Unsigned set to make the third-party application happy.

In [6]: v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.uint8), attrs={"_FillValue": 1}, encoding={"_Unsigned": "true", "dtype": "int8", "zlib": True})

In [7]: uic.encode(v, "test")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 uic.encode(v, "test")

File ~/miniforge3/envs/satpy_py312/lib/python3.12/site-packages/xarray/coding/variables.py:524, in UnsignedIntegerCoder.encode(self, variable, name)
    522     new_fill = np.array(attrs["_FillValue"])
    523     # use view here to prevent OverflowError
--> 524     attrs["_FillValue"] = new_fill.view(signed_dtype).item()
    525     # new_fill = signed_dtype.type(attrs["_FillValue"])
    526     # attrs["_FillValue"] = new_fill
    527 data = duck_array_ops.astype(duck_array_ops.around(data), signed_dtype)

ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

So the above fails because new_fill = np.array(attrs["_FillValue"]) is taking a python native integer and making it an np.int64. Now the .view(signed_dtype) logic fails because it is trying to cast a 64-bit integer to a 8-bit integer.

Note: I copied the code from this PR into my stable xarray environment so I could swap between the two behaviors quickly.

djhoese avatar Jul 17 '24 17:07 djhoese

@djhoese Obviously I did not take all possibilities into account, as it unfortunately looks like 😢. I have to admit, that the CF coding issues are getting increasingly annoying :roll_eyes:.

I'm currently traveling and may not have time to dedicate for the next 2 weeks. If you or others see immediate fixes for this, please go ahead.

kmuehlbauer avatar Jul 17 '24 19:07 kmuehlbauer

I'll see what I can do. I think this issue I've brought up can be summarized as: _FillValue should be able to be a python native integer or numpy scalar. I think the main reason it isn't working is the np.array(attrs["_FillValue"]) because it takes a python native integer and turns it into a int64 most of the time.

djhoese avatar Jul 17 '24 19:07 djhoese

@kmuehlbauer and others, question for you as I work on "fixing" this: Is the roundtrip the only thing that matters? Or can xarray accept input and assume it knows best about what the user wanted? Especially in this _Unsigned case, I'm not sure it is a clear answer. Does the user always provide a _FillValue of the output on-disk type? Or is xarray allowed to normalize any fill value to the on-disk type? If it does that normalization, wrote it to a NetCDF file, and read it back in then (I think) it wouldn't have the original _FillValue the user started with.

In the original create_unsigned_masked_scaled_data test helper function the on-disk type was int8 (i1) and the _FillValue was 255 in the encoding dict. @kmuehlbauer changed that fill value to np.int8(-1) which makes sense if the encoding of a Dataset represents what will live on-disk when the data is encoded...but what lives in .attrs["_FillValue"] when it is decoded/loaded from disk? This is confusing.

Case 1: I have 32-bit float data in-memory. I want to write it to disk to fit in an 8-bit signed NetCDF variable with _Unsigned set to true. That signed integer data should use -1 to mark invalid data. So in any .encoding dictionaries .encoding["_FillValue"] should be -1 to match the on-disk type. In .attrs["_FillValue"] should also always be -1, right? Unless the user is reading poorly formatted data, the _FillValue should never be 255. Right?

Maybe I should get more sleep before dealing with CF.

djhoese avatar Jul 18 '24 02:07 djhoese