xarray
`numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends
netcdf4 didn't have numpy 2-compatible builds before its latest release, so we couldn't test this yet. This may reveal a couple of issues.
- [x] User visible changes (including notable bug fixes) are documented in
whats-new.rst
@kmuehlbauer, would you have time to look into the two remaining failing tests? From what I can tell, this has been emitting a DeprecationWarning for quite a while and raises an OverflowError in numpy>=2. I fear I don't know enough about our encoding operations to debug this.
(in case you know what the issue is, feel free to push a fix to this PR)
@keewis Yes, I'll have a closer look tomorrow.
Adding error log for verbosity:
FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype0-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8
FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype1-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8
OK, that looks like some error using a combination of packed data and _Unsigned-attribute.
I'm currently lacking an upstream-dev environment on my laptop, and WiFi is bad while traveling.
Xref: https://docs.unidata.ucar.edu/nug/current/best_practices.html#bp_Unsigned-Data plus the following section on packed data values.
@keewis I have a local reproducer now, looking into this later today.
Citing from the above link: NetCDF-3 does not have unsigned integer primitive types.
The _Unsigned attribute was introduced to notify any reading application that the signed data on disk should be treated as unsigned in memory. Nevertheless, the _FillValue needs to be of the on-disk type (signed).
The implementation is somewhat hacky, as we also have to account for scale/offset. My current proposal uses the correct _FillValue within the test and a view instead of a cast for the uint -> int transformation.
thanks for looking into this and fixing it, @kmuehlbauer (and so quickly, too)!
Just so I understand the implications of using view: we will only ever encounter numpy (or cupy, which aims to be a drop-in replacement) arrays here, right?
Edit: the failing upstream-dev CI is unrelated, that's a scipy deprecation.
Now that you ask that, I'm a bit unsure.
_FillValue is an attribute here, a scalar, so no arrays are involved. To transform from uint to int, I've wrapped it in a numpy scalar to use view and assigned it back to the attribute. Should work, but there might be another (better) solution I'm missing.
revisiting this, I believe this is fine, though it does fail if, for example, the data is backed by a cupy array. I couldn't get that to work even with main and numpy<2, so maybe this doesn't matter. @dcherian, any opinions? Otherwise this should be ready to merge (edit: and really, this is blocking a bunch of PRs, so merging sooner rather than later would be great).
@keewis I'm not sure if my fix is the ultimate solution. I'm working on a better-fitting solution today.
well, feel free to improve/modify as you see fit!
UnsignedCoder again:
That's a really nasty thing. This is what netcdf4-python tells us about it:
In addition, if maskandscale is set to True, and if the variable has an attribute _Unsigned set to "true", and the variable has a signed integer data type, a view to the data is returned with the corresponding unsigned integer data type. This convention is used by the netcdf-java library to save unsigned integer data in NETCDF3 or NETCDF4_CLASSIC files (since the NETCDF3 data model does not have unsigned integer data types).
So I think a view is probably the right thing to do at that location (I can't think of anything better). I've also had to tweak the encoding side of things to correctly roundtrip, and added another test to make sure everything works as expected with plain _Unsigned="true".
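Conceptually, the decode/encode pair is just a reinterpretation of the same bytes (a minimal sketch of the idea, not xarray's actual coder):

```python
import numpy as np

# On-disk NETCDF3 data: signed int8 with the _Unsigned="true" attribute set.
on_disk = np.array([0, 1, -1, -128], dtype=np.int8)

# Decoding reinterprets the same bytes as unsigned -- no copy, no overflow:
decoded = on_disk.view(np.uint8)
print(decoded)  # [  0   1 255 128]

# Encoding is the symmetric view back to the signed on-disk dtype:
assert np.array_equal(decoded.view(np.int8), on_disk)
```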
the failing doctests CI will be fixed by #9177 (no idea about mypy). So if the implementation of the decoder fix is fine, this should finally be ready for merging.
Presumably the numpy version in the mypy env changed? I'm OK with addressing that later. Shall we merge?
the mypy environment uses the standard environment, which means that unpinning here surfaces the issues in that CI as well. Merging sounds good to me (I just didn't want to merge my own PR).
> I just didn't want to merge my own PR
I think we should all get a little more comfortable doing this. It's immeasurably worse for our user base to leave finished PRs hanging around and unreleased.
FYI this PR seems to break the case where _FillValue is not the same type as the array. I haven't completely tracked it down, but I'm getting failures in my unstable CI with:
if "_FillValue" in attrs:
new_fill = np.array(attrs["_FillValue"])
# use view here to prevent OverflowError
> attrs["_FillValue"] = new_fill.view(signed_dtype).item()
E ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged
Here's a reproducer:
import xarray as xr
import numpy as np
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": -1}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")
# ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged
Basically I'm passing 32-bit float data, so signed_dtype becomes int32 in the encode method. np.array(-1).view(np.int32) is then not allowed, as np.array(-1) is a 64-bit integer. Even if I pass the fill value as an int16 (matching my output encoding), it still fails, because a signed_dtype of int32 is used to match the 32-bit float I'm passing in.
Edit: If I pass the fill value as np.array(-1, dtype=np.int32), then it seems to work.
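A minimal sketch of the failure mode (assuming numpy >= 2, where the default integer is int64 on all platforms):

```python
import numpy as np

# np.array(-1) defaults to int64 (8 bytes), so viewing the resulting
# 0-d array as int32 (4 bytes) changes the itemsize and is rejected:
try:
    np.array(-1).view(np.int32)
except ValueError as err:
    print(err)  # Changing the dtype of a 0d array is only supported ...

# Matching the itemsize up front avoids the error:
assert np.array(-1, dtype=np.int32).view(np.int32).item() == -1
```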
This works for me in latest main branch.
import xarray as xr
import numpy as np
fillvalue = np.int16(-1)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")
@djhoese Are you writing to NETCDF3 or NETCDF4_CLASSIC? You might also just skip the "_Unsigned" attribute and use uint16 as is when writing to NETCDF4.
@kmuehlbauer The netcdf files being produced are NetCDF4 and are being ingested by a third-party Java application so using unsigned types directly isn't allowed (Java doesn't have unsigned types).
It seems I can pass the fill value as np.array(65535, dtype=np.uint16), but then it shows up in the resulting Variable's attrs as 65535. Oh, plain 65535 works too. I haven't tested what this shows up as in a final NetCDF file.
Edit: As far as I can tell, the _FillValue is always supposed to match the dtype of the packed on-disk variable. Specifying 65535 makes this not true.
Ok, I've made a PR for Satpy that I think fixes my use case. I think the main misunderstanding here is that someone used to writing NetCDF files outside of xarray, and possibly familiar with CF standards, will think they need _FillValue to match the dtype of the on-disk/in-file variable. Xarray is assuming here that the _FillValue is in the type of the in-memory data, or at least that's what I'm telling myself. My workaround is to pass the -1 as its unsigned equivalent (65535) and then let xarray do the conversion to the on-disk type (signed int) for me.
In my opinion the fix in this PR is incorrect, but I feel like I'm missing some use case for why it needed to be changed in the first place so I'm not sure I can suggest something better.
You mentioned above that there were overflow cases erroring in numpy 2 with the old code, but if the user specifies _FillValue as a value compatible with the on-disk dtype, that never should have happened, right?
Providing the signed int16 equivalent (-1) as _FillValue should work, as in my example above. Does this not work?
This seems to work perfectly fine:
import xarray as xr
import numpy as np
fillvalue = np.uint16(65535)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")
<xarray.Variable (y: 10, x: 10)> Size: 200B
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int16)
Attributes:
_FillValue: -1
_Unsigned: true
You can specify _FillValue either as uint or int; it will be converted to the packed dtype (int) using .view (this was introduced in this PR, since numpy 2 now errors when just trying to cast).
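For illustration, both spellings reinterpret to the same on-disk bits:

```python
import numpy as np

# Either representation of the fill value views to the same signed bits:
assert np.uint16(65535).view(np.int16) == -1  # unsigned in-memory value
assert np.int16(-1).view(np.int16) == -1      # signed on-disk value
```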
Yes, you're right, this works. I missed that that's what you were doing in your code. However, I'd say this isn't expected from a user's point of view. I don't think I should have to specify a numpy scalar for _FillValue; regular Python integers should work too.
@kmuehlbauer Further up you had said that the change to the new view logic was to avoid overflow complaints from numpy 2. In those cases, what was being provided as the _FillValue?
@kmuehlbauer I have another case I just discovered after working around the -1 case as discussed. I have a dataset coming from somewhere else that is uint8 and a value of 1 is the fill value. The variable needs to be stored as a signed int8 with _Unsigned set to make the third-party application happy.
In [6]: v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.uint8), attrs={"_FillValue": 1}, encoding={"_Unsigned": "true", "dtype": "int8", "zlib": True})
In [7]: uic.encode(v, "test")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[7], line 1
----> 1 uic.encode(v, "test")
File ~/miniforge3/envs/satpy_py312/lib/python3.12/site-packages/xarray/coding/variables.py:524, in UnsignedIntegerCoder.encode(self, variable, name)
522 new_fill = np.array(attrs["_FillValue"])
523 # use view here to prevent OverflowError
--> 524 attrs["_FillValue"] = new_fill.view(signed_dtype).item()
525 # new_fill = signed_dtype.type(attrs["_FillValue"])
526 # attrs["_FillValue"] = new_fill
527 data = duck_array_ops.astype(duck_array_ops.around(data), signed_dtype)
ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged
So the above fails because new_fill = np.array(attrs["_FillValue"]) takes a Python native integer and makes it an np.int64. The .view(signed_dtype) logic then fails because it is trying to view a 64-bit integer as an 8-bit integer.
Note: I copied the code from this PR into my stable xarray environment so I could swap between the two behaviors quickly.
@djhoese Obviously I did not take all possibilities into account, as it unfortunately turns out 😢. I have to admit that the CF coding issues are getting increasingly annoying :roll_eyes:.
I'm currently traveling and may not have time to dedicate to this for the next 2 weeks. If you or others see immediate fixes, please go ahead.
I'll see what I can do. I think the issue I've brought up can be summarized as: _FillValue should be able to be a Python native integer or a numpy scalar. I think the main reason it isn't working is the np.array(attrs["_FillValue"]) call, because it turns a Python native integer into an int64 most of the time.
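A hypothetical helper sketching that idea (convert_fill and its logic are my own illustration under those assumptions, not code from this PR):

```python
import numpy as np

def convert_fill(fill, signed_dtype):
    """Hypothetical helper: coerce a _FillValue given as a Python int or
    numpy scalar to the signed on-disk dtype without tripping numpy 2's
    overflow checks."""
    unsigned_dtype = np.dtype(signed_dtype.str.replace("i", "u"))
    if np.can_cast(np.min_scalar_type(fill), signed_dtype):
        # already representable as the signed on-disk type (e.g. -1)
        return np.array(fill, dtype=signed_dtype).item()
    # otherwise interpret it as the unsigned in-memory value (e.g. 255)
    # and reinterpret the bits as signed; itemsizes match, so .view works
    return np.array(fill, dtype=unsigned_dtype).view(signed_dtype).item()

print(convert_fill(255, np.dtype("int8")))  # -1
print(convert_fill(1, np.dtype("int8")))    # 1
```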
@kmuehlbauer and others, question for you as I work on "fixing" this: Is the roundtrip the only thing that matters? Or can xarray accept input and assume it knows best about what the user wanted? Especially in this _Unsigned case, I'm not sure it is a clear answer. Does the user always provide a _FillValue of the output on-disk type? Or is xarray allowed to normalize any fill value to the on-disk type? If it does that normalization, wrote it to a NetCDF file, and read it back in then (I think) it wouldn't have the original _FillValue the user started with.
In the original create_unsigned_masked_scaled_data test helper function the on-disk type was int8 (i1) and the _FillValue was 255 in the encoding dict. @kmuehlbauer changed that fill value to np.int8(-1) which makes sense if the encoding of a Dataset represents what will live on-disk when the data is encoded...but what lives in .attrs["_FillValue"] when it is decoded/loaded from disk? This is confusing.
Case 1: I have 32-bit float data in-memory. I want to write it to disk to fit in an 8-bit signed NetCDF variable with _Unsigned set to true. That signed integer data should use -1 to mark invalid data. So in any .encoding dictionaries .encoding["_FillValue"] should be -1 to match the on-disk type. In .attrs["_FillValue"] should also always be -1, right? Unless the user is reading poorly formatted data, the _FillValue should never be 255. Right?
Maybe I should get more sleep before dealing with CF.