numpy icon indicating copy to clipboard operation
numpy copied to clipboard

BUG: Masked array default fill value can overflow

Open ngoldbaum opened this issue 1 year ago • 11 comments

Describe the issue:

For both signed and unsigned integers the default fill value is 99999, while for floats it is 1e20.

This is problematic for (u)int[8,16] as well as half floats, which do not contain the default fill value in their valid range.

Reproduce the code example:

>>> arr = np.ma.array([1, 2, 3], mask=[1, 0, 1], dtype=np.int8)
>>> arr.filled()
array([63,  2, 63], dtype=int8)

>>> arr = np.ma.array([1, 2, 3], mask=[1, 0, 1], dtype=np.float16)
>>> arr.filled()
/Users/goldbaum/Documents/numpy/build-install/usr/lib/python3.11/site-packages/numpy/ma/core.py:3873: RuntimeWarning: overflow encountered in cast
  np.copyto(result, fill_value, where=m)
Out[16]: array([inf,  2., inf], dtype=float16)

Error message:

N/A

Python and NumPy Versions:

Numpy 2.0 dev on python 3.11.

Runtime Environment:

N/A

Context for the issue:

I don't think this is an urgent but didn't see an issue describing this behavior so I'm filing this for future searchers.

That said, this does complicate the NEP 50 implementation because we need to have a number of workarounds so that this continues to work. I think it would be better to choose a default fill value that fits in the range of the data ([i,f]info.max?) for these types, but I have no idea what that entails for backward compatibility.

ngoldbaum avatar Jan 24 '24 15:01 ngoldbaum

If a specific fill value is absent in filled, it gets the default fill value for that dtype. The catch is that it gets the default fill value for that "kind" of dtype. below code from numpy.ma core code is given for reference: default_filler = {'b': True, 'c': 1.e20 + 0.0j, 'f': 1.e20, 'i': 999999, 'O': '?', 'S': b'N/A', 'u': 999999, 'V': b'???', 'U': 'N/A' }

You can see that for any kind of float (float16, float32, float64) the same fill value is being used. For a fill value of "1.e20" you always get the runtime error while typecasting (say to float16). I tried the below code to reproduce that issue: converted_value = np.array(1e+20, dtype=np.float64).astype(np.float16)

One workaround is to define fill values to each specific dtype. Another solution I tried that worked was to replace default value of float to 'inf', which did not raise any issue for any specific float dtypes. fill_value = np.inf

msavinash avatar Jan 25 '24 18:01 msavinash

The main issue here isn't so much the implementation of changing the default fill value, or even coming up with better default fill values, it's whether or not changing this will break user code. The default fill value is used in a number of places implicitly in operations so it's not clear to me if changing the default will have unintended consequences elsewhere.

ngoldbaum avatar Jan 25 '24 18:01 ngoldbaum

Since the fill value is not determined by the user, we should be able to change the default values without major code breakage. But there are chances of user code that depends on the default values (can't think of an example right off the bat). Maybe change and document it?

msavinash avatar Jan 25 '24 19:01 msavinash

Also this warning is not seen in earlier versions of Numpy (I tried with v1.23.5). So it can be changed, although I need to verify how it's being handled in earlier versions.

msavinash avatar Jan 25 '24 21:01 msavinash

Hello, I just want to add to the discussion that this issue is causing real problems in some cases. Since 2.0 does not accept out of bounds python ints this causes problems when serializing and deserializing masked arrays with default fill values. For example when multiprocessing with dask, it serializes the masked array by parts and deserializes the fill value to python integer causing an overflow exception when rebuilding the masked array. I can't tell what would be the sane default for fill_value but having something that is well behaved under any circumstance would be ideal.

Hope this helps and thank you!

BielStela avatar Jul 18 '24 09:07 BielStela

Hi, I don't think this interaction with dask has been reported. Can you make a short runnable example that demonstrates the issue?

ngoldbaum avatar Jul 18 '24 12:07 ngoldbaum

Sure,

In [1]: import numpy as np
   ...: import dask
   ...: import dask.distributed

In [2]: client = dask.distributed.Client()

In [3]: def f():
   ...:     return np.ma.masked_array([1,1], [0,1], dtype=np.uint8)
   ...:

In [4]: task = dask.delayed(f)()

In [5]: client.compute(task).result()
2024-07-18 14:48:24,570 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File ".venv/lib/python3.12/site-packages/numpy/ma/core.py", line 489, in _check_fill_value
    fill_value = np.asarray(fill_value, dtype=ndtype)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: Python integer 999999 out of bounds for uint8

using: Python 3.12.4 numpy 2.0.0 dask 2024.7.0

BielStela avatar Jul 18 '24 12:07 BielStela

ping @seberg in case you weren't aware of this interaction between dask, masked arrays, and the NEP 50 changes

Maybe we should use the minimum or maximum int as the default fill values for ints that are too small to store 999999? Or whatever value it overflows to that we were accidentally using before?

ngoldbaum avatar Jul 18 '24 18:07 ngoldbaum

I was faintly aware, but not that it was an issue here. A solution may be to make sure that the default fill value is typed to e.g. int64. That way NumPy will keep force-casting it. Not 100% sure if that has other fallout, but Ithink the fill-value is always an array so it should be fine.

seberg avatar Jul 19 '24 05:07 seberg

The solution we apply goes in that line: we force-cast the fill_value with numpy with something like this to force a controlled overflow

def f():
    arr = np.ma.masked_array([1,1], [0,1], dtype=np.uint8)
    arr.fill_value = arr.fill_value
    return arr

fill_value will only overflow if is the default value so It does the job avoiding serialization errors because it overflows in the setter, becoming 63~ and then it is safe to move it around. Not the prettiest thing I must say but it imitates the old (bad) behavior. Luckily in our case we can live with a trashy fill_value since we never use it to fill the array.

BielStela avatar Jul 21 '24 15:07 BielStela

With type np.uint8, default fill value can be fixed by calling _check_fill_value to force casting its type. But np.float16 can't it will always get a overflow runtime warning when use the default_value. And another issue is, _check_fill_value is define to get a (valid) fill value of a certain type, but it doesn't with np.float16.

>>> check = np.ma.core._check_fill_value
>>> test = check(None,np.float16)
>>> test
array(inf, dtype=float16)
>>> test = check(test,np.float16)
C:\Users\Eden_\Desktop\Coder\MachineLearning\my_numpy\numpy\numpy\ma\core.py:492: RuntimeWarning: overflow encountered in cast  fill_value = np.asarray(fill_value, dtype=ndtype)
>>> test
array(inf, dtype=float16)

fengluoqiuwu avatar Oct 20 '24 12:10 fengluoqiuwu

This also breaks aggregation operations for masked arrays with small dtype:

this raises an error

import numpy as np
x = np.arange(10, dtype="uint8")
a = np.ma.MaskedArray(x, x & 1, fill_value=111)
a.max(keepdims=True)

Due to fill_value being too large, error is raised from here (line 6089):

https://github.com/numpy/numpy/blob/5047f7bef3e727d1a9c0f9fa66d67bed8344b232/numpy/ma/core.py#L6082-L6089

Essentially result array is produced with .view(type(self)), which ends up losing fill_value=111 parameter, so gets replaced by 99999, then it raises an error from np.copy as you can't fit that.

I think it should be

result = self.filled(fill_value).max( 
     axis=axis, out=out, **kwargs).view(type(self))
result.fill_value = self.fill_value  

Kirill888 avatar Dec 03 '24 12:12 Kirill888

@Kirill888 there is a hot-fix that will be in 2.1.x (soon) and 2.2. (also not far future).

Although, it would be nice to figure out a clean way to handle this. Maybe something like: unless explicitly set don't usually propagate mask values (right now "explicitly set" is not defined, because the value is cached and thus considered "set" on random operations).

seberg avatar Dec 03 '24 12:12 seberg