arkouda icon indicating copy to clipboard operation
arkouda copied to clipboard

`ak.abs(x)` fails on value -(2**63)

Open reuster986 opened this issue 3 years ago • 7 comments

When ak.abs() is called on an int64 array that includes the value -(2**63), it fails to flip the negative sign. Reproducer:

>>> x = [-1, -(2**63), 1]
>>> ak.abs(x)
array[1 -9.2233...e+18 1])
# Still negative even after abs
>>> ak.min(ak.abs(x))
-9.2233...e+18

Expected output of ak.min(ak.abs(x)) is 1.0. Note that ak.abs() always returns a float64 array... maybe we don't want this?

reuster986 avatar Jan 10 '22 21:01 reuster986

The abs value of the min(int), is still min(int) since max(int) is 2**63-1 (i.e. in 2's compliment negating the min value leaves you with the same value.) If you want this to keep returning a float array you can probably convert the int array to float first and then do abs on the floating point values. If you want it to return int instead I don't think you can have abs min(int) be correct.

FWIW I think in C/C++ abs(INT_MIN) is just undefined behavior, but I can't remember offhand what Chapel says.

ronawho avatar Jan 11 '22 12:01 ronawho

Looks like numpy.abs uses the input type to determine the output type and for int64 it also has the abs of int_min issue:

>>> import numpy as np
>>> print(np.abs(np.array([-1, -(2**63), 1], np.int64)))
[                   1 -9223372036854775808                    1]
>>> print(np.abs(np.array([-1, -(2**63), 1], np.float64)))
[1.00000000e+00 9.22337204e+18 1.00000000e+00]

There's also a numpy.fabs which always has float output type.

>>> import numpy as np
>>> print(np.fabs(np.array([-1, -(2**63), 1], np.int64)))
[1.00000000e+00 9.22337204e+18 1.00000000e+00]
>>> print(np.fabs(np.array([-1, -(2**63), 1], np.float64)))
[1.00000000e+00 9.22337204e+18 1.00000000e+00]

ronawho avatar Jan 11 '22 15:01 ronawho

diff --git a/src/EfuncMsg.chpl b/src/EfuncMsg.chpl
index eefd0944..ce8104e5 100644
--- a/src/EfuncMsg.chpl
+++ b/src/EfuncMsg.chpl
@@ -55,7 +55,9 @@ module EfuncMsg
                 {
                     when "abs" {
                         var a = st.addEntry(rname, e.size, real);
-                        a.a = Math.abs(e.a);                      
+                        forall (aa, ea) in zip(a.a, e.a) {
+                          aa = Math.abs(ea:real);
+                        }
                     }
                     when "log" {
                         var a = st.addEntry(rname, e.size, real);
diff --git a/tests/numeric_test.py b/tests/numeric_test.py
index 07cde24b..2f8d9155 100644
--- a/tests/numeric_test.py
+++ b/tests/numeric_test.py
@@ -118,6 +118,10 @@ class NumericTest(ArkoudaTest):
         self.assertEqual('type of argument "pda" must be arkouda.pdarrayclass.pdarray; got list instead', 
                         cm.exception.args[0])  
 
+        na = np.array([-1, -(2**63), 1])
+        pda = ak.array(na)
+        self.assertTrue((np.fabs(na) == ak.abs(pda).to_ndarray()).all())
+
     def testCumSum(self):
         na = np.linspace(1,10,10)
         pda = ak.array(na)

Should be the fix if you want to keep the output float64

ronawho avatar Jan 11 '22 15:01 ronawho

Thanks @ronawho . What raises the importance of this edge case is that -(2**63) is the value of pandas.NaT (Not a Time), which is like nan for Datetime and Timedelta objects. This value occurs somewhat frequently in arkouda int64 arrays. Ideally, in time-related arrays, we should treat it like a NA/nan value with respect to reductions, but there is currently no server-side distinction between regular int64 and time-related arrays.

Options:

  1. Always cast to real before abs a. Pro: minimal code change b. Con: does not handle NaT as intended
  2. Modify server to treat -(2**63) like nan in all int64 arrays (possibly change abs(int) to return int) a. Pro: integer nan would be useful b. Con: lots of server-side code changes; user cannot use -(2**63) as a value
  3. Modify client to mask out NaT elements for time-related arrays (abs(-2**63) would remain undefined behavior for regular int64) a. Pro: both int64 and time-related arrays behave as users expect b. Con: performance hit for time arrays; significant new client code

Asking for opinions from @mhmerrill , @glitch , @pierce314159 , @bradcray , and others.

reuster986 avatar Jan 11 '22 20:01 reuster986

I know we've talked about this in other tickets, but there is precedent for Option 2. In Pandas 0.24 they added Int64 versus numpy int64 as an extension dtype. I feel like if we were to follow that route either 2 things would happen. We'd either need to carry around a boolean array to mark out all of the NaN locations, or we'd define our own type in Chapel in which case we'd run into the same arguments about operator explosion (i.e. same arguments for/against adding uint(64)). Or as you said we'd need to modify a lot of code.

I think I lean towards option 3 where we could carry the mask array internally inside of a composite entry class. If no NaN values exist we don't have to create the array, we could probably do it on demand (lazy initialization).

glitch avatar Jan 12 '22 13:01 glitch

For option 1 you could also special case the code to return NaN for -(2**63):

forall (aa, ea) in zip(a.a, e.a) {
  aa = if ea == min(int) then Math.NAN else Math.abs(ea:real);
}

At a high level I like the idea of having an abs that returns int and fabs that has the current abs behavior of returning a float. Then it seems like it'd be nice to have different types for time-related things, but I don't have a good sense for how much additional code that would require. I also don't have a sense for how much special casing time objects need. Are the really just ints that want a special NaT sentinel or are there other unique things about them?

ronawho avatar Jan 12 '22 16:01 ronawho

Thanks for the input, guys. I'm waiting to make a decision on this until we resolve https://github.com/Bears-R-Us/arkouda/issues/1013#issuecomment-1017591744 , since that will determine what the time-related classes look like.

reuster986 avatar Jan 21 '22 01:01 reuster986