arkouda
arkouda copied to clipboard
`ak.abs(x)` fails on value -(2**63)
When ak.abs()
is called on an int64
array that includes the value -(2**63), it fails to flip the negative sign. Reproducer:
>>> x = [-1, -(2**63), 1]
>>> ak.abs(x)
array[1 -9.2233...e+18 1])
# Still negative even after abs
>>> ak.min(ak.abs(x))
-9.2233...e+18
Expected output of ak.min(ak.abs(x))
is 1.0. Note that ak.abs()
always returns a float64
array... maybe we don't want this?
The abs value of the min(int)
, is still min(int)
since max(int)
is 2**63-1
(i.e. in 2's compliment negating the min value leaves you with the same value.) If you want this to keep returning a float array you can probably convert the int array to float first and then do abs on the floating point values. If you want it to return int instead I don't think you can have abs min(int)
be correct.
FWIW I think in C/C++ abs(INT_MIN)
is just undefined behavior, but I can't remember offhand what Chapel says.
Looks like numpy.abs
uses the input type to determine the output type and for int64 it also has the abs of int_min issue:
>>> import numpy as np
>>> print(np.abs(np.array([-1, -(2**63), 1], np.int64)))
[ 1 -9223372036854775808 1]
>>> print(np.abs(np.array([-1, -(2**63), 1], np.float64)))
[1.00000000e+00 9.22337204e+18 1.00000000e+00]
There's also a numpy.fabs
which always has float output type.
>>> import numpy as np
>>> print(np.fabs(np.array([-1, -(2**63), 1], np.int64)))
[1.00000000e+00 9.22337204e+18 1.00000000e+00]
>>> print(np.fabs(np.array([-1, -(2**63), 1], np.float64)))
[1.00000000e+00 9.22337204e+18 1.00000000e+00]
diff --git a/src/EfuncMsg.chpl b/src/EfuncMsg.chpl
index eefd0944..ce8104e5 100644
--- a/src/EfuncMsg.chpl
+++ b/src/EfuncMsg.chpl
@@ -55,7 +55,9 @@ module EfuncMsg
{
when "abs" {
var a = st.addEntry(rname, e.size, real);
- a.a = Math.abs(e.a);
+ forall (aa, ea) in zip(a.a, e.a) {
+ aa = Math.abs(ea:real);
+ }
}
when "log" {
var a = st.addEntry(rname, e.size, real);
diff --git a/tests/numeric_test.py b/tests/numeric_test.py
index 07cde24b..2f8d9155 100644
--- a/tests/numeric_test.py
+++ b/tests/numeric_test.py
@@ -118,6 +118,10 @@ class NumericTest(ArkoudaTest):
self.assertEqual('type of argument "pda" must be arkouda.pdarrayclass.pdarray; got list instead',
cm.exception.args[0])
+ na = np.array([-1, -(2**63), 1])
+ pda = ak.array(na)
+ self.assertTrue((np.fabs(na) == ak.abs(pda).to_ndarray()).all())
+
def testCumSum(self):
na = np.linspace(1,10,10)
pda = ak.array(na)
Should be the fix if you want to keep the output float64
Thanks @ronawho . What raises the importance of this edge case is that -(2**63)
is the value of pandas.NaT
(Not a Time), which is like nan
for Datetime and Timedelta objects. This value occurs somewhat frequently in arkouda int64 arrays. Ideally, in time-related arrays, we should treat it like a NA/nan
value with respect to reductions, but there is currently no server-side distinction between regular int64 and time-related arrays.
Options:
- Always cast to
real
beforeabs
a. Pro: minimal code change b. Con: does not handleNaT
as intended - Modify server to treat
-(2**63)
likenan
in all int64 arrays (possibly changeabs(int)
to returnint
) a. Pro: integernan
would be useful b. Con: lots of server-side code changes; user cannot use-(2**63)
as a value - Modify client to mask out
NaT
elements for time-related arrays (abs(-2**63)
would remain undefined behavior for regular int64) a. Pro: both int64 and time-related arrays behave as users expect b. Con: performance hit for time arrays; significant new client code
Asking for opinions from @mhmerrill , @glitch , @pierce314159 , @bradcray , and others.
I know we've talked about this in other tickets, but there is precedent for Option 2. In Pandas 0.24 they added Int64
versus numpy int64
as an extension dtype. I feel like if we were to follow that route either 2 things would happen. We'd either need to carry around a boolean array to mark out all of the NaN locations, or we'd define our own type in Chapel in which case we'd run into the same arguments about operator explosion (i.e. same arguments for/against adding uint(64)). Or as you said we'd need to modify a lot of code.
I think I lean towards option 3 where we could carry the mask array internally inside of a composite entry class. If no NaN values exist we don't have to create the array, we could probably do it on demand (lazy initialization).
For option 1 you could also special case the code to return NaN
for -(2**63)
:
forall (aa, ea) in zip(a.a, e.a) {
aa = if ea == min(int) then Math.NAN else Math.abs(ea:real);
}
At a high level I like the idea of having an abs
that returns int and fabs
that has the current abs
behavior of returning a float. Then it seems like it'd be nice to have different types for time-related things, but I don't have a good sense for how much additional code that would require. I also don't have a sense for how much special casing time objects need. Are the really just ints that want a special NaT
sentinel or are there other unique things about them?
Thanks for the input, guys. I'm waiting to make a decision on this until we resolve https://github.com/Bears-R-Us/arkouda/issues/1013#issuecomment-1017591744 , since that will determine what the time-related classes look like.