arkouda
Python list of positive integers becomes float if it contains values both >= and < 2^63
With ak.array(x), a Python list x of positive integers < 2^64 is converted to a float pdarray if the list contains both values >= 2^63 and values < 2^63.
Not a big issue, because using a numpy uint64 array instead of a Python list eliminates the problem.
Thanks for reporting this! If I'm understanding correctly, I think you need to specify the dtype as ak.uint64.
In our latest tagged release (v2022.04.15):
>>> ak.get_config()['arkoudaVersion']
'v2022.04.15'
>>> x = [2**63, 6, 2**63-1, 2**63+1]
>>> x
[9223372036854775808, 6, 9223372036854775807, 9223372036854775809]
>>> ak.array(x)
array([9.2233720368547758e+18 6 9.2233720368547758e+18 9.2233720368547758e+18])
>>> ak.array(x, dtype=ak.uint64)
array([9223372036854775808 6 9223372036854775808 9223372036854775808])
Note that if you're using values with the high bit set (e.g. 2**64-1), there will still be problems. This is captured in issue #1311 and will be resolved by #1312.
Let me know if I didn't completely address your issue!
It's also worth noting that this behavior is the same as numpy's:
>>> x = [2**63, 6, 2**63-1, 2**63+1]
>>> np.array(x)
array([9.22337204e+18, 6.00000000e+00, 9.22337204e+18, 9.22337204e+18])
>>> np.array(x, dtype=np.uint64)
array([9223372036854775808, 6, 9223372036854775807, 9223372036854775809], dtype=uint64)
Agree that it is consistent with numpy, but believe it is undesirable behavior for arkouda if avoidable. If all values can be represented as uint64 (e.g. 64 bit hashes) that is a better default option than float. One can always convert to a float if desired, but there is no way to get back the bits of precision lost in the float conversion.
Not a bug -- but I offer the issue as suggested default behavior. Thanks.
This is related to a data error that was difficult to track down (even though it was most likely generated by np.array(x) rather than ak.array(x)).
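To make the precision loss mentioned above concrete, here is a minimal sketch in plain Python (no arkouda or numpy needed). It assumes only that the float in question is an IEEE 754 double, which has a 53-bit significand, so neighbouring integers around 2**63 collapse onto the same float. The variable names are illustrative.

```python
# A float64 has a 53-bit significand, so integers near 2**63 cannot all be
# represented exactly; nearby values round to the same float.
x = [2**63, 6, 2**63 - 1, 2**63 + 1]

# Convert to float and back, as happens when the list is inferred as float.
as_float = [float(v) for v in x]
round_trip = [int(f) for f in as_float]

# Flag every element whose original integer value cannot be recovered.
lost = [orig != back for orig, back in zip(x, round_trip)]

print(round_trip)
print(lost)
```

Once the values have passed through float, 2**63 - 1 and 2**63 + 1 are indistinguishable from 2**63, which is why a 64-bit hash stored this way cannot be recovered afterwards.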
Hmmm, I'm curious what @reuster986 and @mhmerrill think. Normally numpy is our gold standard as far as desired behavior, but maybe it makes sense to deviate in this case?
Yes, I'll be interested too. Thank you.
I think I agree with @21771 that we should deviate from numpy in this specific case, in order to preserve the precision of uint64 values like hashes.
I am OK with the deviation in this case; hopefully there are no unintended consequences. Or maybe we should have a wrapper which preserves the dtype if all of the elements are within the range of the dtype? I guess we would have to cruise down the list, make sure all the values are within the uint64 range, and also choose the best dtype to use... this is probably why numpy does not have this behavior.
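The wrapper idea above could look roughly like the following. This is only a sketch of the range check being discussed, not arkouda's implementation: `infer_integer_dtype` is a hypothetical helper name, and it uses numpy dtypes as stand-ins, on the assumption that the result would then be passed as the `dtype` argument. Note that it needs a full pass over the list (min and max), which is the extra cost mentioned above.

```python
import numpy as np

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UINT64_MAX = 2**64 - 1


def infer_integer_dtype(values):
    """Hypothetical helper: choose a 64-bit dtype that holds every value exactly.

    Returns None when the list is not purely Python ints, so the caller can
    fall back to the library's default inference.
    """
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return None
    if not values:
        return np.dtype(np.int64)
    lo, hi = min(values), max(values)  # one scan each over the list
    if INT64_MIN <= lo and hi <= INT64_MAX:
        return np.dtype(np.int64)      # everything fits in a signed 64-bit int
    if 0 <= lo and hi <= UINT64_MAX:
        return np.dtype(np.uint64)     # non-negative and fits in 64 bits
    return np.dtype(np.float64)        # no exact 64-bit integer representation


x = [2**63, 6, 2**63 - 1, 2**63 + 1]
arr = np.array(x, dtype=infer_integer_dtype(x))
print(arr.dtype, arr)
```

With this check, the example list from the issue is stored as uint64 with every value intact, while a list mixing negative values with values >= 2**63 still falls back to float.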
Just to show how pandas and numpy handle this issue: it appears that numpy defaults to float, while pandas raises an error unless dtype is specified.
>>> import pandas as pd
>>> import numpy as np
>>> x = [2**63, 6, 2**63-1, 2**63+1]
>>> np.array(x)
# array([9.22337204e+18, 6.00000000e+00, 9.22337204e+18, 9.22337204e+18])
>>> pd.array(x)
# TypeError: cannot safely cast non-equivalent float64 to int64
>>> np.array(x, dtype=np.uint64)
# array([9223372036854775808, 6, 9223372036854775807,
# 9223372036854775809], dtype=uint64)
>>> pd.array(x, dtype=np.uint64)
# <PandasArray>
# [9223372036854775808, 6, 9223372036854775807, 9223372036854775809]
# Length: 4, dtype: uint64