arkouda

Python positive Integer list becomes Float if values both GTE and LT 2^63

Open 21771 opened this issue 2 years ago • 8 comments

With ak.array(x), a Python list x of positive integers below 2^64 is converted to a float pdarray when the list contains both values >= 2^63 and values < 2^63.

Not a big issue, because passing a numpy uint64 array instead of a Python list avoids the problem.
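A minimal sketch of that workaround, using only NumPy. The variable names are illustrative, and the final hand-off to ak.array is left as a comment because it assumes a connected arkouda server.

```python
import numpy as np

# Mixed list: some values fit in int64, others only fit in uint64.
x = [2**63, 6, 2**63 - 1, 2**63 + 1]

# Building the array with an explicit uint64 dtype keeps every value exact.
exact = np.array(x, dtype=np.uint64)

# Every element round-trips back to the original Python integer.
assert exact.tolist() == x

# Assumption: with a running server, the uint64 array could then be passed on,
# e.g. ak.array(exact), instead of passing the raw Python list.
```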

21771 avatar Apr 18 '22 17:04 21771

Thanks for reporting this! If I'm understanding correctly, I think you need to specify the dtype as ak.uint64.

In our latest tagged release (v2022.04.15):

>>> ak.get_config()['arkoudaVersion']
'v2022.04.15'
>>> x = [2**63, 6, 2**63-1, 2**63+1]
>>> x
[9223372036854775808, 6, 9223372036854775807, 9223372036854775809]
>>> ak.array(x)
array([9.2233720368547758e+18 6 9.2233720368547758e+18 9.2233720368547758e+18])
>>> ak.array(x, dtype=ak.uint64)
array([9223372036854775808 6 9223372036854775808 9223372036854775808])

Note that if you're using values with the high bit set (e.g. 2**64-1), there will still be problems. This is captured in issue #1311 and will be resolved by #1312.

Let me know if I didn't completely address your issue!

stress-tess avatar Apr 21 '22 17:04 stress-tess

It's also worth noting that this behavior is the same as numpy:

>>> import numpy as np
>>> x = [2**63, 6, 2**63-1, 2**63+1]
>>> np.array(x)
array([9.22337204e+18, 6.00000000e+00, 9.22337204e+18, 9.22337204e+18])
>>> np.array(x, dtype=np.uint64)
array([9223372036854775808,                   6, 9223372036854775807,  9223372036854775809], dtype=uint64)

stress-tess avatar Apr 21 '22 18:04 stress-tess

Agree that it is consistent with numpy, but believe it is undesirable behavior for arkouda if avoidable. If all values can be represented as uint64 (e.g. 64 bit hashes) that is a better default option than float. One can always convert to a float if desired, but there is no way to get back the bits of precision lost in the float conversion.
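To illustrate the point about lost precision, here is a small sketch in plain Python. It relies only on the fact that a 64-bit float has a 53-bit significand, so neighbouring integers near 2**63 collapse onto the same float value.

```python
# Three distinct integers around 2**63.
values = [2**63 - 1, 2**63, 2**63 + 1]

# Converting to float rounds each one to the nearest representable double.
as_float = [float(v) for v in values]

# Converting back to int cannot recover the original values.
round_trip = [int(f) for f in as_float]

# All three inputs end up as the same integer after the round trip.
distinct_after = len(set(round_trip))
```

For 64-bit hashes this means different hashes can become indistinguishable once they have passed through a float array.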

Not a bug -- but I offer the issue as suggested default behavior. Thanks.

This is related to a data error that was difficult to track down (even though it was most likely generated by np.array(x) rather than ak.array(x)).

21771 avatar Apr 22 '22 18:04 21771

Hmmm, I'm curious what @reuster986 and @mhmerrill think. Normally numpy is our gold standard for desired behavior, but maybe it makes sense to deviate in this case?

stress-tess avatar Apr 22 '22 19:04 stress-tess

Yes, I'll be interested too. Thank you.

21771 avatar Apr 25 '22 13:04 21771

I think I agree with @21771 that we should deviate from numpy in this specific case, in order to preserve the precision of uint64 values like hashes.

reuster986 avatar Apr 26 '22 19:04 reuster986

I am OK with the deviation in this case; hopefully there are no unintended consequences. Or maybe we should have a wrapper which preserves the dtype if all of the elements are within the range of the dtype? I guess we would have to cruise down the list and make sure all the values are within the uint64 range and also choose the best dtype to use... this is probably why numpy does not have this behavior.
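One way such a wrapper could look is sketched below with NumPy. The names infer_integer_dtype and to_array are hypothetical and are not part of arkouda or numpy. The sketch scans the list once for its minimum and maximum, which is the extra pass over the data discussed above.

```python
import numpy as np

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UINT64_MAX = 2**64 - 1


def infer_integer_dtype(values):
    """Hypothetical helper: return a 64-bit integer dtype that holds every
    value exactly, or None if no such dtype applies."""
    # Only handle non-empty lists made purely of Python ints (bools excluded).
    if not values or not all(
        isinstance(v, int) and not isinstance(v, bool) for v in values
    ):
        return None
    lo, hi = min(values), max(values)
    if INT64_MIN <= lo and hi <= INT64_MAX:
        return np.int64  # keep the usual signed default when it fits
    if lo >= 0 and hi <= UINT64_MAX:
        return np.uint64  # non-negative values that need the top bit
    return None  # e.g. a negative value mixed with a value >= 2**63


def to_array(values):
    """Hypothetical wrapper: use the inferred dtype, else default inference."""
    dtype = infer_integer_dtype(values)
    if dtype is None:
        return np.array(values)
    return np.array(values, dtype=dtype)


hashes = to_array([2**63, 6, 2**63 - 1, 2**63 + 1])
small = to_array([1, 2, 3])
```

In this sketch, lists that fit in int64 keep the signed default, so only lists that would otherwise be promoted to float change behavior.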

mhmerrill avatar Apr 28 '22 15:04 mhmerrill

Just to show how pandas and numpy handle this issue: it appears that numpy defaults to float, while pandas raises an error unless dtype is specified.

>>> import pandas as pd
>>> import numpy as np
>>> x = [2**63, 6, 2**63-1, 2**63+1]
>>> np.array(x)
# array([9.22337204e+18, 6.00000000e+00, 9.22337204e+18, 9.22337204e+18])

>>> pd.array(x)
# TypeError: cannot safely cast non-equivalent float64 to int64

>>> np.array(x, dtype=np.uint64)
# array([9223372036854775808,                   6, 9223372036854775807,
#        9223372036854775809], dtype=uint64)

>>> pd.array(x, dtype=np.uint64)
# <PandasArray>
# [9223372036854775808, 6, 9223372036854775807, 9223372036854775809]
# Length: 4, dtype: uint64

stress-tess avatar May 10 '22 17:05 stress-tess