[Python] Implement conversion between integer coded as floating points with NaN to an Arrow integer type
For example, if pandas has cast integer data to float, this would allow the integer data to be recovered (as long as every value has magnitude at most 2^53, the range in which float64 represents integers exactly).
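To make the 2^53 bound concrete, here is a small standard-library sketch. The helper name `round_trips` is made up for illustration and is not part of pandas or Arrow. It relies only on the fact that a Python `float` is an IEEE 754 double with a 53-bit significand.

```python
# Largest magnitude up to which every integer is exactly representable
# as a float64 (53-bit significand).
MAX_EXACT = 2 ** 53

def round_trips(n: int) -> bool:
    """Return True if n survives an int -> float -> int round trip."""
    return int(float(n)) == n

print(round_trips(MAX_EXACT))      # True
print(round_trips(MAX_EXACT + 1))  # False: rounds to 2**53 as a float
```

Larger integers can still round-trip when they happen to be representable (for example 2^53 + 2), but exactness is no longer guaranteed for every value.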
Reporter: Wes McKinney / @wesm
Note: This issue was originally created as ARROW-488. Please see the migration documentation for further details.
Miki Tebeka / @tebeka: Is the dtype still integer? I see that pandas changes the dtype once you add a NaN:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: s = pd.Series([1,2,3])
In [4]: s
Out[4]:
0    1
1    2
2    3
dtype: int64
In [5]: s[1] = np.nan
In [6]: s
Out[6]:
0    1.0
1    NaN
2    3.0
dtype: float64
Wes McKinney / @wesm: @tebeka the pandas behavior is the motivation for this JIRA
Because pandas implicitly converts from integer to float when null values are introduced, the task in this JIRA is to convert safely from floating point with NaNs to Arrow integer types with proper nulls:
In [2]: pyarrow.Array.from_list([1, 2, None, 4, None])
Out[2]:
<pyarrow.array.Int64Array object at 0x7fb24fe97bd8>
[
1,
2,
NA,
4,
NA
]
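As a rough, library-free illustration of the target representation, the sketch below splits NaN-coded floats into an integer values list plus a validity list, which loosely mirrors how Arrow keeps data and null information separate. The function name `split_nan_floats` and the placeholder value 0 are assumptions for this example, and no range or safety checking is done here.

```python
import math

def split_nan_floats(values):
    """Split floats into (ints, validity); NaN slots are marked invalid.

    Hypothetical helper for illustration only, not a pyarrow API.
    """
    ints, validity = [], []
    for v in values:
        if math.isnan(v):
            ints.append(0)          # placeholder; the slot is masked out
            validity.append(False)
        else:
            ints.append(int(v))
            validity.append(True)
    return ints, validity

ints, validity = split_nan_floats([1.0, 2.0, float("nan"), 4.0, float("nan")])
print(ints)      # [1, 2, 0, 4, 0]
print(validity)  # [True, True, False, True, False]
```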
Wes McKinney / @wesm: After ARROW-618, this functionality should be more easily achievable through syntax like
Array.from_pandas(float_data, type=int64())
This would raise an exception on any values that are not safe to cast (absolute value exceeding 2^53).
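The checked conversion described here might look roughly like the following pure-Python sketch. The name `checked_float_to_int` is hypothetical, and rejecting infinities and fractional values is an extra assumption of this example; the comment above only mentions magnitudes exceeding 2^53.

```python
import math

MAX_EXACT = 2 ** 53  # float64 represents every integer up to this magnitude

def checked_float_to_int(values):
    """Convert floats to ints, mapping NaN to None and raising on unsafe values.

    Hypothetical helper for illustration only, not a pyarrow API.
    """
    out = []
    for i, v in enumerate(values):
        if math.isnan(v):
            out.append(None)  # NaN becomes a proper null
            continue
        # Assumption: infinities and fractional values are also treated as unsafe.
        if math.isinf(v) or abs(v) > MAX_EXACT or v != math.floor(v):
            raise ValueError(f"value {v!r} at index {i} cannot be safely cast to integer")
        out.append(int(v))
    return out

print(checked_float_to_int([1.0, float("nan"), 3.0]))  # [1, None, 3]
```

Passing a value such as `2.0 ** 60` or `1.5` raises `ValueError` instead of silently losing information.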
Wes McKinney / @wesm: This seems like it could simply be a casting option for floating point to integer conversions
Antoine Pitrou / @pitrou: Is this the same as ARROW-2135, or am I missing something here?
Wes McKinney / @wesm:
As currently scoped, yes. This functionality is not available in arrow::compute::Cast though, so perhaps we can repurpose this JIRA to add this functionality, which may be a bit more complicated (since Cast is not yet able to deal with any null sentinels at all)
Wes McKinney / @wesm:
It would be good to have an explicit cast option for this, like arr.cast(int64(), nan_as_null=True). The safe=False/True option does not provide enough control
Wes McKinney / @wesm: Circling back on this some time later. I think it would be better to implement this as a separate function (whenever someone needs it) instead of adding complexity to Cast
Wes McKinney / @wesm: This could be implemented as a standalone function in the new kernels framework
This issue hasn't had activity in a long time. If it's still being worked on, please leave a comment. Otherwise, it will be closed on 23rd June.
Labelled Status: Stale-Warning for tracking.