pandas2
Separate pd2.NaT for datetime vs timedelta
A lot of headaches are caused by the fact that pd.NaT is usually a datetime but occasionally a timedelta. In some cases this leads to unavoidable internal inconsistency (pandas-dev/pandas#19124). For pandas2 it is worth considering breaking these into two distinct constants with unambiguous types.
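To make the proposal concrete, here is a minimal pure-Python sketch of two distinct missing-value constants. The names `NaTDatetime`, `NaTTimedelta`, `NAT_DT` and `NAT_TD` are hypothetical and are not part of pandas or pandas2; the point is only that once each sentinel knows its own type, the result type of an addition is never ambiguous.

```python
import datetime as dt


class NaTDatetime:
    """Hypothetical missing-datetime sentinel (not a pandas API)."""

    def __repr__(self):
        return "NaT[datetime]"

    def __add__(self, other):
        # datetime + timedelta -> datetime, so the result stays a missing datetime.
        if isinstance(other, (dt.timedelta, NaTTimedelta)):
            return self
        # datetime + datetime is not defined, missing or not.
        return NotImplemented

    __radd__ = __add__


class NaTTimedelta:
    """Hypothetical missing-timedelta sentinel (not a pandas API)."""

    def __repr__(self):
        return "NaT[timedelta]"

    def __add__(self, other):
        # timedelta + datetime -> datetime
        if isinstance(other, (dt.datetime, NaTDatetime)):
            return NAT_DT
        # timedelta + timedelta -> timedelta
        if isinstance(other, (dt.timedelta, NaTTimedelta)):
            return self
        return NotImplemented

    __radd__ = __add__


# Module-level singletons, analogous to how pd.NaT is a single constant today.
NAT_DT = NaTDatetime()
NAT_TD = NaTTimedelta()

print(NAT_TD + dt.timedelta(days=1))       # a missing timedelta
print(dt.datetime(2016, 12, 31) + NAT_TD)  # a missing datetime
```

With this split, adding two missing datetimes raises `TypeError`, just as adding two real datetimes does, instead of silently returning a value of unclear type.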
I believe the API in arrow/pandas2 is currently pushing in the opposite direction, using a unified NA scalar (e.g. below). However, separating the two will probably be easier than it sounds, because missing-ness is always tracked in a separate bitmap rather than as special sentinel values.
In [132]: import pyarrow as pa
In [133]: pa.array([1, 2, None])
Out[133]:
<pyarrow.lib.Int64Array object at 0x000000000BCBBB88>
[
1,
2,
NA
]
In [134]: pa.array([1, 2, None])[-1]
Out[134]: NA
In [135]: import datetime
In [136]: pa.array([datetime.datetime(2016, 12, 31), None])
Out[136]:
<pyarrow.lib.TimestampArray object at 0x000000000BD2CB38>
[
Timestamp('2016-12-31 00:00:00'),
NA
]
In [137]: pa.array([datetime.datetime(2016, 12, 31), None])[-1]
Out[137]: NA
In [138]: type(_)
Out[138]: pyarrow.lib.NAType
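As a rough illustration of why a separate bitmap makes typed nulls cheap, here is a toy pure-Python model. It is not how pyarrow is implemented; `BitmapArray` and `TypedNull` are hypothetical names, and the dtype is just a string. Because the array already knows its dtype and validity lives outside the values buffer, a null slot can be returned as a scalar that still carries the dtype.

```python
class TypedNull:
    """Hypothetical null scalar that remembers the dtype of its array."""

    def __init__(self, dtype):
        self.dtype = dtype

    def __repr__(self):
        return "NA[%s]" % self.dtype


class BitmapArray:
    """Toy array: a values buffer plus a separate validity bitmap."""

    def __init__(self, dtype, items):
        self.dtype = dtype
        self.length = len(items)
        # Null slots hold an arbitrary placeholder; it is never read back.
        self.values = [0 if x is None else x for x in items]
        # Bit i is set when slot i holds a real value.
        self.validity = 0
        for i, x in enumerate(items):
            if x is not None:
                self.validity |= 1 << i

    def is_valid(self, i):
        return bool((self.validity >> i) & 1)

    @property
    def null_count(self):
        return sum(1 for i in range(self.length) if not self.is_valid(i))

    def __getitem__(self, i):
        i = range(self.length)[i]  # normalizes negative indices, checks bounds
        if self.is_valid(i):
            return self.values[i]
        return TypedNull(self.dtype)


# Integers stand in for epoch seconds and second counts here.
stamps = BitmapArray("timestamp", [1483142400, None])
spans = BitmapArray("duration", [86400, None])
print(stamps[-1], spans[-1])  # two nulls with different dtypes
```

Under this model the two nulls are distinguishable by dtype even though neither array stores a sentinel value.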
@chris-b1 thanks for filling me in. Is pyarrow the repo to keep an eye on to follow pd2 development?
Big if, but IIUC what you’re discussing is how a null is represented inside an array, where the array holds a dtype. I’m talking about a scalar, where NaT + TimedeltaIndex(...) is ambiguous because NaT currently quacks as both a datetime and a timedelta.
Yeah, the vision has evolved over time, but my current (possibly incorrect) understanding is:
arrow - base, python agnostic, c++ layer, core memory layout & algos - https://github.com/apache/arrow/tree/master/cpp
pyarrow - python wrapper/access to arrow - https://github.com/apache/arrow/tree/master/python
pandas2 - TBD, wrapper around pyarrow (may be one and the same), more traditional pandas interface (see also ibis - https://github.com/ibis-project/ibis)
arrow issues are on JIRA, here - https://issues.apache.org/jira/projects/ARROW/issues
In pyarrow, NA is also the scalar type. I'm not sure how this will actually work, since numeric ops etc. are not implemented yet, but in theory something like this could be supported:
In [144]: pa.array([1, 2, 3]) + pa.NA
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-144-09f1116a2f04> in <module>()
----> 1 pa.array([1, 2, 3]) + pa.NA
TypeError: unsupported operand type(s) for +: 'pyarrow.lib.Int64Array' and 'pyarrow.lib.NAType'
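For what it's worth, here is one guess at the semantics such an operation could have once it exists, written in plain Python rather than against pyarrow. The function `add_scalar` and the list-of-bools validity mask are assumptions made for illustration: adding a null scalar yields an all-null result of the same length, and existing nulls stay null otherwise.

```python
def add_scalar(values, validity, scalar):
    """Add a scalar to a (values, validity) pair; None stands in for NA.

    Hypothetical helper, not a pyarrow function.
    """
    if scalar is None:
        # NA propagates: every slot of the result is null.
        return [0] * len(values), [False] * len(values)
    # Only valid slots are computed; null slots keep a placeholder value.
    out = [v + scalar if ok else 0 for v, ok in zip(values, validity)]
    return out, list(validity)


values, validity = add_scalar([1, 2, 3], [True, True, True], None)
print(validity)  # every slot is null after adding NA
```

Note that the result here would still be an int64 array; only the validity changes, which is why an untyped NA is less of a problem for arrays than for free-standing scalars.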