pandas2
Separate pd2.NaT for datetime vs timedelta
A lot of headaches are caused by the fact that pd.NaT is usually a datetime but occasionally a timedelta. In some cases this leads to unavoidable internal inconsistency (pandas-dev/pandas#19124). For pandas2 it is worth considering breaking these into two distinct constants with unambiguous types.
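To make the proposal concrete, here is a minimal pure-Python sketch of what two typed constants could look like. The names NAT_DATETIME, NAT_TIMEDELTA and nat_add are hypothetical and exist only for illustration; nothing here is pandas or pandas2 API. The point is that once each missing scalar knows its own kind, the result type of an addition is never ambiguous.

```python
import datetime as dt


class _TypedNaT:
    """Hypothetical missing-value scalar that carries an explicit kind."""

    def __init__(self, name, kind):
        self._name = name
        self.kind = kind  # either "datetime" or "timedelta"

    def __repr__(self):
        return self._name


# Two distinct constants instead of one NaT that quacks as both types.
NAT_DATETIME = _TypedNaT("NaT[datetime]", "datetime")
NAT_TIMEDELTA = _TypedNaT("NaT[timedelta]", "timedelta")


def _kind(value):
    """Classify an operand as "datetime" or "timedelta"."""
    if isinstance(value, _TypedNaT):
        return value.kind
    if isinstance(value, dt.datetime):
        return "datetime"
    if isinstance(value, dt.timedelta):
        return "timedelta"
    raise TypeError(f"unsupported operand: {value!r}")


def nat_add(left, right):
    """Add two operands, returning the correctly typed NaT if either is missing."""
    kinds = (_kind(left), _kind(right))
    if kinds == ("datetime", "datetime"):
        raise TypeError("cannot add two datetimes")
    if isinstance(left, _TypedNaT) or isinstance(right, _TypedNaT):
        # datetime + timedelta is a datetime; timedelta + timedelta is a timedelta.
        return NAT_DATETIME if "datetime" in kinds else NAT_TIMEDELTA
    return left + right
```

With this, nat_add(NAT_TIMEDELTA, dt.timedelta(days=1)) stays a timedelta-typed NaT, while nat_add(NAT_DATETIME, dt.timedelta(days=1)) stays datetime-typed, and adding NAT_DATETIME to a datetime raises instead of silently returning something.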
I believe the API in arrow/pandas2 is currently pushing in the opposite direction, using a unified NA scalar (see the example below). However, separating the two will probably be easier than it sounds, because missing-ness is always tracked in a separate bitmap rather than as special sentinel values.
In [132]: import pyarrow as pa
In [133]: pa.array([1, 2, None])
Out[133]:
<pyarrow.lib.Int64Array object at 0x000000000BCBBB88>
[
1,
2,
NA
]
In [134]: pa.array([1, 2, None])[-1]
Out[134]: NA
In [135]: import datetime
In [136]: pa.array([datetime.datetime(2016, 12, 31), None])
Out[136]:
<pyarrow.lib.TimestampArray object at 0x000000000BD2CB38>
[
Timestamp('2016-12-31 00:00:00'),
NA
]
In [137]: pa.array([datetime.datetime(2016, 12, 31), None])[-1]
Out[137]: NA
In [138]: type(_)
Out[138]: pyarrow.lib.NAType
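For anyone unfamiliar with the bitmap approach mentioned above, here is a rough pure-Python sketch of the idea, not pyarrow's actual implementation. It assumes the Arrow convention that bit i is set when slot i is valid, with least-significant-bit numbering inside each byte; the helper names pack_validity and is_valid are made up for this example.

```python
def pack_validity(values):
    """Build a validity bitmap: bit i is 1 when values[i] is present."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            # Least-significant-bit numbering within each byte.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Return True when slot i is marked as present in the bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


values = [1, 2, None]
bitmap = pack_validity(values)

# The data buffer holds a plain value in every slot; the placeholder under
# a null is arbitrary because only the bitmap decides whether it is read.
data = [v if v is not None else 0 for v in values]
```

Since the null is recorded outside the data buffer, no value of the element type has to be reserved as a sentinel, and the array's dtype alone says what kind of value is missing.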
@chris-b1 thanks for filling me in. Is pyarrow the repo to keep an eye on to follow pd2 development?
Big if, but IIUC what you’re discussing is how a null is represented inside an array, where the array holds a dtype. I’m talking about a scalar, where NaT + TimedeltaIndex(...) is ambiguous because NaT currently quacks as both a datetime and a timedelta.
Yeah, the vision has evolved over time, but my current (possibly incorrect) understanding is:
- arrow - base, python agnostic, c++ layer, core memory layout & algos - https://github.com/apache/arrow/tree/master/cpp
- pyarrow - python wrapper/access to arrow, https://github.com/apache/arrow/tree/master/python
- pandas2 - TBD, wrapper around pyarrow (may be one and the same), more traditional pandas interface. (see also ibis - https://github.com/ibis-project/ibis)
arrow issues are on JIRA, here - https://issues.apache.org/jira/projects/ARROW/issues
In pyarrow, NA is also the scalar type. Not sure how this will actually work, as numeric ops etc. are not implemented yet, but for instance, in theory it could be:
In [144]: pa.array([1, 2, 3]) + pa.NA
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-144-09f1116a2f04> in <module>()
----> 1 pa.array([1, 2, 3]) + pa.NA
TypeError: unsupported operand type(s) for +: 'pyarrow.lib.Int64Array' and 'pyarrow.lib.NAType'
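As a guess at what such an operation might do once numeric ops exist, here is a small pure-Python sketch of NA propagation. It is an assumption for illustration only, not pyarrow behavior: None stands in for pa.NA, and add_scalar is a hypothetical helper.

```python
NA = None  # stand-in for pa.NA in this sketch


def add_scalar(values, scalar):
    """Add scalar to each element; a missing operand makes the result missing."""
    if scalar is NA:
        # An untyped NA scalar nulls out every slot; the result type
        # would have to come from the array, not from the scalar.
        return [NA] * len(values)
    return [NA if value is NA else value + scalar for value in values]
```

Here the untyped scalar is harmless because the array supplies the type. The scalar-only case, where there is no array to consult, is where the ambiguity discussed above comes back.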