pandas2 Separate pd2.NaT for datetime vs timedelta

A lot of headaches are caused by the fact that pd.NaT is usually a datetime but occasionally a timedelta. In some cases this leads to unavoidable internal inconsistency (pandas-dev/pandas#19124). For pandas2 it is worth considering breaking these into two distinct constants with unambiguous types.

Jan 09 '18 16:01 jbrockmendel

I believe the API in arrow/pandas2 is currently pushing in the opposite direction, using a unified NA scalar (e.g. below). However, it will probably be easier than it sounds because missing-ness is always tracked in a separate bitmap, rather than as special sentinel values.

In [132]: import pyarrow as pa

In [133]: pa.array([1, 2, None])
Out[133]: 
<pyarrow.lib.Int64Array object at 0x000000000BCBBB88>
[
  1,
  2,
  NA
]

In [134]: pa.array([1, 2, None])[-1]
Out[134]: NA

In [135]: import datetime

In [136]: pa.array([datetime.datetime(2016, 12, 31), None])
Out[136]: 
<pyarrow.lib.TimestampArray object at 0x000000000BD2CB38>
[
  Timestamp('2016-12-31 00:00:00'),
  NA
]

In [137]: pa.array([datetime.datetime(2016, 12, 31), None])[-1]
Out[137]: NA

In [138]: type(_)
Out[138]: pyarrow.lib.NAType
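
To make the bitmap point concrete, here is a toy sketch in plain Python. It is not how arrow is implemented; `BitmapArray`, `NA`, and the `placeholder` argument are hypothetical names, and a list of booleans stands in for the real validity bitmap. The idea it illustrates is that the null scalar can stay untyped because the type lives on the array.

```python
import datetime as dt

# Hypothetical untyped null scalar, standing in for pyarrow's NA.
NA = object()


class BitmapArray:
    """Toy array: values live in one list, validity flags in another."""

    def __init__(self, dtype, items, placeholder):
        self.dtype = dtype
        # True where a value is present, False where it is missing.
        self.valid = [item is not None for item in items]
        # Missing slots hold an arbitrary placeholder, never a sentinel.
        self.values = [placeholder if item is None else item for item in items]

    def __getitem__(self, index):
        # The validity mask, not the stored value, decides what is null.
        return self.values[index] if self.valid[index] else NA

    def null_count(self):
        return self.valid.count(False)


ints = BitmapArray("int64", [1, 2, None], placeholder=0)
stamps = BitmapArray(
    "timestamp",
    [dt.datetime(2016, 12, 31), None],
    placeholder=dt.datetime(1970, 1, 1),
)

# Both arrays hand back the same NA scalar, but each array still knows its type.
print(ints[-1] is stamps[-1], ints.dtype, stamps.dtype)
```

In this sketch the scalar carries no type at all, so any disambiguation has to come from the array it was taken from.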

Jan 09 '18 19:01 chris-b1

@chris-b1 thanks for filling me in. Is pyarrow the repo to keep an eye on to follow pd2 development?

Big if, but IIUC what you’re discussing is how a null is represented inside an array, where the array holds a dtype. I’m talking about a scalar, where NaT + TimedeltaIndex(...) is ambiguous because NaT currently quacks as both a datetime and a timedelta.
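
A rough sketch of what I mean by two unambiguous constants, in plain Python rather than pandas. `DatetimeNaT`, `TimedeltaNaT`, `kind_of`, and `add` are hypothetical names for illustration only; nothing like them exists in pandas today.

```python
import datetime as dt


class _TypedNaT:
    """Hypothetical missing-value scalar that knows what kind of value it replaces."""

    def __init__(self, kind):
        self.kind = kind  # "datetime" or "timedelta"

    def __repr__(self):
        return f"NaT[{self.kind}]"


# Hypothetical constants; neither exists in pandas.
DatetimeNaT = _TypedNaT("datetime")
TimedeltaNaT = _TypedNaT("timedelta")


def kind_of(value):
    """Classify an operand as "datetime" or "timedelta"."""
    if isinstance(value, _TypedNaT):
        return value.kind
    if isinstance(value, dt.datetime):
        return "datetime"
    if isinstance(value, dt.timedelta):
        return "timedelta"
    raise TypeError(f"unsupported operand: {value!r}")


def add(left, right):
    """Add two operands; the result type never depends on guessing what NaT is."""
    kinds = {kind_of(left), kind_of(right)}
    if kinds == {"datetime"}:
        raise TypeError("cannot add two datetimes")
    if not isinstance(left, _TypedNaT) and not isinstance(right, _TypedNaT):
        return left + right
    # timedelta + timedelta stays a timedelta; anything with a datetime is a datetime.
    return TimedeltaNaT if kinds == {"timedelta"} else DatetimeNaT


one_day = dt.timedelta(days=1)
print(add(TimedeltaNaT, one_day))  # timedelta-like null
print(add(DatetimeNaT, one_day))   # datetime-like null
```

With a single NaT, the two calls at the bottom are the same expression and the library has to pick one interpretation.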

Jan 09 '18 20:01 jbrockmendel

Yeah, the vision has evolved over time, but my current (possibly incorrect) understanding is:

arrow - base, python agnostic, c++ layer, core memory layout & algos - https://github.com/apache/arrow/tree/master/cpp
pyarrow - python wrapper/access to arrow, https://github.com/apache/arrow/tree/master/python
pandas2 - TBD, wrapper around pyarrow (may be one and the same), more traditional pandas interface. (see also ibis - https://github.com/ibis-project/ibis)

arrow issues are on JIRA, here - https://issues.apache.org/jira/projects/ARROW/issues

In pyarrow, NA is also the scalar type. I'm not sure how this will actually work, since numeric ops etc. are not implemented yet, but in theory something like this could be supported:


In [144]: pa.array([1, 2, 3]) + pa.NA
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-144-09f1116a2f04> in <module>()
----> 1 pa.array([1, 2, 3]) + pa.NA

TypeError: unsupported operand type(s) for +: 'pyarrow.lib.Int64Array' and 'pyarrow.lib.NAType'

Jan 09 '18 20:01 chris-b1
