Separate pd2.NaT for datetime vs timedelta

Open jbrockmendel opened this issue 7 years ago • 3 comments

A lot of headaches are caused by the fact that pd.NaT is usually a datetime but occasionally a timedelta. In some cases this leads to unavoidable internal inconsistency (pandas-dev/pandas#19124). For pandas2 it is worth considering breaking these into two distinct constants with unambiguous types.
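To make the proposal concrete, here is a rough sketch of what two distinct null constants could look like. Everything in it (`NaTDatetimeType`, `NaTTimedeltaType`, `NaT_DATETIME`, `NaT_TIMEDELTA`) is a hypothetical name for illustration, not an existing or planned pandas API, and it only covers addition.

```python
import datetime as dt


class NaTDatetimeType:
    """Hypothetical null scalar that is always datetime-like."""

    def __repr__(self):
        return "NaT(datetime)"

    def __add__(self, other):
        # datetime + timedelta -> datetime, so the null stays datetime-like.
        if isinstance(other, (dt.timedelta, NaTTimedeltaType)):
            return self
        # datetime + datetime is not defined, so let Python raise TypeError.
        return NotImplemented

    __radd__ = __add__


class NaTTimedeltaType:
    """Hypothetical null scalar that is always timedelta-like."""

    def __repr__(self):
        return "NaT(timedelta)"

    def __add__(self, other):
        # timedelta + timedelta -> timedelta
        if isinstance(other, (dt.timedelta, NaTTimedeltaType)):
            return self
        # timedelta + datetime -> datetime
        if isinstance(other, (dt.datetime, NaTDatetimeType)):
            return NaT_DATETIME
        return NotImplemented

    __radd__ = __add__


# Module-level singletons (hypothetical names).
NaT_DATETIME = NaTDatetimeType()
NaT_TIMEDELTA = NaTTimedeltaType()

print(dt.datetime(2018, 1, 9) + NaT_TIMEDELTA)
print(NaT_TIMEDELTA + dt.timedelta(days=1))
```

With the types split like this, the result type of each operation follows from the operand types alone, and adding two datetime-like nulls raises a TypeError just as adding two datetimes does.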

jbrockmendel avatar Jan 09 '18 16:01 jbrockmendel

I believe the API in arrow/pandas2 is currently pushing in the opposite direction, using a unified NA scalar (e.g. below). However, it will probably be easier than it sounds because missing-ness is always tracked in a separate bitmap, rather than as special sentinel values.

In [132]: import pyarrow as pa

In [133]: pa.array([1, 2, None])
<pyarrow.lib.Int64Array object at 0x000000000BCBBB88>

In [134]: pa.array([1, 2, None])[-1]
Out[134]: NA

In [135]: import datetime

In [136]: pa.array([datetime.datetime(2016, 12, 31), None])
<pyarrow.lib.TimestampArray object at 0x000000000BD2CB38>
  Timestamp('2016-12-31 00:00:00'),

In [137]: pa.array([datetime.datetime(2016, 12, 31), None])[-1]
Out[137]: NA

In [138]: type(_)
Out[138]: pyarrow.lib.NAType
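For anyone unfamiliar with the bitmap idea, a toy pure-Python sketch follows. It is not how Arrow is implemented; `BitmapColumn` and its methods are made-up names, and the only point is that a missing slot is marked in a separate validity mask, so its type comes from the column's dtype rather than from a typed sentinel value.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BitmapColumn:
    """Toy column: values plus a separate validity bitmap (hypothetical)."""

    dtype: str
    values: List[int]
    validity: int  # bit i set means slot i holds a real value

    @classmethod
    def from_list(cls, dtype: str, items: List[Optional[int]]) -> "BitmapColumn":
        values, validity = [], 0
        for i, item in enumerate(items):
            if item is None:
                values.append(0)  # placeholder, never read for a null slot
            else:
                values.append(item)
                validity |= 1 << i
        return cls(dtype, values, validity)

    def is_valid(self, i: int) -> bool:
        return bool((self.validity >> i) & 1)

    def get(self, i: int) -> Tuple[str, Optional[int]]:
        # The dtype is reported even for a null slot: the container,
        # not the null scalar, knows what kind of value is missing.
        return (self.dtype, self.values[i] if self.is_valid(i) else None)


col = BitmapColumn.from_list("int64", [1, 2, None])
print(col.get(0))
print(col.get(2))
```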

chris-b1 avatar Jan 09 '18 19:01 chris-b1

@chris-b1 thanks for filling me in. Is pyarrow the repo to keep an eye on to follow pd2 development?

Big if, but IIUC what you’re discussing is how a null is represented inside an array, where the array holds a dtype. I’m talking about a scalar, where NaT + TimedeltaIndex(...) is ambiguous because NaT currently quacks as both a datetime and a timedelta.
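A small illustration of the scalar problem, assuming a reasonably recent pandas install: both the Timestamp and the Timedelta constructors hand back the same singleton, so the scalar alone cannot say which kind of null it is, while inside a container the neighbouring values settle the dtype.

```python
import pandas as pd

# Both constructors return the very same object.
ts_null = pd.Timestamp("NaT")
td_null = pd.Timedelta("NaT")
same_object = (ts_null is pd.NaT) and (td_null is pd.NaT)

# In a container the other elements decide how NaT is interpreted.
# numpy dtype kinds: "M" is datetime64, "m" is timedelta64.
datetime_kind = pd.Series([pd.Timestamp("2018-01-09"), pd.NaT]).dtype.kind
timedelta_kind = pd.Series([pd.Timedelta(days=1), pd.NaT]).dtype.kind

print(same_object, datetime_kind, timedelta_kind)
```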

jbrockmendel avatar Jan 09 '18 20:01 jbrockmendel

Yeah, the vision has evolved over time, but my current (possibly incorrect) understanding is:

  • arrow - base layer: Python-agnostic C++, core memory layout & algos
  • pyarrow - Python wrapper/access to arrow
  • pandas2 - TBD, wrapper around pyarrow (may be one and the same), more traditional pandas interface (see also ibis)

arrow issues are on JIRA.

In pyarrow, NA is also the scalar type. I'm not sure how this will actually work, since numeric ops etc. are not implemented yet, but in theory something like the following could be supported (right now it raises):

In [144]: pa.array([1, 2, 3]) + pa.NA
TypeError                                 Traceback (most recent call last)
<ipython-input-144-09f1116a2f04> in <module>()
----> 1 pa.array([1, 2, 3]) + pa.NA

TypeError: unsupported operand type(s) for +: 'pyarrow.lib.Int64Array' and 'pyarrow.lib.NAType'

chris-b1 avatar Jan 09 '18 20:01 chris-b1