
DataFrame interchange protocol: datetime units

Open jorisvandenbossche opened this issue 3 years ago • 3 comments

We currently list "datetime support" in the design document, and also list it in the dtype docstring:

https://github.com/data-apis/dataframe-api/blob/27b8e1cb676bf10704d1dfc3dca0d0d806e2e802/protocol/dataframe_protocol.py#L142

But at the moment the spec doesn't say anything about how the datetime is stored (which resolution, or whether it supports multiple resolutions with some parametrization).

Updating the spec to mention it should be nanoseconds might be the obvious solution (since that's the only resolution pandas currently supports), but I think we should make this more flexible and allow different units (hopefully pandas will support non-nanosecond resolutions in the future, and other systems might use other resolutions by default).

jorisvandenbossche, Sep 13 '21 15:09

The spec mentions that the format string is used for datetime specification and that it uses the Arrow C Data Interface format string specification, so I'd argue this is well defined.

kkraus14, Sep 13 '21 20:09

> The spec mentions that the format string is used for datetime specification

OK, doing a second search, I found "Format strings are mostly useful for datetime specification, and for categoricals." in the notes of the dtypes docstring. That can probably be made a bit more explicit :)

But IMO there is still the question if we find this sufficient, as it would mean that you need to parse a string (to extract the resolution) to know how to interpret the buffer (but it certainly avoids needing to add more parametrization to the len-4 tuple that is currently already returned for .dtype). BTW, @rgommers, on the other hand this would also already solve the question about how to support timezones, as the Arrow C Data interface format strings include a timezone.
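To make the "parse a string to extract the resolution" step concrete, here is a minimal sketch. It assumes the Arrow C Data Interface layout for timestamps, `ts` followed by a unit character (`s`, `m`, `u`, `n`), a colon, and an optional timezone. The names `parse_timestamp_format` and `TimestampFormat` are hypothetical and not part of the protocol.

```python
from typing import NamedTuple, Optional

# Unit characters used by Arrow C Data Interface timestamp format strings.
_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


class TimestampFormat(NamedTuple):
    """Hypothetical container for the parsed pieces of a format string."""
    unit: str
    timezone: Optional[str]


def parse_timestamp_format(fmt: str) -> TimestampFormat:
    """Split a string such as 'tsn:UTC' into its resolution and timezone."""
    if len(fmt) < 4 or not fmt.startswith("ts") or fmt[3] != ":":
        raise ValueError(f"not a timestamp format string: {fmt!r}")
    unit_code = fmt[2]
    if unit_code not in _UNITS:
        raise ValueError(f"unknown timestamp unit {unit_code!r} in {fmt!r}")
    # An empty timezone part means the timestamp is timezone-naive.
    timezone = fmt[4:] or None
    return TimestampFormat(_UNITS[unit_code], timezone)
```

A consumer would call this once per datetime column, for example `parse_timestamp_format("tsu:Europe/Brussels")`, and then interpret the integer buffer in the returned unit.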

jorisvandenbossche, Sep 14 '21 06:09

> But IMO there is still the question if we find this sufficient, as it would mean that you need to parse a string (to extract the resolution) to know how to interpret the buffer

Only if the dtype itself is datetime, right? That seems fine, because how else are we going to support timezones if not via format strings?
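As a rough illustration of "only if the dtype itself is datetime": the sketch below assumes the len-4 `.dtype` tuple is ordered `(kind, bitwidth, format_str, endianness)` and that the datetime kind is the integer `22`, as in the protocol draft. Both are assumptions here, and `datetime_unit` is a hypothetical helper.

```python
from typing import Optional, Tuple

# Assumed enum value for the datetime kind in the protocol draft.
DATETIME_KIND = 22

# Unit characters from Arrow C Data Interface timestamp format strings.
_UNIT_BY_CODE = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


def datetime_unit(dtype: Tuple[int, int, str, str]) -> Optional[str]:
    """Return the resolution for datetime dtypes, or None for anything else."""
    kind, _bitwidth, format_str, _endianness = dtype
    if kind != DATETIME_KIND:
        # Non-datetime columns never need their format string inspected.
        return None
    if len(format_str) < 3 or not format_str.startswith("ts"):
        raise ValueError(f"unexpected datetime format string: {format_str!r}")
    return _UNIT_BY_CODE[format_str[2]]
```

With this shape, integer and float columns take the early return, so the string parsing cost is paid only on datetime columns.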

> BTW, @rgommers, on the other hand this would also already solve the question about how to support timezones, as the Arrow C Data interface format strings include a timezone.

Yes, good point. Maybe that's fine and the rest is "just" implementation (and I'm just scarred by the NumPy history).

rgommers, Sep 22 '21 20:09