dataframe-api
DataFrame interchange protocol: datetime units
We currently list "datetime support" in the design document, and it is also listed in the dtype docstring:
https://github.com/data-apis/dataframe-api/blob/27b8e1cb676bf10704d1dfc3dca0d0d806e2e802/protocol/dataframe_protocol.py#L142
But at the moment the spec doesn't say anything about how the datetime is stored (which resolution, or whether it supports multiple resolutions with some parametrization).
Updating the spec to mention it should be nanoseconds might be the obvious solution (since that's the only resolution pandas currently supports), but I think we should make this more flexible and allow different units (hopefully pandas will support non-nanosecond resolutions in the future, and other systems might use other resolutions by default).
The spec mentions that the format string is used for datetime specification and that it uses the Arrow C Data Interface format string specification, so I'd argue this is well defined.
> The spec mentions that the format string is used for datetime specification
OK, doing a second search, I found "Format strings are mostly useful for datetime specification, and for categoricals." in the notes of the dtypes docstring. That can probably be made a bit more explicit :)
But IMO there is still the question of whether we find this sufficient, as it would mean that you need to parse a string (to extract the resolution) to know how to interpret the buffer (but it certainly avoids needing to add more parametrization to the len-4 tuple that is currently already returned for .dtype).
BTW, @rgommers, on the other hand this would also already solve the question about how to support timezones, as the Arrow C Data Interface format strings include a timezone.
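To make the "parse a string to extract the resolution" point concrete, here is a minimal sketch of what a consumer might do. It assumes the timestamp format strings follow the Arrow C Data Interface layout `ts{unit}:{timezone}` (unit codes `s`, `m`, `u`, `n`; empty timezone meaning timezone-naive). The names `TimestampFormat` and `parse_timestamp_format` are hypothetical and not part of the protocol.

```python
from typing import NamedTuple, Optional

# Arrow C Data Interface unit codes for timestamps, mapped to the
# more familiar unit abbreviations.
_ARROW_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


class TimestampFormat(NamedTuple):
    """Hypothetical container for the parsed pieces of a format string."""
    unit: str                 # one of "s", "ms", "us", "ns"
    timezone: Optional[str]   # None when the format string has no timezone


def parse_timestamp_format(format_str: str) -> TimestampFormat:
    """Split an Arrow-style timestamp format string into unit and timezone."""
    # Expected shape: "ts" + one unit character + ":" + optional timezone.
    if (
        len(format_str) < 4
        or not format_str.startswith("ts")
        or format_str[3] != ":"
    ):
        raise ValueError(f"not a timestamp format string: {format_str!r}")
    unit_code = format_str[2]
    if unit_code not in _ARROW_UNITS:
        raise ValueError(f"unknown timestamp unit: {unit_code!r}")
    timezone = format_str[4:] or None
    return TimestampFormat(_ARROW_UNITS[unit_code], timezone)
```

So `"tsn:"` would describe timezone-naive nanoseconds, and `"tsu:UTC"` microseconds in UTC; the parsing itself is small, which is part of the argument for not adding more fields to the dtype tuple.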
> But IMO there is still the question of whether we find this sufficient, as it would mean that you need to parse a string (to extract the resolution) to know how to interpret the buffer

Only if the dtype itself is datetime, right? That seems fine, because how else are we going to support timezones if not via format strings?
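A rough sketch of that "only for datetime" branch, assuming the len-4 dtype tuple is `(kind, bitwidth, format_str, endianness)` and that the datetime kind is the integer 22 as in the protocol's `DtypeKind` enum (worth double-checking against the spec). The helper name `datetime_unit_code` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: integer value of the datetime kind in the protocol's DtypeKind enum.
DATETIME_KIND = 22

# Arrow-style timestamp format string: "ts" + unit code + ":" + optional timezone.
_TIMESTAMP_RE = re.compile(r"ts([smun]):(.*)")


def datetime_unit_code(dtype: Tuple[int, int, str, str]) -> Optional[str]:
    """Return the Arrow unit code ("s", "m", "u", "n") for datetime dtypes.

    For every other kind the format string is never inspected and None is
    returned, so string parsing is only needed on the datetime path.
    """
    kind, _bitwidth, format_str, _endianness = dtype
    if kind != DATETIME_KIND:
        return None
    match = _TIMESTAMP_RE.fullmatch(format_str)
    if match is None:
        raise ValueError(f"unexpected datetime format string: {format_str!r}")
    return match.group(1)
```

For example, `datetime_unit_code((22, 64, "tsn:", "="))` gives `"n"`, while a float64 dtype such as `(2, 64, "g", "=")` gives `None` without touching the format string.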
> BTW, @rgommers, on the other hand this would also already solve the question about how to support timezones, as the Arrow C Data Interface format strings include a timezone.
Yes, good point. Maybe that's fine and the rest is "just" implementation (and I'm just scarred by the NumPy history).