ENH: Add `unit` argument to `to_datetime` and `to_timedelta` to avoid value-dependent parsing
Feature Type
- [x] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
Over in https://github.com/dask/dask/issues/12178#issuecomment-3604828151, we're discussing how dask should adapt to the new datetime / timedelta resolution inference.
The new behavior is value-dependent: you don't know what dtype the result will be until you run it on the values. This is a challenge for dask, which might process subsets of the data in parallel, but would like each partition of a column to have the same data type:
In [9]: s = pd.Series(["1", "2", "1 day 2 hours"])
In [10]: pd.to_timedelta(s[:2])
Out[10]:
0 0 days 00:00:00.000000001
1 0 days 00:00:00.000000002
dtype: timedelta64[ns]
In [11]: pd.to_timedelta(s[2:])
Out[11]:
2 1 days 02:00:00
dtype: timedelta64[us]
The first partition (s[:2]) is inferred to be timedelta64[ns], while the second partition (s[2:]) is inferred to be timedelta64[us].
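For comparison, here is a minimal sketch of the workaround available today: parse each partition independently, then cast every result to one agreed-upon unit with `Series.dt.as_unit`. The partitioning by slicing and the choice of `"ns"` as the target are assumptions made for illustration; this is not how dask actually does it.

```python
import pandas as pd

s = pd.Series(["1", "2", "1 day 2 hours"])

# Parse each "partition" independently, as a parallel framework might.
partitions = [s[:2], s[2:]]
parsed = [pd.to_timedelta(part) for part in partitions]

# The inferred dtypes may differ between partitions, depending on the
# pandas version and on the values in each partition.
inferred = [str(p.dtype) for p in parsed]

# Cast every partition to a single unit. "ns" is chosen here so the
# nanosecond values in the first partition are not rounded away.
normalized = [p.dt.as_unit("ns") for p in parsed]
normalized_dtypes = {str(p.dtype) for p in normalized}

print(inferred)
print(normalized_dtypes)
```

This only works when the caller knows up front which unit is safe for all partitions, which is the information the proposed keyword would let users state directly.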
Feature Description
Add a dtype or resolution parameter to to_datetime and to_timedelta:
pd.to_datetime(
arg: 'DatetimeScalarOrArrayConvertible | DictConvertible',
errors: 'DateTimeErrorChoices' = 'raise',
dayfirst: 'bool' = False,
yearfirst: 'bool' = False,
utc: 'bool' = False,
format: 'str | None' = None,
exact: 'bool | lib.NoDefault' = <no_default>,
unit: 'str | None' = None,
infer_datetime_format: 'lib.NoDefault | bool' = <no_default>,
origin: 'str' = 'unix',
cache: 'bool' = True,
resolution: 'Resolution | None' = None,
) -> 'DatetimeIndex | Series | DatetimeScalar | NaTType | None'
"""
...
resolution: Resolution, optional
Controls the resolution of the result.
"""
This would ideally be implemented as pd.to_datetime(...).as_unit(resolution).
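As a rough sketch of those semantics, the wrapper below parses first and then casts with `as_unit`. The function name `to_datetime_with_resolution` and its `resolution` parameter are hypothetical and not part of pandas; only `pd.to_datetime`, `DatetimeIndex.as_unit`, `Timestamp.as_unit`, and `Series.dt.as_unit` are existing APIs.

```python
import pandas as pd


def to_datetime_with_resolution(arg, resolution, **kwargs):
    """Hypothetical helper (not a pandas API): parse, then cast to `resolution`."""
    result = pd.to_datetime(arg, **kwargs)
    if isinstance(result, pd.Series):
        # Series expose as_unit through the .dt accessor.
        return result.dt.as_unit(resolution)
    # DatetimeIndex and Timestamp both have as_unit directly.
    return result.as_unit(resolution)


idx = to_datetime_with_resolution(["2024-01-01", "2024-01-02"], "us")
ser = to_datetime_with_resolution(pd.Series(["2024-01-01", "2024-01-02"]), "s")
print(idx.dtype, ser.dtype)
```

Casting after the fact means out-of-bounds or lossy conversions surface in the cast rather than during parsing, so a real implementation inside `to_datetime` might behave differently with respect to `errors`.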
Alternative Solutions
Dask could just go on its own here and add that resolution keyword. But I suspect other workloads might benefit from knowing exactly what range they'll get out.
Additional Context
There's some complexity in how this proposed resolution keyword interacts with the existing keywords: in particular unit (how you interpret numeric values) and errors (what happens if you specify a value that is out of bounds for the resolution you provide?). I'd be curious to hear if those downsides outweigh any benefits.
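To make the distinction concrete, here is a small sketch using only existing pandas calls: `unit` tells `to_datetime` how to read numeric input, while the resolution of the result is a separate property that `as_unit` can set afterwards. The choice of `"ms"` as the target is arbitrary.

```python
import pandas as pd

# `unit` describes the *input*: these integers are read as seconds
# since the Unix epoch.
parsed = pd.to_datetime([1, 2], unit="s")

# The resolution of the *output* is a separate question; as_unit sets
# it explicitly without changing the instants represented.
as_ms = parsed.as_unit("ms")

print(parsed[0], as_ms.dtype)
```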
cc @jbrockmendel @jorisvandenbossche
This is a tough one. The correct API would be to have the unit keyword control the output unit and change the existing unit keyword to input_unit (xref #62440) but that requires a deprecation cycle. Using a different name instead of unit would mean we use different names for it in different places. And "resolution" in particular has the problem that DatetimeIndex and TimedeltaIndex have a resolution attribute that means yet another thing.
Yeah, that's tricky :/ I'm not sure I have a good suggestion. Perhaps waiting for https://github.com/pandas-dev/pandas/pull/62440 to go in and free up the unit keyword is the best option long-term.
If we know that we want unit long term to mean the unit of the return value (and start deprecating the current one pointing to input_unit or something like that), I do think that we could already start using unit for all cases where that keyword is currently not valid, i.e. for everything non-numeric.
> If we know that we want unit long term to mean the unit of the return value (and start deprecating the current one pointing to input_unit or something like that), I do think that we could already start using unit for all cases where that keyword is currently not valid, i.e. for everything non-numeric.
Maybe I'll feel differently after my caffeine kicks in, but this strikes me as sketchy. Tough to document clearly. Awkward to deprecate the input-interpretation cases. Potentially ambiguous for object-dtype cases where we have to figure out which way to interpret the keyword.
Do we not like something like result_unit?