ENH: Add `unit` argument to `to_datetime` and `to_timedelta` to avoid value-dependent parsing

Open TomAugspurger opened this issue 1 month ago • 3 comments

Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

Over in https://github.com/dask/dask/issues/12178#issuecomment-3604828151, we're discussing how dask should adapt to the new datetime / timedelta resolution inference.

The new behavior is value-dependent: you don't know what dtype the result will be until you run it on the values. This is a challenge for dask, which might process subsets of the data in parallel, but would like each partition of a column to have the same data type:

In [9]: s = pd.Series(["1", "2", "1 day 2 hours"])

In [10]: pd.to_timedelta(s[:2])
Out[10]: 
0   0 days 00:00:00.000000001
1   0 days 00:00:00.000000002
dtype: timedelta64[ns]

In [11]: pd.to_timedelta(s[2:])
Out[11]: 
2   1 days 02:00:00
dtype: timedelta64[us]

The first partition (s[:2]) is inferred to be timedelta64[ns], while the second partition (s[2:]) is inferred to be timedelta64[us].
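For illustration only, here is a minimal sketch of how a caller could work around the mismatch today by casting every partition to one agreed-upon unit after parsing. The `partitions` list and `target_unit` name are stand-ins invented for this example, not dask or pandas API; `Series.dt.as_unit` is the existing pandas accessor method.

```python
import pandas as pd

s = pd.Series(["1", "2", "1 day 2 hours"])

# Stand-ins for two partitions of the same column (hypothetical split).
partitions = [s[:2], s[2:]]

# Assumption: the caller decides up front which unit every partition shares.
target_unit = "ns"

# Parse each partition independently, then cast to the shared unit so the
# resulting dtype no longer depends on the values in that partition.
parsed = [pd.to_timedelta(p).dt.as_unit(target_unit) for p in partitions]

dtypes = {str(p.dtype) for p in parsed}
print(dtypes)
```

With the cast in place, both partitions report the same dtype regardless of which unit the parser inferred for each one.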

Feature Description

Add a dtype or resolution parameter to to_datetime and to_timedelta:

pd.to_datetime(
    arg: 'DatetimeScalarOrArrayConvertible | DictConvertible',
    errors: 'DateTimeErrorChoices' = 'raise',
    dayfirst: 'bool' = False,
    yearfirst: 'bool' = False,
    utc: 'bool' = False,
    format: 'str | None' = None,
    exact: 'bool | lib.NoDefault' = <no_default>,
    unit: 'str | None' = None,
    infer_datetime_format: 'lib.NoDefault | bool' = <no_default>,
    origin: 'str' = 'unix',
    cache: 'bool' = True,
    resolution: 'Resolution | None' = None,
) -> 'DatetimeIndex | Series | DatetimeScalar | NaTType | None'
"""
...
resolution: Resolution, optional
    Controls the resolution of the result.
"""

This would ideally be implemented as pd.to_datetime(...).as_unit(resolution).
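As a rough sketch of that idea (not a proposed implementation), a thin wrapper could parse first and cast afterwards. The function name `to_datetime_with_resolution` and its `resolution` parameter are hypothetical; only `pd.to_datetime`, `Series.dt.as_unit`, `DatetimeIndex.as_unit`, and `Timestamp.as_unit` are existing pandas calls.

```python
import pandas as pd


def to_datetime_with_resolution(arg, resolution="ns", **kwargs):
    """Hypothetical helper: parse with pd.to_datetime, then cast to a fixed unit."""
    result = pd.to_datetime(arg, **kwargs)
    if isinstance(result, pd.Series):
        # Series results expose as_unit through the .dt accessor.
        return result.dt.as_unit(resolution)
    if result is None or result is pd.NaT:
        # Nothing to cast for missing scalars.
        return result
    # DatetimeIndex and Timestamp both have an as_unit method.
    return result.as_unit(resolution)


series_result = to_datetime_with_resolution(
    pd.Series(["2024-01-01", "2024-01-02"]), resolution="us"
)
scalar_result = to_datetime_with_resolution("2024-01-01", resolution="ms")
print(series_result.dtype, scalar_result.unit)
```

Because the cast happens after parsing, the output unit is fixed by the caller rather than by the values, which is the property dask needs across partitions.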

Alternative Solutions

Dask could just go on its own here and add that resolution keyword. But I suspect other workloads might benefit from knowing exactly what range they'll get out.

Additional Context

There's some complexity here in how the proposed resolution keyword would interact with existing keywords: in particular unit (how you interpret numeric values) and errors (what happens if you specify a value that is out of bounds for the resolution you provide?). I'd be curious to hear if those downsides outweigh any benefits.
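A small sketch of the two existing behaviors a new keyword would have to coexist with. It uses only current pandas calls; the variable names are illustrative, and how a future keyword should treat the out-of-bounds case is exactly the open question.

```python
import pandas as pd

# `unit` today describes the *input*: how to interpret a numeric value.
as_seconds = pd.to_datetime(1, unit="s")   # 1 second after the epoch
as_millis = pd.to_datetime(1, unit="ms")   # 1 millisecond after the epoch
print(as_seconds, as_millis)

# Out of bounds: the year 1500 does not fit in nanosecond resolution,
# so asking for "ns" cannot succeed for this value.
try:
    pd.Timestamp("1500-01-01").as_unit("ns")
    out_of_bounds = False
except pd.errors.OutOfBoundsDatetime:
    out_of_bounds = True
print(out_of_bounds)
```

Whether such a failure should raise, coerce to NaT, or follow the errors argument is a design decision the proposal would need to settle.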

TomAugspurger, Dec 05 '25 02:12

cc @jbrockmendel @jorisvandenbossche

rhshadrach, Dec 05 '25 12:12

This is a tough one. The correct API would be to have the unit keyword control the output unit and change the existing unit keyword to input_unit (xref #62440) but that requires a deprecation cycle. Using a different name instead of unit would mean we use different names for it in different places. And "resolution" in particular has the problem that DatetimeIndex and TimedeltaIndex have a resolution attribute that means yet another thing.
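To make the naming clash concrete, here is a short sketch comparing the existing `unit` and `resolution` attributes on a DatetimeIndex. It relies only on current pandas attributes; the variable names are illustrative.

```python
import pandas as pd

# Daily timestamps stored with millisecond precision.
idx = pd.date_range("2024-01-01", periods=3, freq="D").as_unit("ms")

storage_unit = idx.unit           # unit of the underlying datetime64 dtype
value_resolution = idx.resolution  # granularity actually present in the values

print(storage_unit, value_resolution)
```

The two attributes answer different questions, so a keyword named resolution that controlled the storage unit would not match what the attribute of the same name reports.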

jbrockmendel, Dec 05 '25 15:12

Yeah, that's tricky :/ I'm not sure I have a good suggestion. Perhaps waiting for https://github.com/pandas-dev/pandas/pull/62440 to go in and free up the unit keyword is the best option long-term.

TomAugspurger, Dec 05 '25 20:12

If we know that we want unit long term to mean the unit of the return value (and start deprecating the current one pointing to input_unit or something like that), I do think that we could already start using unit for all cases where that keyword is currently not valid, i.e. for everything non-numeric.

jorisvandenbossche, Dec 11 '25 08:12

> If we know that we want unit long term to mean the unit of the return value (and start deprecating the current one pointing to input_unit or something like that), I do think that we could already start using unit for all cases where that keyword is currently not valid, i.e. for everything non-numeric.

Maybe I'll feel differently after my caffeine kicks in, but this strikes me as sketchy. Tough to document clearly. Awkward to deprecate the input-interpretation cases. Potentially ambiguous for object-dtype cases where we have to figure out which way to interpret the keyword.

jbrockmendel, Dec 11 '25 15:12

Do we not like something like result_unit?

rhshadrach, Dec 11 '25 21:12