pandas PDEP-4: consistent parsing of datetimes

Here's my proposal to address https://github.com/pandas-dev/pandas/issues/12585


Sep 18 '22 13:09 MarcoGorelli

Dear @MarcoGorelli, I have read your commit and it seems ok to me.

Ahmet

Sep 18 '22 15:09 ahmetanildindar

Just a general question: How would this impact functions like read_csv?

Thanks for taking a look - in read_csv, if parsing a date column fails, then the column is just kept as object

If a column or index cannot be represented as an array of datetimes, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type

So, if read_csv would previously have parsed a column whilst switching format midway, with this PDEP it wouldn't parse it and would return it unaltered as object data type

Example

Current behaviour: swaps format midway

In [2]: pd.read_csv(io.StringIO('12-01-2000 00:00\n13-01-2000 00:00'), header=None, parse_dates=[0])[0]
Out[2]:
0   2000-12-01
1   2000-01-13
Name: 0, dtype: datetime64[ns]

New behaviour: can't parse according to the first row, so returns the column unaltered as object

In [2]: pd.read_csv(io.StringIO('12-01-2000 00:00\n13-01-2000 00:00'), header=None, parse_dates=[0])[0]
Out[2]:
0    12-01-2000 00:00
1    13-01-2000 00:00
Name: 0, dtype: object

Sep 18 '22 15:09 MarcoGorelli

Thank you for this! It looks like a really good solution.

I don't know the code well enough to know how feasible this is, but in case of format inference, would it be possible to inform the user of what format has been inferred? At least maybe in case of errors. The user might not realise that there is a ['12-01-2000 00:00:00', '13-01-2000 00:00:00'] lurking somewhere in the data.

Sep 18 '22 18:09 lordgrenville

Cheers! And yeah if there's an error and the user is using errors='raise' (the default), then pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00']) would give

ValueError: time data '13-01-2000 00:00:00' does not match format '%m-%d-%Y %H:%M:%S' (match)

Sep 18 '22 18:09 MarcoGorelli

if errors=coerce then this wouldn't raise

Sep 18 '22 18:09 jreback

True, it would just show DatetimeIndex(['2000-12-01', 'NaT'], dtype='datetime64[ns]', freq=None)
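To make the raise-versus-coerce distinction concrete, here is a minimal pure-Python sketch. It is not pandas code: `parse_all` is a hypothetical helper, `None` stands in for `NaT`, and the format string is assumed to be the one inferred from the first element.

```python
from datetime import datetime


def parse_all(values, fmt, errors="raise"):
    """Parse every string with one fixed format (illustration only, not pandas)."""
    if errors not in ("raise", "coerce"):
        raise ValueError("errors must be 'raise' or 'coerce'")
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            if errors == "raise":
                raise  # surface the first value that does not match fmt
            parsed.append(None)  # stand-in for NaT
    return parsed


values = ["12-01-2000 00:00:00", "13-01-2000 00:00:00"]
fmt = "%m-%d-%Y %H:%M:%S"  # assumed: month-first format inferred from values[0]
coerced = parse_all(values, fmt, errors="coerce")
print(coerced)
```

With `errors="raise"` the second value stops the parse with a `ValueError`; with `errors="coerce"` it is replaced by the missing-value marker and parsing continues.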

Maybe there could be a verbose argument which prints out the inferred format, but I'd suggest we keep that to a separate discussion. For now, I think that if people just call .head() on the output, they can tell if the format has been inferred correctly

Sep 18 '22 18:09 MarcoGorelli

Maybe there could be a verbose argument which prints out the inferred format, but I'd suggest we keep that to a separate discussion. For now, I think that if people just call .head() on the output, they can tell if the format has been inferred correctly

Would it make sense to add guess_datetime_format to the public API instead (possibly renamed)?

from pandas._libs.tslibs.parsing import guess_datetime_format

print(guess_datetime_format('12-01-2000 00:00:00'))

# %m-%d-%Y %H:%M:%S

Sep 18 '22 21:09 rhshadrach

I don't think that what is proposed here (failing to parse date strings that don't match the inferred format) is the behavior you would want all the time. It might be a good option to have, controlled by a flag parameter.

I say that because the current behavior of infer_datetime_format is never harmful. It can speed up the parsing of a long list of dates, but it will never cause errors from date strings that could have been parsed correctly without this option. In real-world cases where the date format may not be 100% consistent, I think you want the parsing to succeed whenever it can. If it's really true that infer_datetime_format is purely a performance boost with no behavior change (and this is true as far as I know), maybe it should be enabled all the time (i.e. remove the option) or at least make it true by default.

The behavior proposed here is already available with the format option, but only if you already know what the format is. I can see where it would be valuable to be able to strictly enforce an unknown (inferred) format. So perhaps this behavior could be enabled with a flag such as infer_strict_format (default false). I can see the potential for confusion with the existing infer_datetime_format but that problem could be ameliorated, as suggested above, by removing the existing flag.

Sep 18 '22 23:09 scott-r

@scott-r Disagree with

...the current behavior of infer_datetime_format is never harmful

The harm is silently inferring two different formats, when in reality there is only one.

Sep 19 '22 06:09 lordgrenville

@scott-r Disagree with

...the current behavior of infer_datetime_format is never harmful

The harm is silently inferring two different formats, when in reality there is only one.

@lordgrenville To be clear, I used the phrase "never harmful" because infer_datetime_format controls a performance optimization that will not produce any more, or fewer, parsing errors than there would be if it were not used. Therefore, it's not clear to me why it even needs to exist as an option; it should be safe to enable it at all times.

Separate from that, I agree with you and @MarcoGorelli that sometimes you might want to enforce the restriction that all the date strings must adhere to a single format. That can be done today with the format option (when the single format is known) or by a new option (which I called infer_strict_format) that would enable the behavior @MarcoGorelli is proposing when the single format is not known at the outset. My only quibble is that there are valid use cases for both the current and proposed behavior; I don't think it's safe to enable the new behavior unconditionally.

Sep 19 '22 08:09 scott-r

@scott-r the current behaviour is unexpected to many and causes problems, see the many comments in https://github.com/pandas-dev/pandas/issues/12585 and the many linked issues

The behaviour of infer_datetime_format is also unexpected to many, see here, here and here, and many more

And sure, mixed formats might be a valid use case, but they can still be parsed using apply (where at least it's clear that each format will be guessed individually), e.g.

pd.Series(['12-01-2000 00:00:00', '13 January 2000']).apply(pd.to_datetime)

so I'd be -1 on adding another boolean argument

+1 lgtm.

Thanks Jeff! I'll wait to see if others have objections, then I'll change the status to accepted

Sep 19 '22 08:09 MarcoGorelli

The behaviour of infer_datetime_format is also unexpected to many, see here, here and here, and many more

@MarcoGorelli What if we were to change the current function of infer_datetime_format to what you are proposing, which would cause parsing errors when the format is different so that it matches expectations, then added a different flag (perhaps called try_fast_parsing or something) with the current behavior of infer_datetime_format? That way we could fix any confusion with the current functionality, but we also wouldn't lose the functionality and options to_datetime currently has.

Sep 19 '22 15:09 srotondo

Thanks for taking a look @srotondo

I don't think that would fix the confusion - in https://github.com/pandas-dev/pandas/issues/12585 and linked issues, people are using to_datetime without infer_datetime_format yet are still expecting consistent parsing

And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)

Furthermore, I'd be -1 on adding yet another boolean argument thus increasing complexity even more

Sep 19 '22 15:09 MarcoGorelli

I don't think that would fix the confusion - in #12585 and linked issues, people are using to_datetime without infer_datetime_format yet are still expecting consistent parsing

And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)

What if you keep infer_datetime_format, but change its type to str and change the default behavior? Values could be (open to changing these):

consistent : the new default, based on this proposal. Use first item to determine format, retain that format throughout
inconsistent: the format may change based on how each element is parsed (current behavior with value of True ?)
None: Do not infer, and only use the value of format (current default with False ?)

So people with code that doesn't specify infer_datetime_format will get the new behavior. If someone wants the old behavior, they can ask for it. If they do specify it as True or False, we raise.
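The three values above could be sketched roughly as follows. This is a pure-Python illustration under stated assumptions, not a proposed pandas implementation: `parse_dates`, `guess_format` and `CANDIDATE_FORMATS` are hypothetical names, and the candidate list is deliberately tiny, with the month-first format tried first.

```python
from datetime import datetime

# Hypothetical candidates; a real guesser would cover far more formats.
CANDIDATE_FORMATS = ["%m-%d-%Y %H:%M", "%d-%m-%Y %H:%M"]


def guess_format(value):
    """Return the first candidate format that parses value."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    raise ValueError(f"could not guess a format for {value!r}")


def parse_dates(values, infer="consistent", fmt=None):
    """Dispatch on the three suggested values of the option."""
    if isinstance(infer, bool):
        # True/False are rejected, as suggested in the comment above.
        raise TypeError("infer must be 'consistent', 'inconsistent' or None")
    if not values:
        return []
    if infer == "consistent":
        # Guess once from the first item, then hold every item to that format.
        fmt = guess_format(values[0])
        return [datetime.strptime(v, fmt) for v in values]
    if infer == "inconsistent":
        # Guess separately for each item, so the format may change midway.
        return [datetime.strptime(v, guess_format(v)) for v in values]
    if infer is None:
        # No inference: only the explicitly supplied format is used.
        if fmt is None:
            raise ValueError("fmt is required when infer is None")
        return [datetime.strptime(v, fmt) for v in values]
    raise ValueError(f"unknown value for infer: {infer!r}")


values = ["12-01-2000 00:00", "13-01-2000 00:00"]
per_element = parse_dates(values, infer="inconsistent")
print(per_element)
```

In this sketch, `infer="consistent"` raises on the second value because it does not fit the month-first format guessed from the first, while `infer="inconsistent"` silently switches to day-first.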

Sep 19 '22 16:09 Dr-Irv

And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)

@MarcoGorelli I suppose if the current functionality is still available through .apply(), that's probably fine, but do take care that any change you make to to_datetime doesn't affect the outcome of .apply(to_datetime) and also make it very clear in the documentation that that is how you get the old behavior. I'm just not very comfortable with completely removing the old functionality from to_datetime since even though you stated that many people expect this proposed behavior, it's possible some people use and rely on the current behavior. But as long as you can clearly and easily find how to get that old behavior, it's probably alright. Just please be sure that the current behavior isn't completely removed and is still accessible somehow.

Sep 19 '22 16:09 srotondo

@Dr-Irv sure, that'd be an option, but if there's gonna be a breaking change, I'd suggest we take the chance to simplify - having infer_datetime_format take three different options feels like too much complexity

Sep 19 '22 16:09 MarcoGorelli

+1 on the big picture. Implementation-wise, is the idea to pretty much replace our usage of dateutil.parser?

Sep 19 '22 17:09 jbrockmendel

If infer_datetime_format were deprecated, I do like @rhshadrach's suggestion https://github.com/pandas-dev/pandas/pull/48621#issuecomment-1250390925 of making guess_datetime_format public, so that existing "functionality" isn't lost while to_datetime becomes stricter; the prior behavior would be essentially equivalent to:

to_datetime(arg, format=pd.tools.guess_datetime_format(arg[0]))

Sep 19 '22 17:09 mroeschke

+1 on the big picture. Implementation-wise, is the idea to pretty much replace our usage of dateutil.parser?

Thanks

dateutil.parser would still be used within guess_datetime_format, but then subsequent rows would be parsed with that guessed format rather than repeatedly calling dateutil.parser and risk having it silently switch format

I'm also on board with the suggestion to make guess_datetime_format public

Sep 19 '22 18:09 MarcoGorelli

Thanks all for your feedback, I've incorporated some points into the document

There's been a couple of approvals, so for now I've changed the status to accepted

Sep 20 '22 09:09 MarcoGorelli

There's been 3 explicit approvals from core members, no "requested changes", and this has been open all week, so merging now

Thanks all for the discussion, aiming to start working on this soon-ish

Sep 23 '22 10:09 MarcoGorelli

I couldn't review before merged, but happy with the proposal here, really nice improvement.

I'm curious what's the suggested workaround mentioned in the PDEP for parsing columns with different formats. Should we mention it in the PDEP? Or is the idea to write it in the docs when implemented?

Also, it seems a few formats are broken when rendered: https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html

Sep 24 '22 12:09 datapythonista

Thanks - yeah it's mentioned at the end of "detailed description"

If a user has dates in a mixed format, they can still use flexible parsing and accept the risks that poses, e.g.:
In [3]: pd.Series(['12-01-2000 00:00:00', '13-01-2000 00:00:00']).apply(pd.to_datetime)
Out[3]:
0   2000-12-01
1   2000-01-13
dtype: datetime64[ns]

Sep 24 '22 12:09 MarcoGorelli

I'm curious what's the suggested workaround mentioned in the PDEP for parsing columns with different formats.

One way I read this question (and not sure if it's the right way) is if a frame has two columns each with a different (but self-consistent) format. They would be parsed individually, is that correct @MarcoGorelli?

Sep 24 '22 13:09 rhshadrach

Ah apologies, I'd misunderstood

I presume you mean in read_csv? If so, then yes, e.g.:

data = io.StringIO('13-01-2000 00:00,13 Jan 2000\n14-01-2000 00:00,14 Jan 2000\n')
print(pd.read_csv(data, header=None, parse_dates=[0, 1]))

would give

           0          1
0 2000-01-13 2000-01-13
1 2000-01-14 2000-01-14

Each column was parsed according to the format inferred from its respective first row.

If the first row of the first column had been 12-01-2000 00:00, then the inferred format would've been month-first, and so read_csv would've returned it unaltered as object, whilst the second column would still have been successfully converted

This wouldn't deviate from the current behaviour
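The per-column behaviour described above can be sketched in pure Python. This is an illustration only and not how read_csv is implemented: `parse_column`, `guess_format` and `CANDIDATE_FORMATS` are hypothetical, the candidate list is tiny with month-first tried first, and `%b` assumes an English locale.

```python
from datetime import datetime

# Hypothetical candidates; month-first is tried before day-first.
CANDIDATE_FORMATS = ["%m-%d-%Y %H:%M", "%d-%m-%Y %H:%M", "%d %b %Y"]


def guess_format(value):
    """Return the first candidate format that parses value, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_column(column):
    """Infer a format from the first row; return the column unaltered on failure."""
    fmt = guess_format(column[0]) if column else None
    if fmt is None:
        return list(column)
    try:
        return [datetime.strptime(value, fmt) for value in column]
    except ValueError:
        return list(column)  # some row did not match: keep the strings


rows = [
    ["12-01-2000 00:00", "13 Jan 2000"],
    ["13-01-2000 00:00", "14 Jan 2000"],
]
columns = [list(col) for col in zip(*rows)]  # transpose rows into columns
parsed = [parse_column(col) for col in columns]
print(parsed)
```

Here the first column is guessed as month-first from its first row, fails on the second row, and so comes back as strings; the second column is converted independently.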

Sep 24 '22 13:09 MarcoGorelli

Actually what I had in mind was answered in the first answer, I don't think my question was very clear. I missed the workaround when reading the PDEP, sorry.

Sep 24 '22 14:09 datapythonista

@MarcoGorelli was PDEP-4 fully implemented? Should we change its state to Implemented?

Also, I see there was a revision. Would it make sense to add the link to the revision PR to Discussion field, next to the original one?

Feb 27 '23 13:02 datapythonista