PDEP-4: consistent parsing of datetimes
Here's my proposal to address https://github.com/pandas-dev/pandas/issues/12585

Dear @MarcoGorelli, I have read your commit and it seems ok to me.
Ahmet
Just a general question: how would this impact functions like read_csv?
Thanks for taking a look - in read_csv, if parsing a date column fails, then the column is just kept as object
If a column or index cannot be represented as an array of datetimes, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type
So, if read_csv would previously have parsed a column whilst switching format midway, with this PDEP it wouldn't parse it and would return it unaltered as object data type
Example
Current behaviour: swaps format midway
In [2]: pd.read_csv(io.StringIO('12-01-2000 00:00\n13-01-2000 00:00'), header=None, parse_dates=[0])[0]
Out[2]:
0 2000-12-01
1 2000-01-13
Name: 0, dtype: datetime64[ns]
New behaviour: can't parse according to first row, so returns column unaltered as object
In [2]: pd.read_csv(io.StringIO('12-01-2000 00:00\n13-01-2000 00:00'), header=None, parse_dates=[0])[0]
Out[2]:
0 12-01-2000 00:00
1 13-01-2000 00:00
Name: 0, dtype: object
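To make the rule concrete, here is a minimal standard-library sketch of "guess the format from the first row, apply it to every row, and fall back to the unaltered input on failure". It is not the pandas implementation: `CANDIDATE_FORMATS`, `guess_format` and `parse_column` are hypothetical names, and the real format inference handles far more than two candidates.

```python
from datetime import datetime

# Hypothetical candidate list, tried in order (month-first before day-first).
CANDIDATE_FORMATS = ["%m-%d-%Y %H:%M", "%d-%m-%Y %H:%M"]


def guess_format(value):
    """Return the first candidate format that parses `value`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_column(values):
    """Parse every value with the format guessed from the first value.

    If the guess fails, or any later value does not match the guessed
    format, return the input unaltered (the read_csv-style fallback).
    """
    fmt = guess_format(values[0]) if values else None
    if fmt is None:
        return values
    try:
        return [datetime.strptime(v, fmt) for v in values]
    except ValueError:
        return values


mixed = ["12-01-2000 00:00", "13-01-2000 00:00"]
consistent = ["13-01-2000 00:00", "14-01-2000 00:00"]

result_mixed = parse_column(mixed)            # stays a list of strings
result_consistent = parse_column(consistent)  # becomes a list of datetimes
```

The first list is guessed as month-first from its first row, so the second row (month 13) cannot be parsed and the strings come back untouched. The second list is guessed as day-first and parses fully.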
Thank you for this! It looks like a really good solution.
I don't know the code well enough to know how feasible this is, but in case of format inference, would it be possible to inform the user of what format has been inferred? At least maybe in case of errors. The user might not realise that there is a ['12-01-2000 00:00:00', '13-01-2000 00:00:00'] lurking somewhere in the data.
Cheers! And yeah, if there's an error and the user is using errors='raise' (the default), then pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00']) would give
ValueError: time data '13-01-2000 00:00:00' does not match format '%m-%d-%Y %H:%M:%S' (match)
If errors='coerce' then this wouldn't raise
True, it would just show DatetimeIndex(['2000-12-01', 'NaT'], dtype='datetime64[ns]', freq=None)
Maybe there could be a verbose argument which prints out the inferred format, but I'd suggest we keep that to a separate discussion. For now, I think that if people just call .head() on the output, they can tell if the format has been inferred correctly
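For comparison, the same raise-versus-coerce difference can already be seen today when the format is passed explicitly. This is only a sketch using the existing `format` and `errors` arguments of `pd.to_datetime`; the month-first format string is written out by hand here rather than inferred.

```python
import pandas as pd

values = ["12-01-2000 00:00:00", "13-01-2000 00:00:00"]
fmt = "%m-%d-%Y %H:%M:%S"  # month-first, spelled out explicitly

# errors="raise" is the default: the second value has "month" 13, so this fails.
try:
    pd.to_datetime(values, format=fmt)
    raised = False
except ValueError:
    raised = True

# errors="coerce": the non-matching value becomes NaT instead of raising.
coerced = pd.to_datetime(values, format=fmt, errors="coerce")
print(coerced)
```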
Maybe there could be a verbose argument which prints out the inferred format, but I'd suggest we keep that to a separate discussion. For now, I think that if people just call .head() on the output, they can tell if the format has been inferred correctly
Would it make sense to add guess_datetime_format to the public API instead (possibly renamed)?
from pandas._libs.tslibs.parsing import guess_datetime_format
print(guess_datetime_format('12-01-2000 00:00:00'))
# %m-%d-%Y %H:%M:%S
I don't think that what is proposed here (failing to parse date strings that don't match the inferred format) is the behavior you would want all the time. It might be a good option to have, controlled by a flag parameter.
I say that because the current behavior of infer_datetime_format is never harmful. It can speed up the parsing of a long list of dates, but it will never cause errors from date strings that could have been parsed correctly without this option. In real-world cases where the date format may not be 100% consistent, I think you want the parsing to succeed whenever it can. If it's really true that infer_datetime_format is purely a performance boost with no behavior change (and this is true as far as I know), maybe it should be enabled all the time (i.e. remove the option) or at least make it true by default.
The behavior proposed here is already available with the format option, but only if you already know what the format is. I can see where it would be valuable to be able to strictly enforce an unknown (inferred) format. So perhaps this behavior could be enabled with a flag such as infer_strict_format (default false). I can see the potential for confusion with the existing infer_datetime_format but that problem could be ameliorated, as suggested above, by removing the existing flag.
@scott-r Disagree with
...the current behavior of infer_datetime_format is never harmful
The harm is silently inferring two different formats, when in reality there is only one.
@lordgrenville To be clear, I used the phrase "never harmful" because infer_datetime_format controls a performance optimization that will not produce any more, or fewer, parsing errors than there would be if it were not used. Therefore, it's not clear to me why it even needs to exist as an option; it should be safe to enable it at all times.
Separate from that, I agree with you and @MarcoGorelli that sometimes you might want to enforce the restriction that all the date strings must adhere to a single format. That can be done today with the format option (when the single format is known) or by a new option (which I called infer_strict_format) that would enable the behavior @MarcoGorelli is proposing when the single format is not known at the outset. My only quibble is that there are valid use cases for both the current and proposed behavior; I don't think it's safe to enable the new behavior unconditionally.
@scott-r the current behaviour is unexpected to many and causes problems, see the many comments in https://github.com/pandas-dev/pandas/issues/12585 and the many linked issues
The behaviour of infer_datetime_format is also unexpected to many, see here, here and here, and many more
And sure, mixed formats might be a valid use case, but they can still be parsed using apply (where at least it's clear that each format will be guessed individually), e.g.
pd.Series(['12-01-2000 00:00:00', '13 January 2000']).apply(pd.to_datetime)
so I'd be -1 on adding another boolean argument
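If the set of formats in a mixed column is known up front, another option (not from this thread, just a sketch) is to try each format explicitly and keep the first one that matches per element. `parse_with_known_formats` is a hypothetical helper; it only relies on `pd.to_datetime(..., format=..., errors="coerce")` and `Series.fillna`.

```python
import pandas as pd


def parse_with_known_formats(values, formats):
    """Try each format in order; earlier formats win where several match."""
    result = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        # Only fill positions that every earlier format left as NaT.
        result = result.fillna(pd.to_datetime(values, format=fmt, errors="coerce"))
    return result


mixed = pd.Series(["12-01-2000 00:00:00", "13 January 2000"])
parsed = parse_with_known_formats(mixed, ["%m-%d-%Y %H:%M:%S", "%d %B %Y"])
print(parsed)
```

Compared with .apply(pd.to_datetime), nothing is guessed per element: a value that matches none of the listed formats stays NaT, so it is easy to spot.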
+1 lgtm.
Thanks Jeff! I'll wait to see if others have objections, then I'll change the status to accepted
The behaviour of infer_datetime_format is also unexpected to many, see here, here and here, and many more
@MarcoGorelli What if we were to change the current function of infer_datetime_format to what you are proposing, which would cause parsing errors when the format is different so that it matches expectations, then added a different flag (perhaps called try_fast_parsing or something) with the current behavior of infer_datetime_format? That way we could fix any confusion with the current functionality, but we also wouldn't lose the functionality and options to_datetime currently has.
Thanks for taking a look @srotondo
I don't think that would fix the confusion - in https://github.com/pandas-dev/pandas/issues/12585 and linked issues, people are using to_datetime without infer_datetime_format yet are still expecting consistent parsing
And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)
Furthermore, I'd be -1 on adding yet another boolean argument thus increasing complexity even more
I don't think that would fix the confusion - in #12585 and linked issues, people are using to_datetime without infer_datetime_format yet are still expecting consistent parsing
And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)
What if you keep infer_datetime_format, but change its type to str and the default behavior. Values could be (open to changing these):
consistent: the new default, based on this proposal. Use first item to determine format, retain that format throughout
inconsistent: the format may change based on how each element is parsed (current behavior with value of True?)
None: Do not infer, and only use the value of format (current default with False?)
So people with code that doesn't specify infer_datetime_format will get the new behavior. If someone wants the old behavior, they can ask for it. If they do specify it as True or False, we raise.
And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)
@MarcoGorelli I suppose if the current functionality is still available through .apply(), that's probably fine, but do take care that any change you make to to_datetime doesn't affect the outcome of .apply(to_datetime) and also make it very clear in the documentation that that is how you get the old behavior. I'm just not very comfortable with completely removing the old functionality from to_datetime since even though you stated that many people expect this proposed behavior, it's possible some people use and rely on the current behavior. But as long as you can clearly and easily find how to get that old behavior, it's probably alright. Just please be sure that the current behavior isn't completely removed and is still accessible somehow.
@Dr-Irv sure, that'd be an option, but if there's gonna be a breaking change, I'd suggest we take the chance to simplify - having infer_datetime_format take three different options feels like too much complexity
+1 on the big picture. Implementation-wise, is the idea to pretty much replace our usage of dateutil.parser?
If infer_datetime_format was deprecated, I do like @rhshadrach's suggestion https://github.com/pandas-dev/pandas/pull/48621#issuecomment-1250390925 of making guess_datetime_format public such that existing "functionality" isn't lost while making to_datetime stricter as the prior behavior would be essentially equivalent to:
to_datetime(arg, format=pd.tools.guess_datetime_format(arg[0]))
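A runnable version of that idea might look roughly like the sketch below. The import location is an assumption: it tries `pandas.tseries.api` first and falls back to the private path quoted earlier in this thread. `parse_with_first_format` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

try:
    # Assumed public location in recent pandas releases.
    from pandas.tseries.api import guess_datetime_format
except ImportError:
    # Private location quoted earlier in this thread.
    from pandas._libs.tslibs.parsing import guess_datetime_format


def parse_with_first_format(values):
    """Guess a format from values[0], then parse all values with it."""
    inferred = guess_datetime_format(values[0])
    if inferred is None:
        raise ValueError(f"could not guess a format from {values[0]!r}")
    # A value that does not match the guessed format raises ValueError here.
    return inferred, pd.to_datetime(values, format=inferred)


inferred, parsed = parse_with_first_format(
    ["12-01-2000 00:00:00", "12-31-2000 00:00:00"]
)
print(inferred)
print(parsed)
```

Returning the guessed format alongside the result also gives the caller a way to see what was inferred, which was asked about earlier in the discussion.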
+1 on the big picture. Implementation-wise, is the idea to pretty much replace our usage of dateutil.parser?
Thanks
dateutil.parser would still be used within guess_datetime_format, but then subsequent rows would be parsed with that guessed format rather than repeatedly calling dateutil.parser and risk having it silently switch format
I'm also on board with the suggestion to make guess_datetime_format public
Thanks all for your feedback, I've incorporated some points into the document
There have been a couple of approvals, so for now I've changed the status to accepted
There have been 3 explicit approvals from core members, no "requested changes", and this has been open all week, so merging now
Thanks all for the discussion, aiming to start working on this soon-ish
I couldn't review before merged, but happy with the proposal here, really nice improvement.
I'm curious what's the suggested workaround mentioned in the PDEP for parsing columns with different formats. Should we mention it in the PDEP? Or the idea is to write it in the logs when implemented?
Also, it seems like a few formats are broken when rendered: https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html
Thanks - yeah it's mentioned at the end of "detailed description"
If a user has dates in a mixed format, they can still use flexible parsing and accept the risks that poses, e.g.:
In [3]: pd.Series(['12-01-2000 00:00:00', '13-01-2000 00:00:00']).apply(pd.to_datetime)
Out[3]:
0 2000-12-01
1 2000-01-13
dtype: datetime64[ns]
I'm curious what's the suggested workaround mentioned in the PDEP for parsing columns with different formats.
One way I read this question (and not sure if it's the right way) is if a frame has two columns each with a different (but self-consistent) format. They would be parsed individually, is that correct @MarcoGorelli?
Ah apologies, I'd misunderstood
I presume you mean in read_csv? If so, then yes, e.g.:
data = io.StringIO('13-01-2000 00:00,13 Jan 2000\n14-01-2000 00:00,14 Jan 2000\n')
print(pd.read_csv(data, header=None, parse_dates=[0, 1]))
would give
           0          1
0 2000-01-13 2000-01-13
1 2000-01-14 2000-01-14
Each column was parsed according to the format inferred from its respective first row.
If the first row of the first column had been 12-01-2000 00:00, then the inferred format would've been month-first, and so read_csv would've returned it unaltered as object, whilst the second column would still have been successfully converted
This wouldn't deviate from the current behaviour
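If relying on per-column inference feels too implicit, one alternative is to read the columns as strings and convert each with its own explicit format. This is a sketch only: the `formats` mapping is a hypothetical name, and the two format strings are assumptions about what this particular sample data contains.

```python
import io

import pandas as pd

data = io.StringIO("13-01-2000 00:00,13 Jan 2000\n14-01-2000 00:00,14 Jan 2000\n")

# Read both columns as plain strings first, with no date inference at all.
df = pd.read_csv(data, header=None, dtype=str)

# Hypothetical mapping from column label to the format expected in that column.
formats = {0: "%d-%m-%Y %H:%M", 1: "%d %b %Y"}

for col, fmt in formats.items():
    # Raises ValueError if any value in the column does not match fmt.
    df[col] = pd.to_datetime(df[col], format=fmt)

print(df)
```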
Actually what I had in mind was answered in the first answer, I don't think my question was very clear. I missed the workaround when reading the PDEP, sorry.
@MarcoGorelli was PDEP-4 fully implemented? Should we change its state to Implemented?
Also, I see there was a revision. Would it make sense to add the link to the revision PR to the Discussion field, next to the original one?