Inconsistent date parsing of to_datetime

Open jorisvandenbossche opened this issue 8 years ago • 34 comments

As this comes up regularly:

  • [x] https://github.com/pydata/pandas/issues/12583
  • [x] https://github.com/pydata/pandas/issues/12501
  • [x] https://github.com/pydata/pandas/issues/7348
  • [x] https://github.com/pandas-dev/pandas/issues/15075
  • [x] #12611 deal with yearfirst/datefirst in read_csv
  • [x] #12730, compat with dateutil 2.5.2 dayfirst=True, yearfirst=True fixes
  • [x] https://github.com/pydata/pandas/issues/14301 raise if inconsistency in read_csv with dayfirst/yearfirst
  • [x] performance https://github.com/pandas-dev/pandas/issues/25848

The problem

Depending on the string formats, you can get inconsistent parsing (meaning: different interpretation of parts of the strings as days vs months between different values in the passed list/array of strings). For example:

In [22]: pd.to_datetime(["31/12/2014", "10/03/2011"])
Out[22]: DatetimeIndex(['2014-12-31', '2011-10-03'])

where the first is parsed as 'dayfirst' and the second not. The above particular case can be solved by passing dayfirst=True, but in any case the default behaviour of to_datetime results in inconsistent parsing here. Another typical case is where you do provide dayfirst=True but have a malformed date; it will then silently fall back to 'not dayfirst' (https://github.com/pydata/pandas/issues/3341).
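
For illustration, the same fallback at the pandas level (behaviour of pandas versions contemporary with this issue; more recent releases warn or raise instead of silently falling back):

In [23]: pd.to_datetime("01/12/2012", dayfirst=True)
Out[23]: Timestamp('2012-12-01 00:00:00')

In [24]: pd.to_datetime("01/13/2012", dayfirst=True)   # 13 cannot be a month, dayfirst is silently ignored
Out[24]: Timestamp('2012-01-13 00:00:00')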

The reason

The reason is the flexibility of dateutil.parser.parse. This flexibility is first of all a feature, as it lets you parse messy datetime columns. But when it comes to dayfirst/yearfirst, it gives unexpected results. dateutil will 'try' to use dayfirst=True, but when this does not give a valid date, it will ignore dayfirst=True and parse the string anyway:

In [3]: dateutil.parser.parse('1/12/2012', dayfirst=True)
Out[3]: datetime.datetime(2012, 12, 1, 0, 0)

In [4]: dateutil.parser.parse('1/13/2012', dayfirst=True)
Out[4]: datetime.datetime(2012, 1, 13, 0, 0)

The same is true for dayfirst=False (the default), as this will also try 'day first' if the other way around fails. The issue about the lack of strictness of dayfirst is https://github.com/pydata/pandas/issues/3341.
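
For completeness, the mirror image of the examples above, where dayfirst=False silently falls back to day-first:

In [5]: dateutil.parser.parse('13/1/2012')            # dayfirst=False is the default
Out[5]: datetime.datetime(2012, 1, 13, 0, 0)          # 13 cannot be a month, so it is taken as the day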

Possible solutions

  • In some cases passing dayfirst=True is sufficient (if you don't have messy or malformed dates).
  • The typical answer is: "if you want to be sure to have strict and consistent parsing, provide a format argument." We should probably stress this more in the docstring of to_datetime (see the sketch below).
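
As a rough illustration of the second point (the format string here is of course specific to these example values):

# strict, consistent parsing: every element must match the given format
pd.to_datetime(["31/12/2014", "10/03/2011"], format="%d/%m/%Y")
# DatetimeIndex(['2014-12-31', '2011-03-10'], dtype='datetime64[ns]', freq=None)

# a value that does not fit the format then raises (ValueError) instead of being reinterpreted
pd.to_datetime(["31/12/2014", "1/13/2012"], format="%d/%m/%Y")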

But, regardless of the two points above, the fact is that the default behaviour of to_datetime (without passing any arguments) can do inconsistent parsing and so give unexpected results. Possible ways to get better default behaviour:

  • Have the possibility to specify that dateutil should be strict about dayfirst. I raised an issue for that here: https://github.com/dateutil/dateutil/issues/214
  • Changing the default of infer_datetime_format to True would solve it in some cases (where the first datetime has the correct format you want), but certainly not all.
  • The idea of raising a warning when inconsistent parsing happened is, AFAIK, not possible, as pandas does not know how dateutil parsed the dates. And always warning when no format is passed or could be inferred (so in practice, whenever dateutil is used) is maybe too drastic, as the flexible parsing is also a feature. (A rough user-level detection sketch follows below.)
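
As a user-level workaround for the last point, one possible (hypothetical, not built into pandas) way to flag ambiguous values is to parse each string twice and compare; a minimal sketch:

import pandas as pd

def find_ambiguous(strings):
    """Return the values whose interpretation depends on dayfirst."""
    ambiguous = []
    for s in strings:
        a = pd.to_datetime(s, dayfirst=False, errors="coerce")
        b = pd.to_datetime(s, dayfirst=True, errors="coerce")
        if pd.notna(a) and pd.notna(b) and a != b:
            ambiguous.append(s)
    return ambiguous

find_ambiguous(["31/12/2014", "10/03/2011"])   # ['10/03/2011']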

jorisvandenbossche avatar Mar 10 '16 10:03 jorisvandenbossche

So we ONLY hit the dateutil path (and hence use dayfirst/yearfirst settings) IF we fail to parse ISO8601 (IOW an exception is raised).

So easiest thing to do is track in array_to_datetime when/if this happens.

  • If we don't trigger the exception at all, then all dates are consistent (as ISO 8601, meaning dayfirst=False, yearfirst=False).
  • If we have at least 1 time we hit the exception (but not all), then we have a potential ambiguity (meaning some dates were parsed as ISO 8601 and some maybe were not)
  • If we hit the exception EVERY time, then we have NO ISO8601 dates and I think the dayfirst/yearfirst flags would make this unambiguous (somewhat depends on dateutil).

So this is a relatively easy first step; we could easily raise an ambiguity error and force the user to figure it out (unless coercing, of course).

We could also (maybe only once an ambiguity error occurs) try several formats to see what parses.

These options take into account all dates; to make this more performant we might want to take a sample (or better, a sample of the uniques).
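
A rough Python-level sketch of that idea (the real change would live in array_to_datetime in Cython; the helper name and the sampling threshold here are made up for illustration):

import numpy as np
import pandas as pd

def check_iso_consistency(values, sample_size=100):
    # work on a sample of the uniques, as suggested above
    uniq = pd.unique(np.asarray(values, dtype=object))
    if len(uniq) > sample_size:
        uniq = np.random.choice(uniq, sample_size, replace=False)

    def is_iso(s):
        try:
            np.datetime64(s)   # strict ISO 8601; anything else would go down the dateutil path
            return True
        except ValueError:
            return False

    hits = sum(is_iso(s) for s in uniq)
    if 0 < hits < len(uniq):
        raise ValueError("mix of ISO 8601 and non-ISO date strings: parsing may be inconsistent")

check_iso_consistency(["2014-12-31", "10/03/2011"])   # raises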

jreback avatar Mar 10 '16 13:03 jreback

The dayfirst parameter to the reading functions such as read_csv is even more confusing, as multiple datetime columns with different formats might co-exist in the same dataset.

dolaameng avatar Mar 11 '16 05:03 dolaameng

Since mixed date formats within one series is unusual, why not raise a warning every time there are ambiguous values?

lordgrenville avatar Aug 16 '18 13:08 lordgrenville

Since mixed date formats within one series is unusual, why not raise a warning every time there are ambiguous values?

I am not sure there is a way to find out from dateutil which format it assumed when parsing a date, so it will also be difficult to warn about different formats.

jorisvandenbossche avatar Aug 16 '18 14:08 jorisvandenbossche

Just a minimal example a colleague created as we stumbled over this:

import pandas as pd
from io import StringIO

DATA=StringIO("""timestamp,num
03/09/2018 12:02:00,1
14/09/2018 14:09:00,2
""")
df = pd.read_csv(DATA, parse_dates=['timestamp'])

gives

            timestamp  num
0 2018-03-09 12:02:00    1
1 2018-09-14 14:09:00    2

but it should either have raised an exception or better give

            timestamp  num
0 2018-09-03 12:02:00    1   <----------- wrong month/day order!
1 2018-09-14 14:09:00    2

MartinThoma avatar Feb 11 '19 18:02 MartinThoma

@MartinThoma that seems like a bug, not necessarily related to this issue. Can you open a new issue for it?

jorisvandenbossche avatar Feb 14 '19 22:02 jorisvandenbossche

@jorisvandenbossche Sure, I can do that. But looking at your first post (section "The problem") I'm pretty sure it is exactly the same problem as here.

MartinThoma avatar Feb 15 '19 05:02 MartinThoma

Ah, sorry, I missed the fact that this is an "unorthodox" format (the default format, at least in the US, is month first, so month/day/year). Yes, then the behaviour you see is totally "expected" (given the current rules as outlined above). The best thing you can do is specify a format (or specify dayfirst=True, but I wouldn't recommend that, for the non-strictness explained above).
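
For the example above, specifying a format could look like this (date_parser was the spelling at the time; recent pandas versions offer a date_format argument instead):

df = pd.read_csv(
    DATA,
    parse_dates=["timestamp"],
    date_parser=lambda col: pd.to_datetime(col, format="%d/%m/%Y %H:%M:%S"),
)
#             timestamp  num
# 0 2018-09-03 12:02:00    1
# 1 2018-09-14 14:09:00    2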

jorisvandenbossche avatar Feb 15 '19 07:02 jorisvandenbossche

I also came across this issue of inconsistent parsing of non-ISO8601 formats (i.e. MMDDYYYY and DDMMYYYY). I understand this is a known issue involving dateutil but maybe we can enhance infer_datetime_format so there is a format to parse towards in order to avoid such ambiguity.

_guess_datetime_format_for_array only looks at the first non-null value. Potentially we (I'm happy to contribute myself) can make it more robust by searching for inconsistencies and setting dayfirst=True if the formats inferred across the array are not unique. If they are still not unique, raise an exception, so that to_datetime with infer_datetime_format=True gives strict parsing (a rough sketch follows below).
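
A small stdlib-only sketch of that idea (the candidate list is illustrative, not pandas' internal one):

from datetime import datetime

CANDIDATES = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d"]

def _parses(s, fmt):
    try:
        datetime.strptime(s, fmt)
        return True
    except ValueError:
        return False

def formats_matching(s):
    return {fmt for fmt in CANDIDATES if _parses(s, fmt)}

def infer_common_format(values):
    # keep only the formats that fit every value, instead of just the first one
    common = None
    for s in values:
        fmts = formats_matching(s)
        common = fmts if common is None else common & fmts
    if not common:
        raise ValueError("no single format fits every value")
    return common

infer_common_format(["31/12/2014", "10/03/2011"])   # {'%d/%m/%Y'}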

@jorisvandenbossche @jreback your thoughts?

minggli avatar Feb 16 '19 21:02 minggli

you would do this in array_to_datetime

once a date is parsed with dayfirst or yearfirst then it would be an error to not parse others in the same way

jreback avatar Feb 16 '19 21:02 jreback

I have looked into this possibility; however, array_to_datetime ultimately uses dateutil.parser.parse, which treats dayfirst as a soft hint and silently reverts if it doesn't match. I don't see how this parse function reports back whether dayfirst or yearfirst was actually applied, so I suppose changes have to be sought in dateutil.

Alternatively, we can enhance _guess_datetime_format_for_array and raise an error if more than one datetime format is inferred for the array. It currently only checks the first item, so its conclusion is not representative of the whole array.

minggli avatar Feb 24 '19 14:02 minggli

I think the title and the discussion fail to make explicit the enormity of the issue: in pandas, a column can contain dates formatted in different ways, so that one row is dd-mm-yyyy and another row is mm-dd-yyyy. This might sort of be inferred from reading here, but in no way is it explicit. An example with minimal code to reproduce this is in my other post referenced just above: https://github.com/pandas-dev/pandas/issues/25847 Try telling this to anyone with a SQL background and watch them drop dead! :)

IMHO this needs to be added to the docs as a matter of urgency (I am, in fact, drafting something I hope the community will accept).

A few more points:

  • this happens with read_csv(), as per the example linked above. So I guess that read_csv uses to_datetime? I haven't found this mentioned clearly in the docs.

  • @jreback, what is array_to_datetime which you mention? Is it a function? Of which library? I couldn't find it; if I search the pandas docs for it, nothing comes up: https://pandas.pydata.org/pandas-docs/stable/search.html?q=array_to_datetime&check_keywords=yes&area=default After some googling, it seems it was part of pandas.tslib, which has now been deprecated - is that correct? Did it solve the problem, as @jreback said, or not, as @minggli reported?

  • Is there any way to force a column in a pandas dataframe to contain items of the same type? I.e. to get pandas to behave like SQL, so that you can specify a date format for a column and be sure the whole column will follow that format?

  • The docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html?highlight=to_datetime#pandas.to_datetime say that to_datetime returns one of these 3:

  1. list-like: DatetimeIndex
    
  2. Series: Series of datetime64 dtype
    
  3. scalar: Timestamp
    

However, in the code linked above, it returns a Series of dtype object, NOT of dtype datetime64 as stated in the docs.

garfieldthecat avatar Mar 23 '19 21:03 garfieldthecat

@garfieldthecat pls have a read thru the very extensive documentation; this having mixed dayfirst datetimes is only a problem at parse time; you can specify multiple options to properly parse, including dayfirst and format strings

pandas of course has single dtypes for efficiency; though it allows object for mixed types

jreback avatar Mar 23 '19 23:03 jreback

this having mixed dayfirst datetimes is only a problem at parse time; you can specify multiple options to properly parse including dayfirst and format strings

Only? Really? I am simply speechless. It seems clear you totally fail to grasp how huge an issue this is. Look, I have no interest in convincing you of anything - my point is simply that this can be a huge issue for many users (even if it doesn't affect you and your workflow), that this is not clear at all from the docs, and that it must be made so.

For example, anyone who has been importing csv with dates, and who didn't explicitly state the format because they thought there could be no ambiguity (if their days go all the way to the 31st) and who didn't realise that a column can contain dates in different formats, may have imported dates incorrectly and therefore any function, analysis etc based on those dates may be wrong. This is not big, it's HUGE.

@garfieldthecat pls have a read thru the very extensive documentation

Could I please ask you to be a bit more specific on what can be found where in the extensive documentation? Like I said:

  • array_to_datetime seems to have been deprecated. Is that so? If yes, do you know of an up to date alternative? If not, where is it documented, since it's not in the pandas docs?

  • the pandas docs are wrong in saying what kind of objects to_datetime can return, as I explained above. Do you agree? And can I therefore continue to work on my draft of what I think needs updating in the pandas docs? Or am I missing something?

  • I didn't find any explicit mention that read_csv uses to_datetime. Without this clarification, it is not clear at all that to_datetime affects the output of read_csv. The title of this discussion should really be amended to reflect that this bug 1) affects read_csv and 2) means that a pandas column can have dates with different formats, so that one row is dd-mm-yyyy and another mm-dd-yyyy.

  • is there another solution, other than always explicitly specifying the date format, which is in the docs and which I missed?

By the way, what saved me was that I almost always load the data into a SQL server after cleansing it in pandas; SQL would have picked up any inconsistency in the date formats. Had I not had this SQL step in my workflow, I'd probably have had to double-check YEARS of code and analyses to make sure that nothing was messed up.

garfieldthecat avatar Mar 24 '19 09:03 garfieldthecat

well, see http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table for date parsing, and the doc-string: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html?highlight=to_datetime#pandas.to_datetime

I am not sure what else you are referring to; array_to_datetime is an internal implementation, so if you want to fix this issue, then certainly look at that.

the pandas docs are wrong in saying what kind of objects to_datetime can return, as I explained above. Do you agree? And can I therefore continue to work on my draft of what I think needs updating in the pandas docs? Or am I missing something?

and exactly how are they wrong? these are the things that are returned.

You are welcome to propose updates to doc-strings and/or docs, but I would first review in detail what pandas has to offer.

jreback avatar Mar 24 '19 18:03 jreback

Again, unless I missed it, the links you post do not clearly explain that read_csv relies on to_datetime - which is why I didn't find this issue: I was looking for issues with read_csv.

Again, unless I missed it, they also do not clearly explain that, in pandas, one row in a date column can be mm-dd-yyyy while another row in the same column can be dd-mm-yyyy. This is something that should be written in huge bold letters in the documentation of read_csv.

I have drafted something, but what is the best way to propose changes to the documentation (not to the code)? Discuss it on a github forum first to reach a consensus? Just write it and post it to github?

PS about what to_datetime() returns: I was sure I remembered it returned dtype object. I guess I was wrong - apologies.

garfieldthecat avatar Mar 24 '19 23:03 garfieldthecat

you are welcome to submit a PR: http://pandas.pydata.org/pandas-docs/stable/development/contributing.html

jreback avatar Mar 24 '19 23:03 jreback

Agreeing with @garfieldthecat - this is a very serious problem, in that correctly formatted dd/mm datetimes will often, if not usually, be garbled by pandas with the default settings, because pandas can and will reverse the dd/mm within a column, as it guesses separately for each date. It's very easy for the user to miss that this has happened.

I don't think there's any common use for the current behavior, where the guess is per-entry, so:

  • By default, the guessed format should be the same for each date in the series. I would personally prefer an incorrect guess of mm/dd for the first value, and the same incorrect guess for all subsequent values, to the current behavior.
  • The ideal would be to use a guess that gives a valid date for all dates in the series. This would involve back-tracking to switch days/months when the current date proves valid for dd/mm but not mm/dd. This would only be triggered for the dangerous cases, so even with the performance hit, I think that's an improvement (see the sketch after this list).
  • If anyone does have a use for switching formats mid-series, I guess this could be a keyword option. To me, that seems so pathological, that it's reasonable to ask the user to write a custom function for that case.
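
A minimal sketch of the back-tracking idea from the second bullet (the helper name and the candidate formats are made up for illustration; this is not how pandas implements parsing):

import pandas as pd

def parse_consistently(strings, formats=("%m/%d/%Y", "%d/%m/%Y")):
    # try each candidate order on the whole series; back-track to the next
    # order only if some value does not fit the current one
    for fmt in formats:
        try:
            return pd.to_datetime(strings, format=fmt)   # strict: all-or-nothing
        except ValueError:
            continue
    raise ValueError("no single day/month order fits every value")

parse_consistently(["03/09/2018", "14/09/2018"])
# DatetimeIndex(['2018-09-03', '2018-09-14'], dtype='datetime64[ns]', freq=None)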

matthew-brett avatar May 14 '19 12:05 matthew-brett

cc @anmyachev, who recently updated the internal parsing of dayfirst to be in Cython. A patch for this issue would now be somewhat performant.

jreback avatar May 14 '19 13:05 jreback

pandas is of course a completely volunteer project with almost 3000 issues, so statements like this

this is a very serious problem

are not very helpful to anyone. pull-requests are the way to solve issues.

jreback avatar May 14 '19 13:05 jreback

I love Pandas and use it all the time; I am sorry if I did not appear grateful for the hard work that goes into it. I'm sure you know, but I do contribute to Pandas from time to time, and other foundation libraries, so I know this is work, and I'm happy to help. I say 'serious' to be helpful, not critical - in the same way I would put an importance label on a bug, on a bug-tracker. I'm discussing here rather than on a PR, because I don't know my way around the code, or who is invested in it, yet.

matthew-brett avatar May 14 '19 13:05 matthew-brett

@matthew-brett didn't really mean to call you out :-D; more specifically, prioritization of anything in pandas is subject to resource availability, meaning you can't prioritize, no matter how 'important' something is. Things really get prioritized when someone (including a core dev) submits a PR.

jreback avatar May 14 '19 14:05 jreback

Yes, sorry, I realize 'very serious problem' does sound as if I mean that you owe me something, and it's obviously the other way round, so I'm sorry for that implication, I didn't mean it that way, honest.

matthew-brett avatar May 14 '19 14:05 matthew-brett

I am new to Pandas and github - please be gentle.

If I understand the thread correctly - this is an unaddressed issue for 3.5 years? Don't you think that explicitly coding how a date should be read is an unreasonable solution? It's like importing integers and coding "randomly_scramble_my_number=False". This error is enough to ban use of Pandas in some large organisations :(

erpda avatar Feb 25 '20 18:02 erpda

I am new to Pandas and github - please be gentle.

If I understand the thread correctly - this is an unaddressed issue for 3.5 years? Don't you think that explicitly coding how a date should be read is an unreasonable solution? It's like importing integers and coding "randomly_scramble_my_number=False". This error is enough to ban use of Pandas in some large organisations :(

@erpda

There are 3000+ open issues in pandas. How should the handful of mostly volunteer folks prioritize? if large orgs want to just take and not help fix then I am not sure what to say.

open source is about finding, reporting, AND helping to fix. there are many orgs and many, many individual contributors to pandas and open source generally that have built an amazing ecosystem.

Would love for large orgs to step up and seriously fund open source. just taking and not giving back at all IMHO is terrible.

jreback avatar Feb 25 '20 22:02 jreback

Dear all,

It seems to_datetime is causing a significant problem. Here is the code I ran. When I compare the EARTHQUAKEDATE field of the original dataframe with the copy that to_datetime was applied to, I see the values are unfortunately different.

[screenshots of the code and the resulting dataframes]

The original is in DD/MM/YYYY hh:mm:ss format: EARTHQUAKEDATE 05/10/1977 05:34:43

But the modified one comes out in YYYY-DD-MM hh:mm:ss format: EARTHQUAKEDATE 1977-05-10 05:34:43

I would expect the result in YYYY-MM-DD hh:mm:ss format.

I'd like to do anything I can to help with a solution.
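
For that specific input, an explicit format string makes the result unambiguous (the column name is taken from the comment above):

pd.to_datetime("05/10/1977 05:34:43", format="%d/%m/%Y %H:%M:%S")
# Timestamp('1977-10-05 05:34:43')

# or for the whole column:
# df["EARTHQUAKEDATE"] = pd.to_datetime(df["EARTHQUAKEDATE"], format="%d/%m/%Y %H:%M:%S")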

ahmetanildindar avatar May 06 '20 10:05 ahmetanildindar

I've stumbled across this issue while working with our scientific dataset. Specifying the dayfirst argument to read_csv sorted us out. I would agree with the suggestion of printing a warning when inconsistent datetime parsing may be occurring. Could this be done after parsing, by checking for an inconsistent mapping from the string input to the datetime output across the whole series? Would this introduce a performance penalty? (A rough sketch of such a check is below.)
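
A post-hoc round-trip check along those lines (illustrative only; it assumes zero-padded inputs and one candidate format, and it costs one extra strftime pass over the series):

import pandas as pd

def check_round_trip(raw, parsed, fmt="%d/%m/%Y"):
    # re-render the parsed values with one candidate format and flag rows
    # that no longer match their input string
    rendered = parsed.dt.strftime(fmt)
    mismatches = raw[rendered != raw.str.strip()]
    if len(mismatches):
        print("possibly inconsistent parses:", list(mismatches))

raw = pd.Series(["31/12/2014", "10/03/2011"])
parsed = pd.to_datetime(raw)   # default parsing
check_round_trip(raw, parsed)  # flags '10/03/2011' on versions with the inconsistent default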

I'm keen to contribute a fix but I'm a newbie to the pandas project.

Craskermasker avatar Jul 01 '20 13:07 Craskermasker

@Craskermasker happy to have a contribution towards this

jreback avatar Jul 02 '20 01:07 jreback

I barely caught this error when trying to process a dataset, and I feel that it should be addressed. My first thought was to make dayfirst default to True, but this does not work: if you input 10.20.2020 it converts it to 2020-10-20 without throwing an error :open_mouth:.

The only solution I see is to force the use of format, including making it a mandatory argument when using parse_dates in read_csv/read_tsv.

In read_csv there should be a check that all dates are converted using the same format. From digging through the source code, I would expect that this needs to be implemented in https://github.com/pandas-dev/pandas/blob/2990c1ddd137fcaef46d2cde47334314047a580a/pandas/_libs/tslibs/parsing.pyx#L591

In the meantime I have reverted to using numpy.genfromtxt.

paalge avatar Jul 02 '20 13:07 paalge

Just to give an update on this, the example from this issue will now show a warning:

In [1]: import pandas as pd

In [2]: pd.to_datetime(["31/12/2014", "10/03/2011"])
<ipython-input-2-9e192223af3c>:1: UserWarning: Parsing '31/12/2014' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
  pd.to_datetime(["31/12/2014", "10/03/2011"])
Out[2]: DatetimeIndex(['2014-12-31', '2011-10-03'], dtype='datetime64[ns]', freq=None)

What's missing is to show a warning when the date string isn't a plain delimited date string (e.g. "31/12/2014 00:00:00") - this would involve adding a warning where du_parse is called, and it would result in many false positives, but maybe that's OK.

MarcoGorelli avatar Aug 27 '21 08:08 MarcoGorelli