Inconsistent date parsing of to_datetime
As this comes up regularly:

- [x] https://github.com/pydata/pandas/issues/12583
- [x] https://github.com/pydata/pandas/issues/12501
- [x] https://github.com/pydata/pandas/issues/7348
- [x] https://github.com/pandas-dev/pandas/issues/15075
- [x] #12611 deal with yearfirst/datefirst in read_csv
- [x] #12730, compat with dateutil 2.5.2 dayfirst=True, yearfirst=True fixes
- [x] https://github.com/pydata/pandas/issues/14301 raise if inconsistency in read_csv with dayfirst/yearfirst
- [x] performance https://github.com/pandas-dev/pandas/issues/25848

The problem
Depending on the string formats, you can get inconsistent parsing (meaning: different interpretation of parts of the strings as days vs months between different values in the passed list/array of strings). For example:
In [22]: pd.to_datetime(["31/12/2014", "10/03/2011"])
Out[22]: DatetimeIndex(['2014-12-31', '2011-10-03'])
where the first is parsed as 'dayfirst' and the second not. The above particular case can be solved by passing dayfirst=True, but in any case, the default behaviour of to_datetime results in inconsistent parsing here.
Another typical case is where you do provide dayfirst=True but have a malformed date; it then falls back to 'not dayfirst' (https://github.com/pydata/pandas/issues/3341).
The reason
The reason is the flexibility of dateutil.parser.parse. This is in the first place also a feature, as it lets you parse messy datetime columns. But when it comes to flexibility regarding dayfirst/yearfirst, this gives unexpected results.
So dateutil will 'try' to use dayfirst=True, but when this does not give a valid date, it will ignore dayfirst=True and parse the string anyway:
In [3]: dateutil.parser.parse('1/12/2012', dayfirst=True)
Out[3]: datetime.datetime(2012, 12, 1, 0, 0)
In [4]: dateutil.parser.parse('1/13/2012', dayfirst=True)
Out[4]: datetime.datetime(2012, 1, 13, 0, 0)
The same is true for dayfirst=False (the default), as this will also try 'day first' if the other way around failed.
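The reverse fallback can be seen directly with dateutil, using a date whose first number can only be a day:

```python
import dateutil.parser

# With the default dayfirst=False, '13/1/2012' cannot be month-first
# (there is no month 13), so dateutil silently falls back to day-first:
parsed = dateutil.parser.parse('13/1/2012')
print(parsed)  # → 2012-01-13 00:00:00
```

No error or warning signals that the requested ordering was abandoned.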
The issue for the lack of strictness of dayfirst is actually https://github.com/pydata/pandas/issues/3341.
Possible solutions
- In some cases passing dayfirst=True is sufficient (if you don't have messy or malformed dates).
- The typical answer is: "if you want to be sure to have strict and consistent parsing, provide a format argument". We should probably stress this more in the docstring of to_datetime.
But, regardless of the two points above, the fact is that the default behaviour of to_datetime (without passing any arguments) can do inconsistent parsing and so give unexpected results.
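For reference, a sketch of the strict-format workaround: an explicit format makes every value parse the same way, and non-matching values raise instead of being silently reinterpreted.

```python
import pandas as pd

# With an explicit format, every value is parsed the same way:
idx = pd.to_datetime(["31/12/2014", "10/03/2011"], format="%d/%m/%Y")
# idx → DatetimeIndex(['2014-12-31', '2011-03-10'])

# A value that cannot match the format raises rather than being
# silently parsed the other way around:
try:
    pd.to_datetime(["13/31/2011"], format="%d/%m/%Y")
except ValueError as exc:
    print("raised:", exc)
```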
Possible ways to have better default behaviour:
- Have the possibility to specify that dateutil should be strict about dayfirst. I raised an issue for that here: https://github.com/dateutil/dateutil/issues/214
- Changing the default of infer_datetime_format to True would solve it in some cases (where the first datetime has the correct format you want), but certainly not all.
- The idea of raising a warning when inconsistent parsing happened is AFAIK not possible, as pandas does not know how dateutil parsed the dates. And always warning when no format is passed or could be inferred (so in practice, when dateutil is used) is maybe too drastic, as the flexible parsing is also a feature.
So we ONLY hit the dateutil path (and hence use the dayfirst/yearfirst settings) IF we fail to parse ISO8601 (IOW an exception is raised). So the easiest thing to do is track in array_to_datetime when/if this happens.

- If we don't trigger the exception at all, then all dates are consistent (as ISO 8601, meaning dayfirst=False, yearfirst=False).
- If we hit the exception at least once (but not every time), then we have a potential ambiguity (meaning some dates were parsed as ISO 8601 and some maybe were not).
- If we hit the exception EVERY time, then we have NO ISO8601 dates and I think the dayfirst/yearfirst flags would make this unambiguous (somewhat depends on dateutil).
So this is a relatively easy first step: we could raise an ambiguity error and force the user to figure it out (unless coercing, of course).
We could also (perhaps only after an ambiguity error occurs) try several formats to see what parses.
These options take into account all dates; to make this more performant we might want to take a sample (a sample of the uniques is best).
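The tracking idea above could be sketched in pure Python (classify_iso is a hypothetical helper for illustration, not the actual Cython array_to_datetime internals, and it checks only a plain date format):

```python
from datetime import datetime

def classify_iso(values):
    """Flag which strings parse as plain ISO 8601 dates (hypothetical sketch)."""
    flags = []
    for v in values:
        try:
            datetime.strptime(v, "%Y-%m-%d")
            flags.append(True)
        except ValueError:
            flags.append(False)
    return flags

# All True → consistent ISO parsing; all False → dateutil path everywhere;
# mixed → the potentially ambiguous middle case described above:
classify_iso(["2014-12-31", "10/03/2011"])  # → [True, False]
```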
The dayfirst parameter to the reading functions such as read_csv is even more confusing, as multiple datetime columns with different formats might co-exist in the same dataset.
Since mixed date formats within one series is unusual, why not raise a warning every time there are ambiguous values?
> Since mixed date formats within one series is unusual, why not raise a warning every time there are ambiguous values?
I am not sure there is a way to get from dateutil how (assuming which format) it parsed the date, so then it will also be difficult to warn for different formats.
Just a minimal example a colleague created as we stumbled over this:
import pandas as pd
from io import StringIO
DATA=StringIO("""timestamp,num
03/09/2018 12:02:00,1
14/09/2018 14:09:00,2
""")
df = pd.read_csv(DATA, parse_dates=['timestamp'])
gives
timestamp num
0 2018-03-09 12:02:00 1
1 2018-09-14 14:09:00 2
but it should either have raised an exception or better give
timestamp num
0 2018-09-03 12:02:00 1 <----------- wrong month/day order!
1 2018-09-14 14:09:00 2
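For this particular CSV, passing dayfirst=True (with the non-strictness caveat discussed earlier in the thread) does produce the intended day-first parse:

```python
import pandas as pd
from io import StringIO

DATA = StringIO("""timestamp,num
03/09/2018 12:02:00,1
14/09/2018 14:09:00,2
""")

# dayfirst is forwarded by read_csv to the datetime parsing machinery:
df = pd.read_csv(DATA, parse_dates=["timestamp"], dayfirst=True)
# df["timestamp"][0] → 2018-09-03 12:02:00 (September 3rd, day-first)
```

Note this only works because every row here is a valid day-first date; a malformed row would still be silently re-interpreted, as described above.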
@MartinThoma that seems like a bug, not necessarily related to this issue. Can you open a new issue for it?
@jorisvandenbossche Sure, I can do that. But looking at your first post (section "The problem") I'm pretty sure it is exactly the same problem as here.
Ah, sorry, missed the fact that this is an "unorthodox" format (the default format (at least in the US) is month first, so month/day/year). Yes, then the behaviour you see is totally "expected" (given the current rules as outlined above). The best thing you can do is specify a format (or specify dayfirst=True, but I wouldn't recommend that for its non-strictness explained above).
I also came across this issue of inconsistent parsing of non-ISO8601 formats (i.e. MMDDYYYY and DDMMYYYY). I understand this is a known issue involving dateutil, but maybe we can enhance infer_datetime_format so there is a format to parse towards in order to avoid such ambiguity.
_guess_datetime_format_for_array only looks at the first non-null value. Potentially we (happy to contribute myself) can make it more robust by searching for inconsistency and setting dayfirst=True if the formats inferred in the array are not unique. If still not unique, raise an exception so that to_datetime with infer_datetime_format=True gives strict parsing.
@jorisvandenbossche @jreback your thoughts?
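A more robust inference along these lines could be sketched as follows (possible_formats, consistent_formats, and the small candidate list are hypothetical illustrations, not the actual _guess_datetime_format_for_array implementation):

```python
from datetime import datetime

CANDIDATES = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d"]

def possible_formats(value):
    """Return the candidate formats that successfully parse one string."""
    out = set()
    for fmt in CANDIDATES:
        try:
            datetime.strptime(value, fmt)
            out.add(fmt)
        except ValueError:
            pass
    return out

def consistent_formats(values):
    """Intersect per-value candidates; an empty result means no single
    format fits the whole array (the case where we would raise)."""
    common = None
    for v in values:
        fmts = possible_formats(v)
        common = fmts if common is None else common & fmts
    return common

# '31/12/2014' rules out month-first, so the whole array resolves
# unambiguously to day-first:
consistent_formats(["31/12/2014", "10/03/2011"])  # → {"%d/%m/%Y"}
```

Checking the whole array (or a sample of uniques, as suggested above) rather than only the first value is the key difference from the current behaviour.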
you would do this in array_to_datetime; once a date is parsed with dayfirst or yearfirst, then it would be an error to not parse others in the same way
I have looked into this possibility; however, array_to_datetime ultimately uses dateutil.parser.parse, which treats the dayfirst kwarg as a soft preference and reverts silently if there is no match. I don't see how this parse function gives any callback in terms of whether dayfirst or yearfirst was used. So I suppose changes have to be sought in dateutil.
Alternatively, we can enhance _guess_datetime_format_for_array and raise an error if the array has more than one datetime format inferred. It currently only checks the first item and therefore is not statistically significant in terms of the conclusion.
I think the title and the discussion fail to make explicit the enormity of the issue that, in pandas, a column can contain dates formatted in different ways, so that one row is dd-mm-yyyy and another row is mm-dd-yyyy . This might sort of maybe be inferred reading here, but in no way is it explicit. An example with minimal code to reproduce this is in my other post referenced just above: https://github.com/pandas-dev/pandas/issues/25847 Try telling this to anyone with a SQL background and watch them drop dead! :)
IMHO this needs to be added to the docs as a matter of urgency (I am, in fact, drafting something I hope the community will accept).
A few more points:

- this happens with read_csv(), as per the example linked above. So I guess that read_csv uses to_datetime? I haven't found it mentioned clearly in the docs.
- @jreback, what is array_to_datetime which you mention? Is it a function? Of which library? I couldn't find it. If I search the pandas docs for it, nothing comes up: https://pandas.pydata.org/pandas-docs/stable/search.html?q=array_to_datetime&check_keywords=yes&area=default After some googling, it seems it was part of pandas.tslib which has now been deprecated - is that correct? Did it solve the problem, as @jreback said, or not, as @minggli reported?
- Is there any way to force a column in a pandas dataframe to contain items of the same type? I.e. to get pandas to behave like SQL, so that you can specify a date format for a column, and be sure the whole column will follow that format?
- The docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html?highlight=to_datetime#pandas.to_datetime say that to_datetime returns one of these 3:
  - list-like: DatetimeIndex
  - Series: Series of datetime64 dtype
  - scalar: Timestamp

  However, in the code linked above, it returns a Series of dtype object, NOT of dtype datetime64 as stated in the docs.
@garfieldthecat pls have a read thru the very extensive documentation. This having mixed dayfirst datetimes is only a problem at parse time; you can specify multiple options to properly parse, including dayfirst and format strings.
pandas of course has single dtypes for efficiency, though it allows object for mixed types.
> this having mixed dayfirst datetimes is only a problem at parse time; you can specify multiple options to properly parse including dayfirst and format strings
Only? Really? I am simply speechless. It seems clear you totally fail to grasp how huge an issue this is. Look, I have no interest in convincing you of anything - my point is simply that this can be a huge issue for many users (even if it doesn't affect you and your workflow), that this is not clear at all from the docs, and that it must be made so.
For example, anyone who has been importing csv with dates, and who didn't explicitly state the format because they thought there could be no ambiguity (if their days go all the way to the 31st) and who didn't realise that a column can contain dates in different formats, may have imported dates incorrectly and therefore any function, analysis etc based on those dates may be wrong. This is not big, it's HUGE.
> @garfieldthecat pls have a read thru the very extensive documentation
Could I please ask you to be a bit more specific on what can be found where in the extensive documentation? Like I said:
- array_to_datetime seems to have been deprecated. Is that so? If yes, do you know of an up to date alternative? If not, where is it documented, since it's not in the pandas docs?
- the pandas docs are wrong in saying what kind of objects to_datetime can return, as I explained above. Do you agree? And can I therefore continue to work on my draft of what I think needs updating in the pandas docs? Or am I missing something?
- I didn't find any explicit mention that read_csv uses to_datetime. Without this clarification, it is not clear at all that to_datetime affects the output of read_csv. The title of this discussion should really be amended to reflect that this bug 1) affects read_csv and 2) means that a pandas column can have dates with different formats, so that one row is dd-mm-yyyy and another mm-dd-yyyy.
- is there another solution, other than always explicitly specifying the date format, which is in the docs and which I missed?
By the way, what saved me was that I almost always load the data into a SQL server after cleansing it in pandas; SQL would have picked up any inconsistency in the date formats. Had I not had this SQL step in my workflow, I'd probably have to double check YEARS of code and analyses to make sure that nothing was messed up.
well see: http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table for date parsing, and the doc-string: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html?highlight=to_datetime#pandas.to_datetime
I am not sure what else you are referring to; array_to_datetime is an internal implementation, so if you want to fix this issue, then certainly look at that.
> the pandas docs are wrong in saying what kind of objects to_datetime can return, as I explained above. Do you agree? And can I therefore continue to work on my draft of what I think needs updating in the pandas docs? Or am I missing something?
and exactly how are they wrong? these are the things that are returned.
You are welcome to propose updates to doc-strings and/or docs, but I would first review in detail what pandas has to offer.
Again, unless I missed it, the links you post do not clearly explain that read_csv relies on to_datetime - which is why I didn't find this issue: I was looking for issue with read_csv
Again, unless I missed it, they also do not clearly explain that, in pandas, one row in a date column can be mm-dd-yyyy while another row in the same column can be dd-mm-yyyy. This is something that should be written in huge bold letters in the documentation of read_csv
I have drafted something, but what is the best way to propose changes to the documentation (not to the code)? Discuss it on a github forum first to reach a consensus? Just write it and post it to github?
PS about what to_datetime() returns: I was sure I remembered it returned dtype object. I guess I was wrong - apologies.
you are welcome to submit a PR: http://pandas.pydata.org/pandas-docs/stable/development/contributing.html
Agreeing with @garfieldthecat - this is a very serious problem - in that correctly formatted dd/mm datetimes will often / usually be garbled by pandas using the default setting, because pandas can and will reverse the dd/mm within a column, as it is guessing separately for each date. It's very easy for the user to miss that this has happened.
I don't think there's any common use for the current behavior, where the guess is per-entry, so:
- By default, the guessed format should be the same for each date in the series. I would personally prefer an incorrect guess of mm/dd for the first value, and the same incorrect guess for all subsequent values, to the current behavior.
- The ideal would be to use a guess that gives a valid date for all dates in the series. This would involve back-tracking to switch days / months when the current date proves valid for dd/mm but not mm/dd. This would only be triggered for the dangerous cases, so even with the performance hit, I think that's an improvement.
- If anyone does have a use for switching formats mid-series, I guess this could be a keyword option. To me, that seems so pathological, that it's reasonable to ask the user to write a custom function for that case.
cc @anmyachev who recently updated the internal parsing of dayfirst to be in cython. A patch for this issue would now be somewhat performant.
pandas is of course a completely volunteer project with almost 3000 issues, so statements like this

> this is a very serious problem

are not very helpful to anyone. pull-requests are the way to solve issues.
I love Pandas and use it all the time; I am sorry if I did not appear grateful for the hard work that goes into it. I'm sure you know, but I do contribute to Pandas from time to time, and other foundation libraries, so I know this is work, and I'm happy to help. I say 'serious' to be helpful, not critical - in the same way I would put an importance label on a bug, on a bug-tracker. I'm discussing here rather than on a PR, because I don't know my way around the code, or who is invested in it, yet.
@matthew-brett didn't really mean to call you out :-D; more specifically, prioritization of anything in pandas is subject to resource availability, meaning you can't prioritize, no matter how 'important' something is. Things get prioritized really when someone (including a core dev) submits a PR.
Yes, sorry, I realize 'very serious problem' does sound as if I mean that you owe me something, and it's obviously the other way round, so I'm sorry for that implication, I didn't mean it that way, honest.
I am new to Pandas and github - please be gentle.
If I understand the thread correctly - this is an unaddressed issue for 3.5 years? Don't you think that explicitly coding how a date should be read is an unreasonable solution? It's like importing integers and coding "randomly_scramble_my_number=False". This error is enough to ban use of Pandas in some large organisations :(
> I am new to Pandas and github - please be gentle.
> If I understand the thread correctly - this is an unaddressed issue for 3.5 years? Don't you think that explicitly coding how a date should be read is an unreasonable solution? It's like importing integers and coding "randomly_scramble_my_number=False". This error is enough to ban use of Pandas in some large organisations :(
@erpda
There are 3000+ open issues in pandas. How should the handful of mostly volunteer folks prioritize? if large orgs want to just take and not help fix then I am not sure what to say.
open source is about finding, reporting, AND helping to fix. there are many orgs and many many individual contributors to pandas and open source generally that have built an amazing ecosystem.
Would love for large orgs to step up and seriously fund open source. just taking and not giving back at all IMHO is terrible.
Dear all,
It seems to_datetime is causing a significant problem. Here is the code I ran. When I compare the EARTHQUAKEDATE fields of the original dataframe and the copy with to_datetime applied, I see the values are unfortunately different.
The original is in DD/MM/YYYY hh:mm:ss format:
EARTHQUAKEDATE 05/10/1977 05:34:43
But the modified one is in YYYY/DD/MM hh:mm:ss format:
EARTHQUAKEDATE 1977-05-10 05:34:43
I would expect the result in YYYY-MM-DD hh:mm:ss format.
I'd like to do anything I can for the solution.
I've stumbled across this issue in working with our scientific dataset. Specifying the dayfirst argument to read_csv sorted us out. I would agree with the suggestion of printing a warning when inconsistent datetime parsing may be occurring. Can this be done after parsing by checking for inconsistent mapping from the string input to the datetime output across the whole series? Would this introduce a performance penalty?
I'm keen to contribute a fix but I'm a newbie to the pandas project.
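The post-parse consistency check suggested above could be sketched as a round-trip comparison (check_roundtrip is a hypothetical helper, not a pandas API; the parsed values below reproduce the inconsistent result from the top of this issue):

```python
import pandas as pd

def check_roundtrip(strings, parsed, fmt="%d/%m/%Y"):
    """Re-format each parsed timestamp and compare with its input string
    to spot silent day/month swaps (hypothetical helper)."""
    return [s == t.strftime(fmt) for s, t in zip(strings, parsed)]

strings = ["31/12/2014", "10/03/2011"]
# The inconsistent result reported at the top of this issue:
parsed = pd.to_datetime(["2014-12-31", "2011-10-03"])
check_roundtrip(strings, parsed)  # → [True, False]: the second value was swapped
```

The cost is one extra strftime pass over the array, so there would be some performance penalty, though sampling the uniques (as suggested earlier) could reduce it.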
@Craskermasker happy to have a contribution towards this
I barely caught this error when trying to process a dataset, and I feel that it should be addressed.
My first thought was to make dayfirst True by default, but this does not work: if you input 10.20.2020 it converts it to 2020-10-20 without throwing an error :open_mouth:.
The only solution I see is to force the use of format, including making it a mandatory argument when using parse_dates in read_csv/read_tsv.
In read_csv there should be a check that all dates are converted using the same format. From digging through the source code, I would expect this needs to be implemented in https://github.com/pandas-dev/pandas/blob/2990c1ddd137fcaef46d2cde47334314047a580a/pandas/_libs/tslibs/parsing.pyx#L591
In the meantime I have reverted to using numpy.genfromtxt
Just to give an update on this, the example from this issue will now show a warning:
In [1]: import pandas as pd
In [2]: pd.to_datetime(["31/12/2014", "10/03/2011"])
<ipython-input-2-9e192223af3c>:1: UserWarning: Parsing '31/12/2014' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
pd.to_datetime(["31/12/2014", "10/03/2011"])
Out[2]: DatetimeIndex(['2014-12-31', '2011-10-03'], dtype='datetime64[ns]', freq=None)
What's missing is to show a warning when the date string isn't a pure delimited date string (e.g. "31/12/2014 00:00:00") - this would involve adding a warning when du_parse is called, and would result in many false positives, but maybe that's OK.