pandas
pandas copied to clipboard
BUG: Behavior with fallback between raise and coerce #46071
- [ ] closes #46071 (Replace xxxx with the Github issue number)
- [ ] Tests added and passed if fixing a bug or adding a new feature
- [ ] All code checks passed.
- [ ] Added type annotations to new arguments/methods/functions.
- [ ] Added an entry in the latest
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.
The problem reported by this issue is that success in parsing a date string depends on how errors are reported (coerce vs raise). How the error is reported should not affect whether or not a parsing error occurs. Initially it appears that experiment 1 should return an error (as experiment 3 does) because the second date string does not match the format established by the first string. However, there is a fallback mechanism that is supposed to try parsing again without the effect of the infer_datetime_format
flag. So in fact, experiment 3 should succeed in parsing the date as experiment 1 does.
The problem is that the fallback mechanism only works when errors=raise
. This bug fix applies the same logic when errors=coerce
so that the fallback parsing will take place regardless of how errors are reported.
Thanks for the PR @srotondo.
Do you mind reverting the unrelated changes (removing or adding blank lines) please? Also, we need to add a test that captures the problem, fails with the original code, and passes after your changes. The issue number, better have it as a comment in the test, not in the code.
Also, we need a release note with the change. You can have a look at any merged bug PRs as a reference.
@datapythonista I've added the test and the release note to this code, is there anything else you would like me to do?
The fallback behavior is apparent in the code but not mentioned in the documentation for infer_datetime_format
.
infer_datetime_format: bool, default False If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.
I think it would be accurate to add one more sentence at the end:
If subsequent datetime strings do not follow the inferred format, parsing will fall back to the slower method of determining the format for each string individually.
That should be sufficient to remove the expectation that an error will be raised without committing to an exact description of the fallback parsing.
@datapythonista I believe I have made the appropriate changes. Is there anything else?
@datapythonista I believe that I have made all the changes you have requested, including simplifying the tests. Is there anything else you would like me to do?
@datapythonista I've split up the test into multiple pieces like you asked.
Thanks @srotondo, this is more readable now.
But looks like you're somehow reimplementing the tests in here, no? Would be good to start by adding the exact case or test that your fix addresses, which is not immediately clear to me. Thanks!
But looks like you're somehow reimplementing the tests in here, no? Would be good to start by adding the exact case or test that your fix addresses, which is not immediately clear to me. Thanks!
@datapythonista It might look like some of these tests are reimplementations, but there are some key differences in the cases that are being tested. Given the line number you linked in that test, I assume you are mainly referring to these 3 tests in that file: test_datetime_invalid_scalar
, test_datetime_outofbounds_scalar
, and test_datetime_invalid_index
. While they do share a few similarities, there are some important key differences between these tests and my added tests.
To start with, test_datetime_invalid_scalar
and test_datetime_outofbounds_scalar
only deal in scalar values, unlike my tests, which deal with lists. The difference from test_datetime_invalid_index
lies largely in the data being tested. The list used in test_datetime_invalid_index
is entirely made up of values that will not parse, while the lists in my tests always start with a valid date. This is important because infer_datetime_format
only becomes relevant on the value after the first valid date, since it would need a format already to infer from. Therefore this test doesn't really check the fallback code as my tests are meant to.
Thanks @srotondo that makes sense. Is there a reason why the tests you added are not next to the ones I mentioned? Seems like they should be next to each other. Or, not sure if it could make sense to add an extra assert in those lines to check the same cases when the invalid values are in a Series following a valid value.
And it's still not clear to me what's the exact case that you're fixing. Or are all the tests that you added failing without the change in the code function?
@datapythonista The true failing case is that when 2 valid dates with different formats were passed in using the errors = "coerce"
option and infer_datetime_format = True
, to_datetime
would return NaT
for the second value, when it should correctly parse the second value. The second and third tests I wrote ensure that coerce
still returns NaT
when it should return NaT
and that the cases where errors = "raise"
still raises exceptions when it should, respectively.
The choice of file for this test is mostly arbitrary, and I can move it if you'd like. However, could you please tell me if there is anything else you would like me to change as well so that I can do all of it at the same time and we can get this bugfix merged.
Just wanted to ask whether this is really what would be best - why not make
infer_datetime_format
strict:
- if the format is inferred, and some row doesn't conform to it, it just errors or coerces, rather than falling back to the slower path
@MarcoGorelli There exists large amounts of code with the express purpose of causing the operation to fall back if the formats aren't the same. The original documentation seems to strongly suggest that infer_datetime_format
exists to cause parsing to speed up whenever possible, rather than causing otherwise valid dates to fail to parse. Also, given how many different ways a date string can be written, forcing all inputs to to_datetime
to be the same format doesn't seem like the best course of action to me. I think that making this operation strict would go against the intended functionality.
Currently, the docs don't specify what happens if the format can be inferred, but subsequent rows don't match that format
And users don't expect the format to change midway, especially when using this option, e.g. as in https://github.com/pandas-dev/pandas/issues/46210
cc @mroeschke @jreback @jorisvandenbossche if you have comments
TL;DR: I'm suggesting that when using infer_datetime_format=True
, the format detected from the first non-NaN value should be used to parse the rest of the Series, exactly as if the user had passed it to format=
Example Concretely, I think the following two should behave the same:
pd.to_datetime(['01-31-2000', '20-01-2000'], infer_datetime_format=True, errors='raise')
pd.to_datetime(['01-31-2000', '20-01-2000'], format='%m-%d-%Y', errors='raise')
Currently, the latter raises (as expected) whereas the former swaps format midway
@MarcoGorelli Even though it's not very well documented, looking at the code and following the pathway makes it very clear that the fallback is intentional. If you would like to change the behavior, I think that should be a separate PR. This code change deals with the fact that raise
and coerce
were behaving inconsistently with infer_datetime_format
, which has been fixed. I think that removing the fallback behavior altogether is a much larger and separate issue.
PDEP4 was accepted, and it looks like the implementation is almost in
Let's close this then
Thanks @srotondo for your PR - you've done great work, and I'm sorry to be closing it, I just think a different direction is a better direction for the future of pandas