pandas icon indicating copy to clipboard operation
pandas copied to clipboard

BUG: Behavior with fallback between raise and coerce #46071

Open srotondo opened this issue 2 years ago • 13 comments

  • [ ] closes #46071 (Replace xxxx with the Github issue number)
  • [ ] Tests added and passed if fixing a bug or adding a new feature
  • [ ] All code checks passed.
  • [ ] Added type annotations to new arguments/methods/functions.
  • [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

The problem reported by this issue is that success in parsing a date string depends on how errors are reported (coerce vs raise). How the error is reported should not affect whether or not a parsing error occurs. Initially it appears that experiment 1 should return an error (as experiment 3 does) because the second date string does not match the format established by the first string. However, there is a fallback mechanism that is supposed to try parsing again without the effect of the infer_datetime_format flag. So in fact, experiment 3 should succeed in parsing the date as experiment 1 does.

The problem is that the fallback mechanism only works when errors=raise. This bug fix applies the same logic when errors=coerce so that the fallback parsing will take place regardless of how errors are reported.

srotondo avatar Jul 15 '22 23:07 srotondo

Thanks for the PR @srotondo.

Do you mind reverting the unrelated changes (removing or adding blank lines) please? Also, we need to add a test that captures the problem, fails with the original code, and passes after your changes. The issue number, better have it as a comment in the test, not in the code.

Also, we need a release note with the change. You can have a look at any merged bug PRs as a reference.

datapythonista avatar Jul 21 '22 04:07 datapythonista

@datapythonista I've added the test and the release note to this code, is there anything else you would like me to do?

srotondo avatar Aug 03 '22 19:08 srotondo

The fallback behavior is apparent in the code but not mentioned in the documentation for infer_datetime_format.

infer_datetime_format: bool, default False If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

I think it would be accurate to add one more sentence at the end:

If subsequent datetime strings do not follow the inferred format, parsing will fall back to the slower method of determining the format for each string individually.

That should be sufficient to remove the expectation that an error will be raised without committing to an exact description of the fallback parsing.

scott-r avatar Aug 10 '22 19:08 scott-r

@datapythonista I believe I have made the appropriate changes. Is there anything else?

srotondo avatar Aug 16 '22 04:08 srotondo

@datapythonista I believe that I have made all the changes you have requested, including simplifying the tests. Is there anything else you would like me to do?

srotondo avatar Aug 29 '22 17:08 srotondo

@datapythonista I've split up the test into multiple pieces like you asked.

srotondo avatar Sep 02 '22 20:09 srotondo

Thanks @srotondo, this is more readable now.

But looks like you're somehow reimplementing the tests in here, no? Would be good to start by adding the exact case or test that your fix addresses, which is not immediately clear to me. Thanks!

datapythonista avatar Sep 06 '22 14:09 datapythonista

But looks like you're somehow reimplementing the tests in here, no? Would be good to start by adding the exact case or test that your fix addresses, which is not immediately clear to me. Thanks!

@datapythonista It might look like some of these tests are reimplementations, but there are some key differences in the cases that are being tested. Given the line number you linked in that test, I assume you are mainly referring to these 3 tests in that file: test_datetime_invalid_scalar, test_datetime_outofbounds_scalar, and test_datetime_invalid_index. While they do share a few similarities, there are some important key differences between these tests and my added tests.

To start with, test_datetime_invalid_scalar and test_datetime_outofbounds_scalar only deal in scalar values, unlike my tests, which deal with lists. The difference from test_datetime_invalid_index lies largely in the data being tested. The list used in test_datetime_invalid_index is entirely made up of values that will not parse, while the lists in my tests always start with a valid date. This is important because infer_datetime_format only becomes relevant on the value after the first valid date, since it would need a format already to infer from. Therefore this test doesn't really check the fallback code as my tests are meant to.

srotondo avatar Sep 06 '22 17:09 srotondo

Thanks @srotondo that makes sense. Is there a reason why the tests you added are not next to the ones I mentioned? Seems like they should be next to each other. Or, not sure if it could make sense to add an extra assert in those lines to check the same cases when the invalid values are in a Series following a valid value.

And it's still not clear to me what's the exact case that you're fixing. Or are all the tests that you added failing without the change in the code function?

datapythonista avatar Sep 06 '22 18:09 datapythonista

@datapythonista The true failing case is that when 2 valid dates with different formats were passed in using the errors = "coerce" option and infer_datetime_format = True, to_datetime would return NaT for the second value, when it should correctly parse the second value. The second and third tests I wrote ensure that coerce still returns NaT when it should return NaT and that the cases where errors = "raise" still raises exceptions when it should, respectively.

The choice of file for this test is mostly arbitrary, and I can move it if you'd like. However, could you please tell me if there is anything else you would like me to change as well so that I can do all of it at the same time and we can get this bugfix merged.

srotondo avatar Sep 06 '22 20:09 srotondo

Just wanted to ask whether this is really what would be best - why not make infer_datetime_format strict:

  • if the format is inferred, and some row doesn't conform to it, it just errors or coerces, rather than falling back to the slower path

@MarcoGorelli There exists large amounts of code with the express purpose of causing the operation to fall back if the formats aren't the same. The original documentation seems to strongly suggest that infer_datetime_format exists to cause parsing to speed up whenever possible, rather than causing otherwise valid dates to fail to parse. Also, given how many different ways a date string can be written, forcing all inputs to to_datetime to be the same format doesn't seem like the best course of action to me. I think that making this operation strict would go against the intended functionality.

srotondo avatar Sep 16 '22 20:09 srotondo

Currently, the docs don't specify what happens if the format can be inferred, but subsequent rows don't match that format

And users don't expect the format to change midway, especially when using this option, e.g. as in https://github.com/pandas-dev/pandas/issues/46210

cc @mroeschke @jreback @jorisvandenbossche if you have comments

TL;DR: I'm suggesting that when using infer_datetime_format=True, the format detected from the first non-NaN value should be used to parse the rest of the Series, exactly as if the user had passed it to format=

Example Concretely, I think the following two should behave the same:

pd.to_datetime(['01-31-2000', '20-01-2000'], infer_datetime_format=True, errors='raise')
pd.to_datetime(['01-31-2000', '20-01-2000'], format='%m-%d-%Y', errors='raise')

Currently, the latter raises (as expected) whereas the former swaps format midway

MarcoGorelli avatar Sep 16 '22 20:09 MarcoGorelli

@MarcoGorelli Even though it's not very well documented, looking at the code and following the pathway makes it very clear that the fallback is intentional. If you would like to change the behavior, I think that should be a separate PR. This code change deals with the fact that raise and coerce were behaving inconsistently with infer_datetime_format, which has been fixed. I think that removing the fallback behavior altogether is a much larger and separate issue.

srotondo avatar Sep 16 '22 21:09 srotondo

PDEP4 was accepted, and it looks like the implementation is almost in

Let's close this then

Thanks @srotondo for your PR - you've done great work, and I'm sorry to be closing it, I just think a different direction is a better direction for the future of pandas

MarcoGorelli avatar Nov 28 '22 12:11 MarcoGorelli