
dateparser.parse() prefers current year also when 'STRICT_PARSING' is True

Open nico-kn opened this issue 2 years ago • 2 comments

When I use the dateparser.parse() functionality I expect it to behave the same way for two different input dates that are off by one year. However, based on the current implementation it seems to prefer the current year and detect that in a string. This can be seen in the following comparison:

>>> import dateparser
>>> dateparser.parse('03/06/2020a', settings={'STRICT_PARSING':True})
datetime.datetime(2, 1, 4, 9, 31, 4, 821531)
>>> dateparser.parse('03/06/2022a', settings={'STRICT_PARSING':True})
datetime.datetime(2022, 6, 3, 0, 0)

The two input strings differ only in the year: 2022a is parsed correctly because 2022 is the current year, while 2020a or 2021a is not. This is not deterministic and IMO not expected behavior, as the result for the same input will change depending on the year in which the code is run.

This applies to the default settings as well as the {'STRICT_PARSING':True} settings.

Could you please advise with which settings I could achieve a deterministic result here? Or ideally improve the implementation here / guide me towards where I could contribute and fix this issue?

nico-kn avatar Jan 04 '22 08:01 nico-kn

Hey @nico-kn, I think that if the 'a' at the end of the string is removed, it is parsed as expected. It is important to provide a valid input string (without the 'a' in this case), otherwise it will cause false positives, as happened here. Check out the false positives section of the README.
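To illustrate what cleansing the input beforehand could look like, here is a minimal sketch using only the Python standard library. It is not dateparser's implementation and does not use its API: `parse_strict` and `_DATE_RE` are hypothetical names, and the day-first `dd/mm/yyyy` format is an assumption based on the example output above. The idea is to extract the date-shaped substring explicitly and parse it with a fixed format, so the result does not depend on the current year.

```python
import re
from datetime import datetime

# Hypothetical helper for illustration only; not part of dateparser.
# Assumes day-first dates written as dd/mm/yyyy.
_DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def parse_strict(text, fmt="%d/%m/%Y"):
    """Return a datetime if a dd/mm/yyyy date is found in text, else None."""
    match = _DATE_RE.search(text)
    if match is None:
        return None
    try:
        # strptime uses only the explicit format, never the current date.
        return datetime.strptime(match.group(0), fmt)
    except ValueError:
        # The substring looked like a date but is not a valid one.
        return None


print(parse_strict("03/06/2020a"))  # 2020-06-03 00:00:00
print(parse_strict("03/06/2022a"))  # 2022-06-03 00:00:00
```

With this approach both inputs are treated the same way, and strings without a valid date return `None` instead of a guess.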

dhananjaypai08 avatar Jan 07 '22 19:01 dhananjaypai08

Hey @dhananjaypai08, it makes sense that input strings should be cleansed beforehand, and that skipping this step can cause false positives. Totally agreed.

But I still think that even if the input strings are not cleansed, i.e. if the trailing 'a' is still in the input string, the library should behave the same way for very similar input strings. So my remaining question is: why is the correct date extracted when the input string contains the current year, but not when it contains any other year?

My impression would be that this could be improved and I am happy to help here if someone helps to point me to the right places.

nico-kn avatar Jan 12 '22 09:01 nico-kn