dateparser Improve the default parsing of 20110101

Open Gallaecio opened this issue 4 years ago • 15 comments

While I don’t think https://github.com/scrapinghub/dateparser/issues/606 is necessarily a bug, I do think that it would be an improvement for Dateparser to parse '20110101' the same as '2011-01-01' by default.

Feb 17 '20 16:02 Gallaecio

Hello community! This is Vyom Goel. I would love to work on this issue. Is there an active IRC channel where I can discuss issues? Thanks.

Feb 18 '20 19:02 Vyom16

We can discuss them on GitHub; that way they are easy to find in the future. IRC logs get lost easily :)

Feb 19 '20 09:02 Gallaecio

Hi Gallaecio, 20110101 is interpreted by default as 2-01-1010-1 (M-D-Y-H), which is also a valid date. Parsing '20110101' the same as '2011-01-01' means changing the default format to YMD, but other inputs can also be in MDY format, so this is a case of ambiguity. Can you please explain further what exactly your requirement is?

Mar 03 '20 17:03 grestonian

Hey @Gallaecio, is this issue still open? I was looking forward to some more information on this but haven't seen any replies recently. I just wanted to check that the issue is still valid.

Mar 08 '20 16:03 Gs-001

@grestonian I suspect that MDY is a more widely used format than MDYH. When facing a case of ambiguity, I think Dateparser should behave in the most intuitive way by default, and in this case I believe MDY is that.

Mar 12 '20 11:03 Gallaecio

@Gallaecio This is true for English, but not for all languages. In a French context, for instance (and I suspect we're not alone), the widely used, idiomatic, default format would be DMY. Besides, I think the computer-science standard default is YMD, because it sorts chronologically when sorted alphanumerically.

Maybe MDY can be picked if the contextual language can be reasonably inferred to be English, DMY can be picked for other languages, and YMD can be picked if no language information is available. But I'm almost sure this would lead to very confusing situations.

Or else, make the standard YMD the only default option?

Mar 12 '20 16:03 iago-lito

I think the default date format should be YMD, as @iago-lito specified; that would hold true in the specific case mentioned in this issue too. Date format is actually specific to a geographical location, so I think the date format should change based on one's location. The location could be inferred from the time zone or some other value from the user's system. In a case where we are not able to get the location, we should fall back to YMD. Let me know your thoughts on this idea, or whether this is somehow accounted for in the current implementation.

Mar 12 '20 17:03 Gs-001

I’m not discussing what the default date order of Dateparser should be. What I’m discussing is what I think should be the default parsing of 20110101 (or 01012011) when no setting (e.g. date order) is specified. I think it should be the 1st of January of 2011 in both cases.

Such a change may be much more complex than changing the default date order.

Mar 12 '20 18:03 Gallaecio

@Gallaecio I think I'm getting you. Do you mean that it should somehow be detected in 20110101 and 01012011 that the 2011 part is most likely a year (because it's 4 digits long and it starts with 20-), and that the format (YMD or DMY) should then be inferred from this hint?

I agree that it is appealing. It is ambiguous for sure but dateparser will never be exact.

It brings up the idea that not only the contextual language is relevant for inference, but also the set of dates the user is currently working with, because I guess that historical years not of the form 20XX but 12XX or even XXX would mess up that inference. So the change would be complex indeed.

Are you suggesting anything regarding the way the format YMD/DMY/MDY should be inferred from the 8 digits? How would

20202001 or
01201702 or
01202020 or even
20202020

be parsed then?

Mar 13 '20 08:03 iago-lito

I have no suggestion regarding the implementation.

Regarding those example input strings, I would expect the default output to be:

20202001 → 2020-01-20
01201702 → 1702-01-20
01202020 → 2020-01-20
20202020 → None

Mar 13 '20 09:03 Gallaecio

Okay, so one not-too-complicated algorithm would resemble:

1. Parse XXXXXXXX as YMD, YDM, DMY and MDY.
2. Eliminate the ones with impossible M or D values.
3. Among the remaining ones, eliminate those whose Y value starts with 0.
4. Take a decision depending on the remaining possible formats:

YMD	YDM	DMY	MDY	decision
0	0	0	0	None
0	0	0	1	MDY
0	0	1	0	DMY
0	1	0	0	YDM
1	0	0	0	YMD
0	0	1	1	? language-based or DMY ?
0	1	0	1	? ambiguous ?
0	1	1	0	? language-based or DMY ?
0	1	1	1	? language-based or DMY ?
1	0	0	1	YMD
1	0	1	0	YMD
1	0	1	1	YMD
1	1	0	0	YMD
1	1	0	1	YMD
1	1	1	0	YMD
1	1	1	1	YMD

I think this is rather close to what a human would actually do, but some possibilities are still ambiguous, like:

YDM or MDY but no standard, e.g. 12313112, what to decide?
English-style or other style but no standard, e.g. 09102015, how to decide?
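
A minimal sketch of the four steps above, in plain Python and not tied to Dateparser's internals. The names `LAYOUTS`, `candidates` and `decide` are made up for illustration. Two choices here are assumptions rather than something settled in this thread: ambiguous cases simply return None, and candidates that all resolve to the same calendar date (as with 01012011) are treated as unambiguous.

```python
from datetime import date

# Where the (year, month, day) digits sit in an 8-digit string, per layout.
LAYOUTS = {
    "YMD": (slice(0, 4), slice(4, 6), slice(6, 8)),
    "YDM": (slice(0, 4), slice(6, 8), slice(4, 6)),
    "DMY": (slice(4, 8), slice(2, 4), slice(0, 2)),
    "MDY": (slice(4, 8), slice(0, 2), slice(2, 4)),
}


def candidates(text):
    """Steps 1-3: return {layout: date} for every layout that survives."""
    found = {}
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        return found
    for name, (y, m, d) in LAYOUTS.items():
        if text[y].startswith("0"):
            continue  # step 3: drop years with a leading zero
        try:
            found[name] = date(int(text[y]), int(text[m]), int(text[d]))
        except ValueError:
            continue  # step 2: impossible month or day
    return found


def decide(text):
    """Step 4: YMD wins when valid; otherwise accept a single distinct date."""
    found = candidates(text)
    if "YMD" in found:
        return found["YMD"]
    distinct = set(found.values())
    if len(distinct) == 1:
        return distinct.pop()
    return None  # no candidate, or ambiguous (assumption: give up)


print(decide("20110101"))  # YMD is valid, so it wins
print(decide("09102015"))  # DMY and MDY disagree, so this is ambiguous
```

With this sketch, 20202001, 01201702, 01202020 and 20202020 come out as expected earlier in the thread, while 12313112 and 09102015 stay undecided.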

Mar 13 '20 13:03 iago-lito

I have implemented a possible solution to this issue, #639, which is along the lines suggested by @iago-lito. It only changes how dates of the form XXXXXXXX are parsed when no date-order setting is given.

If such a date can be successfully parsed as one of these ['%m%d%Y' , '%d%m%Y' , '%Y%m%d' , '%Y%d%m' , '%m%Y%d', '%d%Y%m'] (the order can be a bit contentious, but at this time it seems the best to me), then this format is used as opposed to one involving %H and %s. If none of them matches, it falls back to the default parsing method.
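
To illustrate the idea of trying these formats in order, here is a rough sketch using only the standard library. It is not the code from the PR: the function name `first_matching_format` is made up, and the year check is an assumption added for this example, because `strptime` happily accepts '%d%m%Y' for 20110101 and reads the year as 0101.

```python
from datetime import datetime

# Format order as listed in the comment above.
EIGHT_DIGIT_FORMATS = ["%m%d%Y", "%d%m%Y", "%Y%m%d", "%Y%d%m", "%m%Y%d", "%d%Y%m"]


def first_matching_format(text):
    """Return the first successful parse of an 8-digit string, else None."""
    for fmt in EIGHT_DIGIT_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not fit the digits
        # Assumption for this sketch: reject years that need a leading zero,
        # otherwise '20110101' would match '%d%m%Y' as year 101.
        if parsed.year >= 1000:
            return parsed
    return None  # caller would fall back to the default parsing here


print(first_matching_format("20110101"))
print(first_matching_format("01012011"))
```

Because the list is ordered, an input such as 09102015 resolves to whichever of MDY or DMY comes first, which is why the order matters.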

@Gallaecio could you please review and suggest changes, if any?

Mar 19 '20 12:03 arnavkapoor

Reopened to handle the scenario also for the default value of DATE_ORDER.

Jun 04 '20 11:06 Gallaecio

I am not sure if I am doing something wrong here, but when I try the same date, dateparser returns None for me.

>>> import dateparser
>>> dateparser.parse('20110104')
>>>

I am using Python 3.10.4 and dateparser 1.1.1

Update: It is a date with no spaces. So we need to add the parser no-spaces-time to the PARSERS list. It works with that.

>>> dateparser.parse('20110104', settings={'PARSERS': ['no-spaces-time']})
datetime.datetime(1010, 2, 1, 4, 0)

Jun 18 '22 11:06 gutsytechster

I tried to understand why I don't get the correct result even though the logic for it has been implemented. There is a bug in the current implementation.

In the following piece of code - https://github.com/scrapinghub/dateparser/blob/a7ca7a50423ef06cb7b8c80b825b9c158debf950/dateparser/parser.py#L168-L175

There will always be a DATE_ORDER, which defaults to MDY, so the code will never reach the else part of the logic. The next thing I wondered was: if it was implemented this way, why didn't the tests fail? Apparently, there is a bug in the tests as well. The following piece of code sets DATE_ORDER to an empty string. https://github.com/scrapinghub/dateparser/blob/a7ca7a50423ef06cb7b8c80b825b9c158debf950/tests/test_parser.py#L319

However, when you try to do the same in the REPL, it gives you an error that an empty string is not a valid value for the settings.

>>> import dateparser
>>> dateparser.parse('20110104', settings={'PARSERS': ['no-spaces-time'], 'DATE_ORDER':''})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gutsytechster/Documents/FOSS/dateparser/dateparser/conf.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/home/gutsytechster/Documents/FOSS/dateparser/dateparser/__init__.py", line 58, in parse
    parser = DateDataParser(languages=languages, locales=locales,
  File "/home/gutsytechster/Documents/FOSS/dateparser/dateparser/conf.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/home/gutsytechster/Documents/FOSS/dateparser/dateparser/date.py", line 377, in __init__
    check_settings(settings)
  File "/home/gutsytechster/Documents/FOSS/dateparser/dateparser/conf.py", line 252, in check_settings
    raise SettingValidationError(
dateparser.conf.SettingValidationError: "" is not a valid value for "DATE_ORDER", it should be: "DMY", "DYM", "MDY", "MYD", "YDM" or "YMD"

It simply means that the tests do not validate the settings values they set, giving us the impression that nothing is breaking. When I applied the check_settings method in the same test class, multiple tests failed, because DATE_ORDER is set to '' for other test cases as well.

I believe this happened because the settings in the tests were not picking up the default values. We can make the tests use the default values, as an actual function call would.
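
To make the problem concrete, here is a small self-contained model of it. None of this is Dateparser code: `choose_branch`, `public_entry` and the simplified `check_settings` are stand-ins written for this comment, with only the list of valid orders taken from the error message above.

```python
VALID_DATE_ORDERS = {"DMY", "DYM", "MDY", "MYD", "YDM", "YMD"}
DEFAULT_SETTINGS = {"DATE_ORDER": "MDY"}


class SettingValidationError(ValueError):
    """Stand-in for the error raised on invalid settings."""


def check_settings(settings):
    # Simplified validation: only DATE_ORDER is checked in this model.
    order = settings["DATE_ORDER"]
    if order not in VALID_DATE_ORDERS:
        raise SettingValidationError(
            f'"{order}" is not a valid value for "DATE_ORDER"'
        )


def choose_branch(settings):
    # Stand-in for the parser code: the else branch needs a falsy DATE_ORDER.
    if settings["DATE_ORDER"]:
        return "explicit-order"
    return "eight-digit-heuristic"


def public_entry(user_settings=None):
    # Mimics a real call: defaults are merged in, then validated.
    settings = {**DEFAULT_SETTINGS, **(user_settings or {})}
    check_settings(settings)
    return choose_branch(settings)


# A real call always carries the default, so the heuristic is never reached.
print(public_entry())
# A test that skips validation can still reach it with an empty string.
print(choose_branch({"DATE_ORDER": ""}))
```

In this model, the only way to reach the second branch is to bypass validation, which is what the tests were doing.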

Jun 18 '22 13:06 gutsytechster
