dateparser
Improve the default parsing of 20110101
While I don’t think https://github.com/scrapinghub/dateparser/issues/606 is necessarily a bug, I do think that it would be an improvement for Dateparser to parse '20110101' the same as '2011-01-01' by default.
Hello community! This is Vyom Goel. I would love to work on this issue. Is there any active IRC channel where I can discuss issues? Thanks.
We can discuss them in GitHub, that way they are easy to find in the future. IRC logs get lost easily :)
Hi Gallaecio, 20110101 is interpreted by default as 2-01-1010-1 (M-D-Y-H), which is also a valid date. Parsing '20110101' the same as '2011-01-01' means changing the default format to YMD, but other inputs can also be in MDY format. So this is a case of ambiguity. Can you please explain further what exactly your requirement is?
Hey @Gallaecio, is this issue still open? I was looking forward to some more information on this but didn't see any replies recently. Just wanted to know whether the issue is still valid.
@grestonian I suspect that MDY is a more widely used format than MDYH. When facing a case of ambiguity, I think Dateparser should behave in the most intuitive way by default, and in this case I believe MDY is that.
@Gallaecio This is true for English, but not for all languages. In a French context for instance (but I suspect we're not alone), the widely used, idiomatic, default format would be DMY. Besides, I think the computer-science standard default is YMD because it sorts correctly on its own alphanumerically.
Maybe MDY can be picked if the contextual language can be reasonably inferred to be English, DMY can be picked within other languages, and YMD can be picked if no language information is available. But I'm almost sure this would lead to very confusing situations.
Or else, make the standard YMD the only default option?
I think the default date format should be YMD as @iago-lito specified; that would hold true in the specific case mentioned in this issue too. Date format is actually specific to a geographical location, so I think that the date format should change based on one's location. The location could be inferred from the time zone or some other value from the user's system. In a case where we are not able to get the location, we should fall back to YMD. Let me know your thoughts on this idea or if this is somehow accounted for in the current implementation.
I’m not discussing what the default date order of Dateparser should be. What I’m discussing is what I think should be the default parsing of 20110101 (or 01012011) when no setting (e.g. date order) is specified. I think it should be the 1st of January of 2011 in both cases.
Such a change may be much more complex than changing the default date order.
@Gallaecio I think I'm getting you. Do you mean that it should be somehow detected in 20110101 and 01012011 that the 2011 part is most likely to mean a year (because it's 4 digits long and it starts with 20-), and then infer the format (YMD or DMY) based on this hint?
I agree that it is appealing. It is ambiguous for sure, but dateparser will never be exact.
It brings up the idea that not only the contextual language is relevant for inference, but also the set of dates the user is currently working with, because I guess that historical years not of the form 20XX but 12XX or even XXX would mess up that inference. So the change would be complex indeed.
Are you suggesting anything regarding the way the format YMD/DMY/MDY should be inferred from the 8 digits? How would the following be parsed then?
- 20202001
- 01201702
- 01202020
- or even 20202020
I have no suggestion regarding the implementation.
Regarding those example input strings, I would expect the default output to be:
- 20202001 → 2020-01-20
- 01201702 → 1702-01-20
- 01202020 → 2020-01-20
- 20202020 → None
Okay, so one not-too-complicated algorithm would resemble:
1. parse XXXXXXXX as YMD, YDM, DMY and MDY.
2. eliminate the ones with impossible M or D values.
3. among the remaining ones, eliminate those whose Y value starts with 0.
4. Take a decision depending on remaining possible formats:
YMD | YDM | DMY | MDY | decision |
---|---|---|---|---|
0 | 0 | 0 | 0 | None |
0 | 0 | 0 | 1 | MDY |
0 | 0 | 1 | 0 | DMY |
0 | 1 | 0 | 0 | YDM |
1 | 0 | 0 | 0 | YMD |
0 | 0 | 1 | 1 | ? language-based or DMY ? |
0 | 1 | 0 | 1 | ? ambiguous ? |
0 | 1 | 1 | 0 | ? language-based or DMY ? |
0 | 1 | 1 | 1 | ? language-based or DMY ? |
1 | 0 | 0 | 1 | YMD |
1 | 0 | 1 | 0 | YMD |
1 | 0 | 1 | 1 | YMD |
1 | 1 | 0 | 0 | YMD |
1 | 1 | 0 | 1 | YMD |
1 | 1 | 1 | 0 | YMD |
1 | 1 | 1 | 1 | YMD |
I think this is rather close to what a human would actually do, but some possibilities are still ambiguous, like:
- YDM or MDY but no standard, e.g. 12313112: what to decide?
- English-style or other style but no standard, e.g. 09102015: how to decide?
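The four steps and the decision table above can be sketched in plain Python. This is only an illustration of the proposal, not dateparser's implementation: the names `LAYOUTS`, `candidates`, `decide` and the `preferred` parameter are hypothetical, and the two rows the table leaves open are filled with assumptions (DMY-containing ties use `preferred`, falling back to DMY; the YDM-vs-MDY tie returns None).

```python
from datetime import date

# (year, month, day) slices for each candidate layout of an 8-digit string.
LAYOUTS = {
    "YMD": (slice(0, 4), slice(4, 6), slice(6, 8)),
    "YDM": (slice(0, 4), slice(6, 8), slice(4, 6)),
    "DMY": (slice(4, 8), slice(2, 4), slice(0, 2)),
    "MDY": (slice(4, 8), slice(0, 2), slice(2, 4)),
}


def candidates(digits):
    """Steps 1-3: map each surviving layout name to the date it yields."""
    if len(digits) != 8 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 8 ASCII digits")
    found = {}
    for name, (y, m, d) in LAYOUTS.items():
        if digits[y].startswith("0"):  # step 3: year must not start with 0
            continue
        try:
            found[name] = date(int(digits[y]), int(digits[m]), int(digits[d]))
        except ValueError:  # step 2: impossible month or day
            continue
    return found


def decide(digits, preferred="DMY"):
    """Step 4: pick one date according to the decision table."""
    found = candidates(digits)
    if not found:
        return None
    if "YMD" in found:  # every row with YMD = 1 resolves to YMD
        return found["YMD"]
    if len(found) == 1:  # a single survivor wins
        return next(iter(found.values()))
    if "DMY" in found:
        # Assumption: the "language-based or DMY" rows use `preferred`
        # when it survived, and DMY otherwise.
        return found.get(preferred, found["DMY"])
    return None  # Assumption: YDM vs MDY is treated as unresolved
```

With these assumptions, `decide("20202001")` gives 2020-01-20, `decide("01201702")` gives 1702-01-20, `decide("01202020")` gives 2020-01-20 and `decide("20202020")` gives None, matching the outputs expected earlier in the thread. `decide("09102015")` returns 2015-10-09 by default and 2015-09-10 with `preferred="MDY"`, while `decide("12313112")` returns None.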
I have implemented a possible solution to this issue, #639, which is along the lines suggested by @iago-lito. It only changes how dates of the form XXXXXXXX without Date-Order settings are parsed.
In case such a date can be successfully parsed as one of these ['%m%d%Y' , '%d%m%Y' , '%Y%m%d' , '%Y%d%m' , '%m%Y%d', '%d%Y%m'] (the order can be a bit contentious, but at this time seems the best to me), then this format is used as opposed to one involving %H and %s.
In case it's not possible, it reverts back to the default parsing method.
@Gallaecio could you please review and suggest changes if any.
Reopened to handle the scenario also for the default value of DATE_ORDER.
I am not sure if I am doing something wrong here, but when I try the same date, dateparser returns None to me.
>>> import dateparser
>>> dateparser.parse('20110104')
>>>
I am using Python 3.10.4 and dateparser 1.1.1
Update: It is a date with no spaces. So we need to add the parser no-spaces-time to the PARSERS list. It works with that.
>>> dateparser.parse('20110104', settings={'PARSERS': ['no-spaces-time']})
datetime.datetime(1010, 2, 1, 4, 0)
I tried to understand why I don't get the correct results, even when we have the logic implemented for it. There is a bug with the current implementation.
In the following piece of code - https://github.com/scrapinghub/dateparser/blob/a7ca7a50423ef06cb7b8c80b825b9c158debf950/dateparser/parser.py#L168-L175
There will always be a DATE_ORDER, which defaults to MDY, and the code will never reach the else part of the logic. The next thing I wondered is: when it was implemented this way, why didn't the test fail? Apparently, there is a bug within the tests as well. The following piece of code sets the DATE_ORDER to an empty string.
https://github.com/scrapinghub/dateparser/blob/a7ca7a50423ef06cb7b8c80b825b9c158debf950/tests/test_parser.py#L319
However, when you try to do the same in the REPL, it gives you an error that an empty string is not a valid value for the settings.
>>> import dateparser
>>> dateparser.parse('20110104', settings={'PARSERS': ['no-spaces-time'], 'DATE_ORDER':''})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gutsytechster/Documents/FOSS/dateparser/dateparser/conf.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/home/gutsytechster/Documents/FOSS/dateparser/dateparser/__init__.py", line 58, in parse
    parser = DateDataParser(languages=languages, locales=locales,
  File "/home/gutsytechster/Documents/FOSS/dateparser/dateparser/conf.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/home/gutsytechster/Documents/FOSS/dateparser/dateparser/date.py", line 377, in __init__
    check_settings(settings)
  File "/home/gutsytechster/Documents/FOSS/dateparser/dateparser/conf.py", line 252, in check_settings
    raise SettingValidationError(
dateparser.conf.SettingValidationError: "" is not a valid value for "DATE_ORDER", it should be: "DMY", "DYM", "MDY", "MYD", "YDM" or "YMD"
It simply means that we are not checking the values of these settings being set within the test, giving us the impression that nothing is breaking. When I applied the check_settings method in the same test class, multiple tests fail because we have set the DATE_ORDER to '' for other test cases as well.
I believe this was because the settings in the test were not picking up the default values. We can make the test settings pick up the default values, as they would in an actual function call.
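The interaction between the always-set default and the unvalidated test settings can be modelled without dateparser itself. The sketch below is a stand-alone toy, not the library's code: `check_date_order`, `choose_branch` and the branch labels are hypothetical names, and the only facts taken from this thread are the MDY default, the list of valid values from the error message, and the assumption that the branch tests whether a date order is set.

```python
# Valid values as listed in the SettingValidationError message above.
VALID_DATE_ORDERS = ("DMY", "DYM", "MDY", "MYD", "YDM", "YMD")
DEFAULT_DATE_ORDER = "MDY"  # the default reported in this thread


def check_date_order(value):
    """Hypothetical stand-in for the validation a real call performs."""
    if value not in VALID_DATE_ORDERS:
        raise ValueError(f'"{value}" is not a valid value for "DATE_ORDER"')


def choose_branch(digits, date_order=DEFAULT_DATE_ORDER, validate=True):
    """Toy model of the branch: which path would an 8-digit string take?"""
    if validate:
        check_date_order(date_order)
    if len(digits) == 8 and date_order:
        return "explicit-order"  # taken whenever a date order is set
    return "preferred-formats"  # the 'else' part for 8-digit strings


# A normal call always carries the default, so the else part is skipped.
print(choose_branch("20110104"))
# Skipping validation, as the tests effectively did, lets '' through.
print(choose_branch("20110104", "", validate=False))
```

In this model the second path is reachable only when validation is skipped; calling `choose_branch("20110104", "")` with validation on raises an error, which mirrors the REPL traceback above.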