YYYY-MM-DD interpreted as YYYY-DD-MM

Open lopuhin opened this issue 2 years ago • 7 comments

YYYY-MM-DD is interpreted as YYYY-DD-MM for Arabic, and apparently also for other languages that prefer DMY order. This looks strange: if the year comes first, shouldn't we ignore DMY / MDY and just use YMD for all locales?

Examples:

>>> dateparser.parse('2023-11-08', languages=['ar'])
datetime.datetime(2023, 8, 11, 0, 0)

>>> dateparser.parse('2023-11-08', languages=['en'], region='GB')
datetime.datetime(2023, 8, 11, 0, 0)

>>> dateparser.parse('2023-11-08', languages=['en'], region='US')
datetime.datetime(2023, 11, 8, 0, 0)

>>> dateparser.parse('2023-11-08', languages=['en'])
datetime.datetime(2023, 11, 8, 0, 0)

Side note: in reality the US also uses MDY date order, so if we interpreted en as en-US and it had MDY set, we'd parse a lot more dates incorrectly.
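
A minimal sketch of the "year first means YMD" idea, using only the standard library. The helper name parse_year_first and its fallback parameter are hypothetical and not part of dateparser; the sketch only shows a pre-check that could run before any locale-specific date order is applied:

```python
import re
from datetime import datetime

# Four-digit year first, then a one- or two-digit month and day.
YEAR_FIRST_RE = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})")


def parse_year_first(text, fallback=None):
    """Read YYYY-MM-DD as year-month-day regardless of locale.

    Hypothetical helper: `fallback` is any callable that takes the
    original string, e.g. a wrapper around a locale-aware parser.
    """
    match = YEAR_FIRST_RE.fullmatch(text.strip())
    if match:
        year, month, day = map(int, match.groups())
        try:
            return datetime(year, month, day)
        except ValueError:
            pass  # e.g. month 23: not a valid year-month-day date
    return fallback(text) if fallback is not None else None
```

The fallback could be something like lambda s: dateparser.parse(s, languages=['ar']), so that only year-first strings bypass the locale's date order.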

lopuhin avatar Nov 15 '23 18:11 lopuhin

https://github.com/scrapinghub/dateparser/pull/790 by @Gallaecio might be related, but I'm not sure it's enough, because the date we actually get is formatted in a stranger way, as 2023 - 11 - 08 (with extra spaces).
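
If the extra spaces turn out to matter, one possible pre-processing step is to strip whitespace around separators that sit between digits before handing the string to the parser. This is only a sketch; normalize_separators is a hypothetical helper, not something dateparser provides:

```python
import re


def normalize_separators(text):
    """Drop whitespace around '-', '/' or '.' when digits sit on both sides."""
    # The lookbehind/lookahead keep text like "30. september" untouched,
    # because the character after the separator is not a digit there.
    return re.sub(r"(?<=\d)\s*([-/.])\s*(?=\d)", r"\1", text)


print(normalize_separators("2023 - 11 - 08"))
```

Note that this would also rewrite arithmetic-looking text such as "5 - 3", so it is probably only suitable for fields that are known to hold dates.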

lopuhin avatar Nov 15 '23 18:11 lopuhin

According to Wikipedia, YDM is used in just a few countries: https://en.wikipedia.org/wiki/Calendar_date#Gregorian,year–day–month(YDM), but it looks like we're inferring that DMY (very popular) implies YDM (very rare).

lopuhin avatar Nov 15 '23 18:11 lopuhin

Came here to confirm this affects German as well, another language which uses DMY for local date formats, but also just found issue #765, which already reported this problem back in 2020...

keikoro avatar Feb 09 '24 13:02 keikoro

Interestingly, when it comes to ISO 8601 dates, DMY-related settings also seem to partially disable the built-in mechanism that swaps date components on impossible combinations, even though that mechanism could theoretically "save" roughly 60% of the misinterpreted dates (those with a day number > 12).
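
As a quick sanity check on that estimate, the share of calendar dates whose day number exceeds 12 (and which a component swap could therefore rescue) can be counted with the standard library. This is plain arithmetic, not a statement about dateparser internals:

```python
from datetime import date, timedelta


def share_with_day_over_12(year):
    """Count the dates in `year` whose day-of-month is greater than 12."""
    day = date(year, 1, 1)
    total = rescued = 0
    while day.year == year:
        total += 1
        if day.day > 12:
            rescued += 1
        day += timedelta(days=1)
    return rescued, total


rescued, total = share_with_day_over_12(2023)
print(f"{rescued}/{total} = {rescued / total:.1%}")  # 221/365 = 60.5%
```

For 2023 this gives 221 of 365 dates, about 60.5%, so a little under two thirds.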

Examples parsing the correctly formatted ISO date "1960-12-23":

>>> dateparser.parse("1960-12-23")  # default
datetime.datetime(1960, 12, 23, 0, 0)

>>> dateparser.parse("1960-12-23", languages=["en"])  # languages set to MDY language
datetime.datetime(1960, 12, 23, 0, 0)

>>> dateparser.parse("1960-12-23", languages=["de"])  # languages set to DMY language
# None (implicit)

>>> dateparser.parse("1960-12-23", languages=["en"], settings={"DATE_ORDER": "DMY"})  # languages set to MDY language, DATE_ORDER set to DMY
# None (implicit)

>>> dateparser.parse("1960-12-23", languages=["de"], settings={"DATE_ORDER": "MDY"})  # languages set to DMY language, DATE_ORDER set to MDY
datetime.datetime(1960, 12, 23, 0, 0)

... The last two examples have the same result even with PREFER_LOCALE_DATE_ORDER set (whether True or False).

Examples parsing the jumbled ISO date "1960-23-12":

>>> dateparser.parse("1960-23-12")  # default
datetime.datetime(1960, 12, 23, 0, 0)

>>> dateparser.parse("1960-23-12", languages=["en"])  # languages set to MDY language
datetime.datetime(1960, 12, 23, 0, 0)

>>> dateparser.parse("1960-23-12", languages=["de"])  # languages set to DMY language
datetime.datetime(1960, 12, 23, 0, 0)

>>> dateparser.parse("1960-23-12", settings={"DATE_ORDER": "DMY"})  # DATE_ORDER set to DMY
datetime.datetime(1960, 12, 23, 0, 0)

... The jumbled date is parsed correctly for all possible combinations of the above languages + DATE_ORDER settings, also with and without PREFER_LOCALE_DATE_ORDER set (whether True or False).
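
For illustration only, the kind of swap being discussed can be sketched as follows. The helper name ymd_with_swap is hypothetical and this is not dateparser's actual code; it reads year-first components as YMD and swaps month and day only when the straight reading is impossible:

```python
from datetime import datetime


def ymd_with_swap(year, first, second):
    """Interpret (year, first, second) as year-month-day.

    If that is not a valid date, try year-day-month instead;
    return None when neither reading is valid.
    """
    try:
        return datetime(year, first, second)
    except ValueError:
        pass
    try:
        return datetime(year, second, first)
    except ValueError:
        return None


print(ymd_with_swap(1960, 12, 23))  # straight YMD reading
print(ymd_with_swap(1960, 23, 12))  # month 23 is impossible, so swap
```

With this ordering, an ambiguous input such as (2023, 11, 8) is never swapped, because the YMD reading is already valid.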

keikoro avatar Feb 09 '24 16:02 keikoro

@lopuhin It looks like the problem can be worked around by including the format codes for YYYY-(M)M-(D)D in date_formats in addition to setting your DMY language:

>>> dateparser.parse("2023-11-08", languages=["ar"], date_formats=["%Y-%m-%d"])
datetime.datetime(2023, 11, 8, 0, 0)

Other dates will continue to be interpreted as DMY:

>>> dateparser.parse("8.11.23", languages=["ar"], date_formats=["%Y-%m-%d"])
datetime.datetime(2023, 11, 8, 0, 0)

>>> dateparser.parse("8/11/23", languages=["ar"], date_formats=["%Y-%m-%d"])
datetime.datetime(2023, 11, 8, 0, 0)

Same for other hyphenated dates, e.g. 08-11-23, though it'd probably be wise not to use hyphens at all with explicitly set languages, just to avoid confusion. Or to always require the full year everywhere and/or not to allow %y for YY in date_formats.

keikoro avatar Feb 09 '24 18:02 keikoro

Interesting, thanks for the suggestion, @keikoro. I wonder whether passing date_formats=["%Y-%m-%d"] can lead to any unwanted changes in date parsing for this or other languages?
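
One way to probe this with the standard library alone is to check which strings the format can match at all. The sketch below assumes that a date_formats entry behaves like datetime.strptime with the same format string; dateparser may also normalize or translate the input first, so treat this as an approximation:

```python
from datetime import datetime


def matches_format(text, fmt="%Y-%m-%d"):
    """Return the parsed datetime if `text` matches `fmt`, else None."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


for sample in ["2023-11-08", "2023-1-8", "8.11.23", "8/11/23", "08-11-23"]:
    print(sample, "->", matches_format(sample))
```

Only the first two samples match, because %Y requires a four-digit year and the separators must be hyphens. Under that assumption, the other strings would still fall through to the regular locale-based parsing.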

lopuhin avatar Feb 19 '24 16:02 lopuhin

I have a variation of this problem. I'm trying to parse invoice lines from a lot of different subcontractors. Most of the dates are in some variation of DMY or YMD, but never YDM, which is the interpretation I'm currently fighting. Refer to the code example below for some tests.

I was wondering if it'd be easy to amend the DATE_ORDER handling to accept a list of allowed values, or to add another setting like FORBIDDEN_DATE_ORDER to specifically forbid American-style dates.
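
To make the proposal concrete, here is a rough standard-library sketch of what "a list of allowed date orders" could mean for purely numeric dates. Nothing here exists in dateparser: parse_with_allowed_orders and its allowed_orders parameter are hypothetical, and requiring a four-digit year is an extra assumption made to keep the example unambiguous:

```python
import re
from datetime import datetime


def parse_with_allowed_orders(text, allowed_orders=("YMD", "DMY")):
    """Try each allowed order in turn; return the first valid date or None."""
    parts = re.findall(r"\d+", text)
    if len(parts) != 3:
        return None
    for order in allowed_orders:
        fields = dict(zip(order, parts))  # e.g. {"Y": "2024", "M": "11", "D": "02"}
        if len(fields["Y"]) != 4:
            continue  # assumption: the year is always written with four digits
        try:
            return datetime(int(fields["Y"]), int(fields["M"]), int(fields["D"]))
        except ValueError:
            continue  # impossible month or day under this order
    return None


print(parse_with_allowed_orders("2024-11-02"))
print(parse_with_allowed_orders("07/09/2024"))
```

Because YDM and MDY are simply absent from the list, an input like 1960-23-12 yields None instead of being silently reinterpreted.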

The code below requires rich, sorry about that.

import datetime

import dateparser.search
import rich


def extract_dates(sample):
    """Return search_dates() matches as (text, datetime, language) tuples, or None."""
    return dateparser.search.search_dates(
        sample,
        languages=['da', 'en'],
        settings={
            'PREFER_LOCALE_DATE_ORDER': True,
            'DATE_ORDER': 'DMY',
            'PREFER_DATES_FROM': 'past',
            'STRICT_PARSING': True,  # There must be a day, month, year
            'PARSERS': ['absolute-time'],
        },
        add_detected_language=True,
    )


# (input text, expected date)
tests = [
    ("sdds 07/09/2024  1. kons", datetime.datetime(2024, 9, 7, 0, 0)),
    ("sdds 30. september 2024 første kons", datetime.datetime(2024, 9, 30, 0, 0)),
    ("sdds 30. september 2024  1. kons", datetime.datetime(2024, 9, 30, 0, 0)),
    ("sdds 30. september 2024 sdw 1. kons", datetime.datetime(2024, 9, 30, 0, 0)),
    ("sdds 2024-11-02  1. kons", datetime.datetime(2024, 11, 2, 0, 0)),
    ("sdds 4. kons 3. februar 2023", datetime.datetime(2023, 2, 3, 0, 0)),
    ("sdds  4. kons 2. marts 2023", datetime.datetime(2023, 3, 2, 0, 0)),
]

for sample, correct in tests:
    matches = extract_dates(sample)
    if not matches:
        # search_dates() returns None when it finds no date in the text
        rich.print(f"(no date found) -- [red]{sample}[/red] -- None")
        continue

    # Only the first detected date is compared against the expected value
    date_part_of_string, result, language = matches[0]
    color = "green" if result == correct else "red"
    rich.print(f"{date_part_of_string} -- [{color}]{sample}[/{color}] -- {result}")

jchillerup avatar Sep 27 '24 14:09 jchillerup