ISO 8601 YYYYMMDD format

Open brl0 opened this issue 3 years ago • 2 comments

First, thanks for the great project!

Our project standardized on using dates in the ISO 8601 YYYYMMDD format. According to Wikipedia:

the standard allows both the "YYYY-MM-DD" and YYYYMMDD formats for complete calendar date representations

Dates in this format return None in the following example:

from datetime import datetime

import dateparser

result = dateparser.parse("20210629", languages=["en"])  # returns None
assert result == datetime(2021, 6, 29)  # fails

I then tried the following, which produced incorrect results.

from datetime import datetime

from dateparser.date import DateDataParser
from dateparser_data.settings import default_parsers

parsers = default_parsers.copy()
parsers.append('no-spaces-time')

ddp = DateDataParser(languages=["en"], settings={"PARSERS": parsers})

result = ddp.get_date_data("20210629")  # returns datetime(1062, 2, 2, 9, 0)
assert result["date_obj"] == datetime(2021, 6, 29)  # fails

This issue may be more or less related to a handful of other issues as well: #360, #765, #867, #914, some of which PR #790 may help address, although that may not address this particular format.

Since I began typing this issue, I figured out that I am able to get the correct results with dateparser.parse("20210629", date_formats=["%Y%m%d"]), which might work in this case. Although I do wonder, should this format be expected to work without explicitly defining the date_format?
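As an illustration of the explicit-format route, here is a minimal standard-library sketch that checks for the compact ISO 8601 form before deferring to any other parser. The helper name `parse_compact_iso_first` and its `fallback` parameter are hypothetical and not part of dateparser; in practice `fallback` could be `dateparser.parse`.

```python
import re
from datetime import datetime
from typing import Callable, Optional

# Exactly eight digits: YYYYMMDD with no separators.
_COMPACT_ISO = re.compile(r"\d{8}")


def parse_compact_iso_first(
    text: str,
    fallback: Optional[Callable[[str], Optional[datetime]]] = None,
) -> Optional[datetime]:
    """Try strict YYYYMMDD first, then defer to an optional fallback parser.

    `fallback` is a hypothetical hook (for example dateparser.parse).
    """
    candidate = text.strip()
    if _COMPACT_ISO.fullmatch(candidate):
        try:
            return datetime.strptime(candidate, "%Y%m%d")
        except ValueError:
            pass  # eight digits, but not a real calendar date (e.g. month 13)
    return fallback(text) if fallback is not None else None


print(parse_compact_iso_first("20210629"))  # 2021-06-29 00:00:00
```

The eight-digit check matters because `strptime` alone also accepts one-digit months and days, which would make shorter numeric strings ambiguous.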

On a related note, from what I could tell, DateDataParser does not accept a date_formats parameter, not even in the settings dictionary. Instead, it looks like it must be passed to the method called to parse the string, which seemed a bit unintuitive to me.
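One possible workaround, sketched under assumptions: a thin wrapper that stores the formats once and forwards them on every call. `FormatBoundParser` is a hypothetical name, not a dateparser class; the only thing it relies on is that the wrapped object exposes `get_date_data(date_string, date_formats=...)`, as `DateDataParser` does.

```python
from typing import Any, Optional, Sequence


class FormatBoundParser:
    """Hypothetical wrapper: bind date_formats once, forward them per call."""

    def __init__(self, parser: Any, date_formats: Optional[Sequence[str]] = None) -> None:
        # `parser` is any object with get_date_data(date_string, date_formats=...).
        self._parser = parser
        self._date_formats = list(date_formats) if date_formats else None

    def get_date_data(self, date_string: str, date_formats: Optional[Sequence[str]] = None) -> Any:
        # Formats given at call time take precedence over the bound ones.
        formats = date_formats if date_formats is not None else self._date_formats
        return self._parser.get_date_data(date_string, date_formats=formats)


# Intended usage (requires dateparser to be installed):
#   from dateparser.date import DateDataParser
#   bound = FormatBoundParser(DateDataParser(languages=["en"]), ["%Y%m%d"])
#   bound.get_date_data("20210629")
```

This keeps the library untouched; supporting date_formats in the constructor itself would be a change to dateparser.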

Also, while I haven't yet tried using this at scale, I wonder if the dateparser.parse function might have some opportunity for performance improvement by not instantiating the DateDataParser class on each call. One possible (untested) approach might be something like the following:

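from functools import lru_cache

from dateparser.date import DateDataParser

# Note: lru_cache needs hashable arguments, so a call such as
# get_ddp(languages=["en"]) raises TypeError unless the list is frozen first.
@lru_cache
def get_ddp(**kwargs):
    return DateDataParser(**kwargs)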

I would be happy to open separate issues for the above points if preferred.

Sorry for the lengthy post, and thanks again for your contributions to the community.

brl0 avatar Jun 29 '21 23:06 brl0

Hey @brl0!! Sorry for the late answer, and thank you for these comments, they are good points indeed!

As you mention, it would be amazing to split this into separate issues, as it's hard to discuss all the points at the same time.

  1. Date format. We could see whether we can support this format by default. The problem with these formats is that they are plain numbers, and supporting them produces a lot of false positives. You will see that there is a parser that is not enabled by default: the 'no-spaces-time' parser, which was removed from the default parsers because it was causing a lot of false positives. You can try it like this:
>>> dateparser.parse('20210629', languages=["en"], settings={'PARSERS': ['no-spaces-time']})
datetime.datetime(1062, 2, 2, 9, 0)

As we can add support to these formats individually with date_formats, I think it's not a big issue. I don't mean we shouldn't support it by default, but we have to do it carefully (maybe adding a new parser that can be enabled if desired?).

  2. DateDataParser does not accept date_formats. Yeah, that's weird. In fact, if I remember correctly, date_formats is applied twice: once before translating/cleaning and once after. Feel free to open a new issue and submit a PR with this change; it would be nice to have :)

  3. DateDataParser caching. If you check the dateparser/dateparser/__init__.py file, you will see that there's a "default_parser" to avoid instantiating it every time:

_default_parser = DateDataParser()

So a new parser is only instantiated when languages, locales or region are given, or when the settings are not the default ones:

    if languages or locales or region or not settings._default:
        parser = DateDataParser(languages=languages, locales=locales,
                                region=region, settings=settings)

However, dateparser.parse() may well be called multiple times with the same settings, languages, etc., so maybe we could cache this case too.

Let me know if you could work on any of these things and/or separate this issue :)

Thanks Brian!

noviluni avatar Jul 24 '21 13:07 noviluni

Thanks for the response @noviluni. As mentioned, I have created separate issues #949 and #950 to track the side note comments I made above.

If I find some time, I may take a look at trying to implement those suggestions, but since I am not sure if I will be able to get to it, don't let that stop anybody else that may have interest in doing so.

Regarding the ability to parse this particular date format, I guess I was somewhat surprised at the incorrect output of the no-spaces-time parser. I am curious if adding some optional date range validation capability might allow the parser object to progress through to other lower precedence parsers in the event that the validation fails. While I can appreciate the complexity of attempting to parse the multitude of combinations of date representation formats, I suppose in my narrow view of the world I think I would have expected standard date formats to take precedence. With that said, I do really appreciate the flexibility this library provides in accepting a variety of formats.
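To make the range-validation idea concrete, here is a small standard-library sketch. It does not reproduce dateparser's internals: `misreading_parser` is a stand-in built on an arbitrary `strptime` format, chosen only because it accepts "20210629" and yields a wrong date. The function `parse_with_validation` and the date window are likewise assumptions for illustration.

```python
from datetime import datetime
from typing import Callable, Iterable, Optional

Parser = Callable[[str], Optional[datetime]]


def strptime_parser(fmt: str) -> Parser:
    """Build a parser that returns None instead of raising on a mismatch."""
    def parse(text: str) -> Optional[datetime]:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            return None
    return parse


def parse_with_validation(
    text: str,
    parsers: Iterable[Parser],
    earliest: datetime,
    latest: datetime,
) -> Optional[datetime]:
    """Return the first result inside [earliest, latest].

    A result outside the window is treated as a miss, so parsers with
    lower precedence still get a chance.
    """
    for parse in parsers:
        result = parse(text)
        if result is not None and earliest <= result <= latest:
            return result
    return None


# Stand-in for a higher-precedence parser that misreads the input
# (day, hour, minute, two-digit year); NOT dateparser's actual logic.
misreading_parser = strptime_parser("%d%H%M%y")
compact_iso_parser = strptime_parser("%Y%m%d")

result = parse_with_validation(
    "20210629",
    [misreading_parser, compact_iso_parser],
    earliest=datetime(2000, 1, 1),
    latest=datetime(2025, 12, 31),
)
print(result)  # 2021-06-29 00:00:00
```

The trade-off is that the window has to come from the caller: a misreading that happens to land inside the window would still be accepted.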

brl0 avatar Jul 24 '21 20:07 brl0