Wrong prioritization of languages

Open ivanprado opened this issue 5 years ago • 6 comments

I think there is something wrong with how dateparser prioritizes languages: adding 'en' to the list, even in the last position, breaks the parsing of dates that were parsed correctly when English was not there.

import dateparser
dateparser.parse("11/12", languages=['en'])
Out[3]: datetime.datetime(2020, 11, 12, 0, 0)

This is right

dateparser.parse("11/12", languages=['es'])
Out[4]: datetime.datetime(2020, 12, 11, 0, 0)

This is also right, because the standard in Spain is DD/MM. But now if we add English to the languages list in the last position...

dateparser.parse("11/12", languages=['es', 'en'])
Out[5]: datetime.datetime(2020, 11, 12, 0, 0)

We got it parsed as in English, even though Spanish is first in the list of languages. This is unexpected to me; I would have expected Spanish to be prioritized instead.
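To make the expected behavior concrete, here is a toy model that uses only the standard library. It is not dateparser's code: the `DATE_ORDER` table and the `parse_day_month` helper are hypothetical names invented for this illustration. It only shows what "try the languages in the order given" would mean for an ambiguous string such as "11/12".

```python
from datetime import datetime

# Hypothetical per-language day/month order (illustration only,
# not dateparser's language data).
DATE_ORDER = {"en": "MD", "es": "DM"}


def parse_day_month(text, languages, year=2020):
    """Try each language in the given order and return the first valid date."""
    first, second = (int(part) for part in text.split("/"))
    for lang in languages:
        if DATE_ORDER[lang] == "MD":
            month, day = first, second
        else:
            month, day = second, first
        try:
            return datetime(year, month, day)
        except ValueError:
            # Invalid under this language's order (e.g. month 25); try the next one.
            continue
    return None


print(parse_day_month("11/12", ["es", "en"]))  # Spanish first: 11 December
print(parse_day_month("11/12", ["en", "es"]))  # English first: 12 November
print(parse_day_month("25/12", ["en", "es"]))  # English order is invalid, Spanish is used
```

In this model a later language is only consulted when the earlier ones fail, which is the prioritization I expected from the `languages` list.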

ivanprado avatar Aug 21 '20 11:08 ivanprado

Hi @ivanprado!

The order currently used is the one defined in dateparser/dateparser/data/languages_info.py. FYI, this order is being questioned in this issue (https://github.com/scrapinghub/dateparser/issues/714) and it will probably change.

However, I agree that people using the languages parameter could expect it to respect the defined order. So we should probably change this behavior or document it.

This could be addressed by adding a setting (called, for example, USE_GIVEN_LANGUAGE_ORDER) and possibly making it True by default. The logic to implement this is practically finished, as we already have the use_given_order property when creating the DateDataParser object, so we only need to add a small amount of code and tests to make this setting work.

I will tag this as good_first_issue and I expect this to be solved before the end of October (or should I say "Hacktoberfest"?) :slightly_smiling_face:

Thank you for your comment! :smile:

noviluni avatar Aug 25 '20 14:08 noviluni

Thank you @noviluni, this sounds great :smile: .

Just to give you more context: the idea is that the languages provided to dateparser can come from a language detector model run over the page (this is my case). This gives you a list of languages ordered by probability: the first one is the most probable, then the second one, etc.

So the idea is to give them to dateparser as a language hint, and the order is important in this case. Probably USE_GIVEN_LANGUAGE_ORDER will help in the case I'm describing. :+1:
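As a minimal sketch of that workflow, assuming the detector returns a mapping from language code to probability: the `languages_by_probability` helper, the `min_probability` cutoff and the sample scores below are made up for illustration and are not part of dateparser or of any particular detector.

```python
def languages_by_probability(scores, min_probability=0.0):
    """Return language codes sorted from most to least probable."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [lang for lang, prob in ranked if prob >= min_probability]


# Made-up detector output for a page that is mostly Spanish.
detected = {"en": 0.25, "es": 0.70, "pt": 0.05}

print(languages_by_probability(detected))       # ['es', 'en', 'pt']
print(languages_by_probability(detected, 0.1))  # ['es', 'en']
```

The resulting list is what would be passed as the `languages` argument, which is why the order needs to be respected for the hint to be useful.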

ivanprado avatar Aug 25 '20 15:08 ivanprado

Hi @noviluni @ivanprado ,

I added a PR for this. Please take a look and let me know what you think about the implementation; if it looks good, I will fix it up and add some tests for it as well.

@noviluni, I believe the default should be False, so that we still make use of the list of most common languages by default.

dariuschira avatar Sep 17 '20 15:09 dariuschira

We can close this issue because it was fixed in https://github.com/scrapinghub/dateparser/pull/805 and https://github.com/scrapinghub/dateparser/issues/845

In [1]: import dateparser

In [2]: dateparser.parse("11/12", languages=['en'])
Out[2]: datetime.datetime(2022, 11, 12, 0, 0)

In [3]: dateparser.parse("11/12", languages=['es'])
Out[3]: datetime.datetime(2022, 12, 11, 0, 0)

In [4]: dateparser.parse("11/12", languages=['es', 'en'])
Out[4]: datetime.datetime(2022, 11, 12, 0, 0)

In [5]: dateparser.parse("11/12", languages=['en', 'es'])
Out[5]: datetime.datetime(2022, 11, 12, 0, 0)

In [6]: dateparser.__version__
Out[6]: '1.1.4'

serhii73 avatar Dec 01 '22 08:12 serhii73

@serhii73 Your output shows that it is still not fixed. Out[4] will match Out[3] once the issue is fixed.

Gallaecio avatar Dec 01 '22 09:12 Gallaecio

Yes, you're right @Gallaecio

serhii73 avatar Dec 01 '22 10:12 serhii73