Italian date with format "%a, %d %b %Y %H:%M:%S %z" is not working

Open osvill opened this issue 3 years ago • 4 comments

I tried the following code with dateparser versions 1.0.0, 1.1.0, and 1.1.1 on Windows with Python 3.7.4, and all three versions give the same output of None:

dateparser.parse('mar, 07 giu 2022 08:56:47 +0200')
dateparser.parse('mar, 07 giu 2022 08:56:47 +0200', date_formats=['%a, %d %b %Y %H:%M:%S %z'])
dateparser.parse('mar, 07 giu 2022 08:56:47 +0200', date_formats=['%a, %d %b %Y %H:%M:%S %z'], languages=['it'])

What am I missing?

osvill, Jun 08 '22 20:06

Hi @osvill

From my understanding, some part of the date string is not in the form dateparser expects.

For example, I tried the calls you provided and did get None as output. However, when I removed the weekday, it returned the expected output:

>>> dateparser.parse('07 giu 2022 08:56:47 +0200')
datetime.datetime(2022, 6, 7, 8, 56, 47, tzinfo=<StaticTzInfo 'UTC\+02:00'>)

So the issue seems to lie with the weekday. I looked up its Italian translation: mar is apparently meant to represent Tuesday, but the translation I found was the full word martedì. With that full weekday name, I get the expected result:

>>> dateparser.parse('martedì, 07 giu 2022 08:56:47 +0200')
datetime.datetime(2022, 6, 7, 8, 56, 47, tzinfo=<StaticTzInfo 'UTC\+02:00'>)
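
Since the string parses once the weekday is gone, one possible workaround is to drop a leading weekday token before handing the string to dateparser. The sketch below is only an illustration of that idea: `strip_leading_weekday` is a hypothetical helper (not part of dateparser), and it assumes the weekday is the first word and is followed by a comma, as in the strings from this issue.

```python
import re

# A run of letters (optionally ending in a dot) followed by a comma at the
# very start of the string. [^\W\d_] matches any Unicode letter, so accented
# names such as "martedì" are covered too.
_LEADING_WEEKDAY = re.compile(r"^\s*[^\W\d_]+\.?,\s*")


def strip_leading_weekday(date_string):
    """Remove a leading 'weekday,' token; return the string unchanged if absent."""
    return _LEADING_WEEKDAY.sub("", date_string, count=1)


print(strip_leading_weekday("mar, 07 giu 2022 08:56:47 +0200"))
# prints: 07 giu 2022 08:56:47 +0200
```

The stripped string is the same one that parsed successfully in the first session above, so it can be passed to dateparser.parse as-is. Note that this throws the weekday away instead of validating it against the date.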

Let me know if it was helpful.

gutsytechster, Jun 16 '22 16:06

I experimented with this example a bit more. When you pass the date_formats parameter to the parse method, the call goes through this function in the source and eventually runs the datetime.strptime() method.

I realized that datetime.strptime() uses the default locale set on your system, which for me was en_EN. When I tried the correctly translated input on my system, it failed with a ValueError.

>>> datetime.strptime('martedì, 07 giu 2022 08:56:47 +0200', '%a, %d %b %Y %H:%M:%S %z')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.10/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/usr/lib64/python3.10/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %

However, it returned the expected output when I wrote the input string in English, matching my system locale.

>>> datetime.strptime('Tue, 07 Jun 2022', '%a, %d %b %Y')
datetime.datetime(2022, 6, 7, 0, 0)
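
To see which abbreviations %a and %b will accept on a given machine, the names of the active LC_TIME locale can be listed through the standard calendar module, which strptime relies on for its weekday and month names. This is a small sketch; `accepted_abbreviations` is a helper name made up for this example.

```python
import calendar
import locale


def accepted_abbreviations():
    """Return (weekdays, months) abbreviations of the active LC_TIME locale, lowercased."""
    weekdays = [name.lower() for name in calendar.day_abbr]
    # calendar.month_abbr has an empty string at index 0, so skip it.
    months = [name.lower() for name in calendar.month_abbr if name]
    return weekdays, months


print(locale.getlocale(locale.LC_TIME))  # the locale strptime will use
weekdays, months = accepted_abbreviations()
print(weekdays)
print(months)
```

If mar and giu do not show up in these lists, strptime cannot match them with %a and %b under the current locale.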

I then ran a small experiment: I changed the locale and observed the behaviour, using the following steps:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'it_IT.UTF-8')
'it_IT.UTF-8'
>>> locale.getlocale()
('it_IT', 'UTF-8')

Now even the abbreviated weekday (which the online translation had led me to think was incorrect) gives the correct result:

>>> date_string = 'mar, 07 giu 2022 08:56:47 +0200'
>>> datetime.strptime(date_string, '%a, %d %b %Y %H:%M:%S %z')
datetime.datetime(2022, 6, 7, 8, 56, 47, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))
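
Changing LC_ALL for the whole process affects everything else that runs in it, so for experiments like this it may be safer to switch only LC_TIME and restore it afterwards. The following is a minimal sketch using just the standard library; `temporary_locale` and `parse_with_locale` are names invented for this example, and locale names such as 'it_IT.UTF-8' are platform dependent (Windows uses different names, and the locale has to be installed).

```python
import locale
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def temporary_locale(name, category=locale.LC_TIME):
    """Switch `category` to `name` inside the with-block, then restore it."""
    previous = locale.setlocale(category)  # no second argument: query only
    try:
        # Raises locale.Error if the locale is not available on this system.
        locale.setlocale(category, name)
        yield
    finally:
        locale.setlocale(category, previous)


def parse_with_locale(date_string, fmt, locale_name):
    """Run datetime.strptime with LC_TIME temporarily set to `locale_name`."""
    with temporary_locale(locale_name):
        return datetime.strptime(date_string, fmt)


# The 'C' locale is always available and uses English names.
print(parse_with_locale('Tue, 07 Jun 2022', '%a, %d %b %Y', 'C'))
# prints: 2022-06-07 00:00:00
```

On a system where it_IT.UTF-8 is installed, calling parse_with_locale with the original Italian string, the format '%a, %d %b %Y %H:%M:%S %z', and 'it_IT.UTF-8' should correspond to the session above. setlocale is not thread-safe, so this is meant for experiments, not for concurrent code.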

So one thing I don't understand: if datetime.strptime() returns the correct result once the locale is set to Italian, why does dateparser fail when it applies the Italian locale to the same string?

@Gallaecio any thoughts?

gutsytechster, Jun 16 '22 16:06

I am not familiar enough with the code base to answer that :disappointed:

Gallaecio, Jun 16 '22 18:06

@gutsytechster Thanks a lot for your research. Your work pointed me in the right direction, and I may have found the culprit. Further down the call chain, this function returns the translated date string as 'march 07 june 2022 08:56:47 +0200'. This shows that mar, the short form of martedì (like Tue for Tuesday), is not translated correctly. Instead, it is taken to stand for marzo (March), as you can see here. (Note: mar appears a second time further down, for Tuesday, as you can see here.) I didn't have enough time to dig deeper, but I expected the format '%a, %d %b %Y %H:%M:%S %z' to prevent mar from being read as a month, because of the %a.

Here is some code to reproduce my results:

from dateparser.conf import settings as sts
from dateparser.date import DateDataParser, _DateLocaleParser


date_string = 'mar, 07 giu 2022 08:56:47 +0200'
fmt = '%a, %d %b %Y %H:%M:%S %z'
ddp = DateDataParser(languages=['it'], settings=sts)

locale_list = list(ddp._get_applicable_locales(date_string))

dlp = _DateLocaleParser(locale_list[0], date_string, [fmt], settings=sts)

td = dlp._get_translated_date_with_formatting()

print(td)

# Translation
dictionary = dlp.locale._get_dictionary(sts)
date_string_tokens = dictionary.split(date_string, keep_format=True)
relative_translations = dlp.locale._get_relative_translations(settings=sts)

for i, word in enumerate(date_string_tokens):
    word = word.lower()
    print(f'word= {word}')
    for pattern, replacement in relative_translations.items():
        if pattern.match(word):
            print(f'pattern #{i} matched for word {word}.')
            date_string_tokens[i] = pattern.sub(replacement, word)
            break  # without this break, the for/else branch below would always run
    else:
        # no relative pattern matched: fall back to the plain dictionary lookup
        if word in dictionary:
            print(f'word {word} in dictionary')
            print(f'dictionary[word]= {dictionary[word]}')
            date_string_tokens[i] = dictionary[word] or ''

Overview of the function calls made by dateparser.parse(..):

dateparser.parse
|- get_date_data: dateparser/__init__.py line 61
   |- _DateLocaleParser.parse: dateparser/date.py line 428
      |- _parse: dateparser/date.py line 180
         |- self._try_given_formats: dateparser/date.py line 170
            |- self._get_translated_date_with_formatting: dateparser/date.py line 228
               |- self.locale.translate: dateparser/date.py line 240
                  |- translate: dateparser/languages/locale.py line 110
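
To make the mar ambiguity concrete, here is a toy model of the problem. It is not dateparser's code or data: the two dictionaries and both functions are hypothetical and heavily simplified. It only shows how a single merged lookup table loses the weekday meaning of mar, and how using the format directives could keep it.

```python
# Hypothetical, simplified stand-ins for a locale's translation data.
MONTHS = {"mar": "march", "giu": "june"}
WEEKDAYS = {"lun": "monday", "mar": "tuesday"}


def translate_naive(tokens):
    """Translate with one merged table; on a key clash the month entry wins."""
    merged = {**WEEKDAYS, **MONTHS}
    return [merged.get(token.lower(), token) for token in tokens]


def translate_with_format(tokens, directives):
    """Pick the lookup table from the strptime directive at the same position."""
    tables = {"%a": WEEKDAYS, "%b": MONTHS}
    translated = []
    for token, directive in zip(tokens, directives):
        table = tables.get(directive, {})
        translated.append(table.get(token.lower(), token))
    return translated


tokens = ["mar", "07", "giu", "2022"]
directives = ["%a", "%d", "%b", "%Y"]

print(translate_naive(tokens))
# prints: ['march', '07', 'june', '2022']
print(translate_with_format(tokens, directives))
# prints: ['tuesday', '07', 'june', '2022']
```

The first output mirrors the 'march 07 june 2022 ...' string reported above; the second is the behaviour I expected given the %a in the format. Whether dateparser can pair tokens with directives this simply is an open question.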

osvill, Jun 17 '22 11:06