"dateparser.parse" func doesn't work well in docker container env

Open ferstar opened this issue 3 years ago • 8 comments

Here is my sample test code:

import dateparser

if __name__ == '__main__':
    s = '2019年5月30日'
    date = dateparser.parse(s)
    assert date is not None and date.year == 2019 and date.month == 5 and date.day == 30

The code above works as expected on any host (Windows/Linux/macOS), but it fails when run inside a Docker container.

My Dockerfile:

FROM ubuntu:18.04

COPY ./test.py /root/test.py

ENV DEBIAN_FRONTEND=noninteractive \
    LANG=en_US.UTF-8

RUN apt-get update -y && \
    apt-get install --no-install-recommends -y vim.tiny tzdata locales python3.6 python3-pip python3.6-dev && \
    ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && \
    locale-gen en_US.UTF-8 && \
    pip3 install --upgrade pip && \
    pip3 install --no-cache-dir dateparser && \
    rm -rf /var/lib/apt/lists/* && \
    apt clean


CMD ["sh", "-c", "/usr/bin/python3 /root/test.py"]

build & run command:

docker build -t py3/test . && docker run py3/test

error msg:

Traceback (most recent call last):
  File "/root/test.py", line 7, in <module>
    assert date is not None and date.year == 2021 and date.month == 3 and date.day == 12
AssertionError

It seems that dateparser.parse(s) returns None.

ferstar avatar Mar 12 '21 08:03 ferstar

Is the same Dateparser version run locally and in the Docker image? Is the Python version the same? Does it work with the master branch of Dateparser? Does it work with a more recent version of Ubuntu configured in the Docker image?

Gallaecio avatar Mar 12 '21 10:03 Gallaecio

  1. Is the same Dateparser version run locally and in the Docker image? Yes, both the host and the Docker container use Dateparser 1.0.0, of course.
  2. Is the Python version the same? Yes, Python 3.6.9.
  3. Does it work with the master branch of Dateparser? No.
  4. Does it work with a more recent version of Ubuntu configured in the Docker image? No.

BTW, all of your questions above can be answered in my Dockerfile

ferstar avatar Mar 12 '21 11:03 ferstar

Well, after several hours of debugging, I found the reason:

https://github.com/scrapinghub/dateparser/blob/43fbc6a39f46a05ed2c74091275f4a198eb32ce6/dateparser/freshness_date_parser.py#L93

calls this function:

https://github.com/scrapinghub/dateparser/blob/43fbc6a39f46a05ed2c74091275f4a198eb32ce6/dateparser/freshness_date_parser.py#L41

and the get_localzone() call there raises a ValueError:

  File "/usr/local/lib/python3.6/dist-packages/dateparser/freshness_date_parser.py", line 43, in get_local_tz
    return get_localzone()
  File "/usr/local/lib/python3.6/dist-packages/tzlocal/unix.py", line 165, in get_localzone
    _cache_tz = _get_localzone()
  File "/usr/local/lib/python3.6/dist-packages/tzlocal/unix.py", line 90, in _get_localzone
    utils.assert_tz_offset(tz)
  File "/usr/local/lib/python3.6/dist-packages/tzlocal/utils.py", line 46, in assert_tz_offset
    raise ValueError(msg)
ValueError: Timezone offset does not match system offset: 0 != 28800. Please, check your config files.

The exception above is caught by this line:

https://github.com/scrapinghub/dateparser/blob/43fbc6a39f46a05ed2c74091275f4a198eb32ce6/dateparser/date.py#L196

which causes dateparser.parse to return None.

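So the solution is very simple:

Keep the configured timezone consistent with the system offset. The Dockerfile above only relinks /etc/localtime, so also write the zone name to /etc/timezone (or set TZ):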

ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
echo "Asia/Shanghai"> /etc/timezone
# or export TZ=Asia/Shanghai
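
To check whether a container has this kind of mismatch, here is a minimal sketch using only the standard library. The helper names are hypothetical, and it only approximates the comparison that the error message describes (the offset of the configured zone versus the offset the system clock reports); it is not tzlocal's actual code.

```python
import calendar
import time
from datetime import datetime, timedelta, timezone


def system_utc_offset(now=None):
    """Seconds east of UTC according to the C library's local time."""
    now = time.time() if now is None else now
    # timegm() treats both tuples as UTC, so the difference is the local offset.
    return calendar.timegm(time.localtime(now)) - calendar.timegm(time.gmtime(now))


def configured_utc_offset(tz, when=None):
    """Seconds east of UTC for the tzinfo object a configured zone name resolves to."""
    when = datetime.now(timezone.utc) if when is None else when
    return int(when.astimezone(tz).utcoffset().total_seconds())


def offsets_agree(tz, system_offset=None):
    """True when the configured zone and the system clock report the same offset."""
    if system_offset is None:
        system_offset = system_utc_offset()
    return configured_utc_offset(tz) == system_offset


if __name__ == "__main__":
    # Mirrors the traceback: the zone name resolves to UTC (offset 0)
    # while the system clock reports +08:00 (28800 seconds).
    print(offsets_agree(timezone.utc, system_offset=28800))
    # After the fix, both sides describe the same +08:00 offset.
    print(offsets_agree(timezone(timedelta(hours=8)), system_offset=28800))
    # What this machine currently reports:
    print(system_utc_offset())
```

In a real container you would pass the zone named in /etc/timezone (for example via zoneinfo.ZoneInfo on Python 3.9+) instead of the fixed offsets used here.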

ferstar avatar Mar 12 '21 11:03 ferstar

Could this be the reason behind this? 🤔 https://github.com/scrapinghub/dateparser/issues/857

noviluni avatar Mar 12 '21 11:03 noviluni

Yep, and I recommend logging the exception here instead of swallowing it and returning None :-)

https://github.com/scrapinghub/dateparser/blob/43fbc6a39f46a05ed2c74091275f4a198eb32ce6/dateparser/date.py#L196

...
    def _try_freshness_parser(self):
        try:
            return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
        except (OverflowError, ValueError) as exp:
            self.get_logger().error(str(exp))  # warn users to check their system TZ config
            return None
...

ferstar avatar Mar 12 '21 11:03 ferstar

Thanks for your debugging @ferstar, saved me a LOT of time with this issue. Completely agree, simply returning None is very confusing for the user here, and some log statement would be more appropriate.

From a user standpoint, if you're working inside a container or deploying on a cluster, it's probably safer to work with UTC, or with a specific, known timezone if that's what your application needs. This means the above issue (using your example) can be mitigated with the following:

from datetime import datetime
from zoneinfo import ZoneInfo  # standard library from Python 3.9 onwards

import dateparser

if __name__ == '__main__':
    s = '2019年5月30日'
    # Pin an explicit timezone instead of relying on the container's local timezone config.
    tz = ZoneInfo('Asia/Shanghai')
    relative_base = datetime.now(tz=tz)
    settings = {
        'RELATIVE_BASE': relative_base,
        'TIMEZONE': str(tz),  # str() of a ZoneInfo gives its key, 'Asia/Shanghai'
    }
    date = dateparser.parse(s, settings=settings)
    assert date is not None and date.year == 2019 and date.month == 5 and date.day == 30
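
If the image itself cannot be changed, the TZ route mentioned earlier in the thread can also be taken from inside the process. A minimal sketch, assuming a Unix system (time.tzset() is not available on Windows) and that it runs before dateparser resolves the local timezone. pin_process_timezone is a hypothetical helper name, and whether your installed tzlocal version honors TZ is worth verifying.

```python
import os
import time


def pin_process_timezone(name="UTC"):
    """Set TZ for the current process and return the resulting UTC offset in seconds."""
    os.environ["TZ"] = name
    time.tzset()  # Unix only: makes the C library re-read TZ
    return time.localtime().tm_gmtoff


if __name__ == "__main__":
    offset = pin_process_timezone("UTC")
    print(f"local UTC offset is now {offset} seconds")
```

Setting ENV TZ=... in the Dockerfile achieves the same thing for every process in the container, which is usually the less surprising choice.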

jamescw19 avatar Dec 16 '21 11:12 jamescw19

@jammie19 Nice solution, thx

ferstar avatar Jan 15 '22 08:01 ferstar