dateparser
"dateparser.parse" does not work correctly in a Docker container environment
Here is my sample test code:
import dateparser

if __name__ == '__main__':
    s = '2019年5月30日'
    date = dateparser.parse(s)
    assert date is not None and date.year == 2019 and date.month == 5 and date.day == 30
The code above works as expected on any host (Windows/Linux/macOS), but it fails when run inside a Docker container.
My Dockerfile:
FROM ubuntu:18.04
COPY ./test.py /root/test.py
ENV DEBIAN_FRONTEND=noninteractive \
    LANG=en_US.UTF-8
RUN apt-get update -y && \
    apt-get install --no-install-recommends -y vim.tiny tzdata locales python3.6 python3-pip python3.6-dev && \
    ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && \
    locale-gen en_US.UTF-8 && \
    pip3 install --upgrade pip && \
    pip3 install --no-cache-dir dateparser && \
    rm -rf /var/lib/apt/lists/* && \
    apt clean
CMD ["sh", "-c", "/usr/bin/python3 /root/test.py"]
build & run command:
docker build -t py3/test . && docker run py3/test
error msg:
Traceback (most recent call last):
  File "/root/test.py", line 7, in <module>
    assert date is not None and date.year == 2021 and date.month == 3 and date.day == 12
AssertionError
It seems that dateparser.parse(s) returns None.
Is the same Dateparser version run locally and in the Docker image? Is the Python version the same? Does it work with the master branch of Dateparser? Does it work with a more recent version of Ubuntu configured in the Docker image?
- Is the same Dateparser version run locally and in the Docker image? Yes, both the host and the Docker container use dateparser 1.0.0, of course.
- Is the Python version the same? Yes, Python 3.6.9.
- Does it work with the master branch of Dateparser? No.
- Does it work with a more recent version of Ubuntu configured in the Docker image? No.
BTW, all of your questions above can be answered from my Dockerfile.
Well, after several hours of debugging, I found the reason. This line:
https://github.com/scrapinghub/dateparser/blob/43fbc6a39f46a05ed2c74091275f4a198eb32ce6/dateparser/freshness_date_parser.py#L93
calls this function:
https://github.com/scrapinghub/dateparser/blob/43fbc6a39f46a05ed2c74091275f4a198eb32ce6/dateparser/freshness_date_parser.py#L41
and get_localzone() raises a ValueError exception:
  File "/usr/local/lib/python3.6/dist-packages/dateparser/freshness_date_parser.py", line 43, in get_local_tz
    return get_localzone()
  File "/usr/local/lib/python3.6/dist-packages/tzlocal/unix.py", line 165, in get_localzone
    _cache_tz = _get_localzone()
  File "/usr/local/lib/python3.6/dist-packages/tzlocal/unix.py", line 90, in _get_localzone
    utils.assert_tz_offset(tz)
  File "/usr/local/lib/python3.6/dist-packages/tzlocal/utils.py", line 46, in assert_tz_offset
    raise ValueError(msg)
ValueError: Timezone offset does not match system offset: 0 != 28800. Please, check your config files.
The exception above is caught by this line:
https://github.com/scrapinghub/dateparser/blob/43fbc6a39f46a05ed2c74091275f4a198eb32ce6/dateparser/date.py#L196
which causes None to be returned.
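To make the failure mode concrete, here is a minimal stdlib-only sketch of the kind of comparison that the traceback points at. It is an approximation written for illustration, not tzlocal's actual source: the function names `system_offset_seconds`, `tz_offset_seconds`, and `assert_offsets_match` are hypothetical, and the `system_offset` parameter exists only so the mismatch can be reproduced on any machine.

```python
import time
from datetime import datetime, timedelta, timezone


def system_offset_seconds():
    """UTC offset, in seconds, that the C library reports for local time."""
    if time.daylight and time.localtime().tm_isdst > 0:
        return -time.altzone
    return -time.timezone


def tz_offset_seconds(tz):
    """Current UTC offset, in seconds, of a tzinfo object."""
    return int(datetime.now(tz).utcoffset().total_seconds())


def assert_offsets_match(tz, system_offset=None):
    """Raise ValueError when tz disagrees with the system offset.

    system_offset is a hypothetical override used for demonstration;
    when omitted, the real system offset is looked up.
    """
    if system_offset is None:
        system_offset = system_offset_seconds()
    tz_offset = tz_offset_seconds(tz)
    if tz_offset != system_offset:
        raise ValueError(
            "Timezone offset does not match system offset: "
            "{} != {}.".format(tz_offset, system_offset)
        )


if __name__ == "__main__":
    shanghai_offset = 8 * 60 * 60  # 28800 seconds, as in the error message

    # Consistent configuration: a +08:00 zone against a +08:00 system offset.
    assert_offsets_match(timezone(timedelta(hours=8)), system_offset=shanghai_offset)

    # Inconsistent configuration: a UTC zone against a +08:00 system offset.
    try:
        assert_offsets_match(timezone.utc, system_offset=shanghai_offset)
    except ValueError as exc:
        print(exc)
```

The second call reproduces the `0 != 28800` mismatch from the traceback: one source of timezone information reports UTC while the system clock is at +08:00.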
So, the solution is very simple:
Keep the configured timezone offset the same as the system offset:
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
echo "Asia/Shanghai" > /etc/timezone
# or export TZ=Asia/Shanghai
Could this be the reason behind this? 🤔 https://github.com/scrapinghub/dateparser/issues/857
Yep, and I recommend logging the exception here instead of swallowing it and returning None :-)
https://github.com/scrapinghub/dateparser/blob/43fbc6a39f46a05ed2c74091275f4a198eb32ce6/dateparser/date.py#L196
...
    def _try_freshness_parser(self):
        try:
            return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
        except (OverflowError, ValueError) as exp:
            # Log the error so users are warned to check their system TZ config.
            self.get_logger().error(str(exp))
            return None
...
Thanks for your debugging @ferstar, saved me a LOT of time with this issue.
Completely agree, simply returning None is very confusing for the user here, and some log statement would be more appropriate.
From a user standpoint, if you're working inside a container or deploying on a cluster, it's probably safer to work with UTC, or with a specific, known timezone if that's what your application needs. This means the above issue (using your example) can be mitigated with the following:
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library from Python 3.9 onwards

import dateparser

if __name__ == '__main__':
    s = '2019年5月30日'
    tz = ZoneInfo('Asia/Shanghai')
    relative_base = datetime.now(tz=tz)
    settings = {
        'RELATIVE_BASE': relative_base,
        'TIMEZONE': str(tz)
    }
    date = dateparser.parse(s, settings=settings)
    assert date is not None and date.year == 2019 and date.month == 5 and date.day == 30
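Since zoneinfo is only in the standard library from Python 3.9, a UTC variant of the same idea for older interpreters (such as the 3.6.9 mentioned earlier in the thread) might look like the sketch below. The helper name `utc_settings` is hypothetical, and the sketch assumes, as the snippet above does, that supplying both RELATIVE_BASE and TIMEZONE keeps dateparser from looking up the container's local zone; I have not verified that against every dateparser version.

```python
from datetime import datetime, timedelta, timezone


def utc_settings():
    """Build dateparser settings pinned to UTC (hypothetical helper, stdlib only)."""
    return {
        # Timezone-aware "now" in UTC, used as the base for relative dates.
        "RELATIVE_BASE": datetime.now(timezone.utc),
        # Explicit zone name instead of relying on the container's local zone.
        "TIMEZONE": "UTC",
    }


if __name__ == "__main__":
    try:
        import dateparser  # third-party; the demo is skipped when not installed
    except ImportError:
        dateparser = None

    if dateparser is not None:
        date = dateparser.parse("2019年5月30日", settings=utc_settings())
        print(date)
```

Calling the helper per parse keeps RELATIVE_BASE current; for a long-running service that is usually preferable to building the settings once at import time.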
@jammie19 Nice solution, thx