Parse unix timestamps consistently regardless of timezones

Open onlynone opened this issue 2 years ago • 2 comments

When dateparser parses a unix timestamp, and the user has set a TIMEZONE that is not local (or the same as the user's system local timezone), it gets the wrong time. This is because datetime.fromtimestamp(timestamp) will return a datetime in the user's local timezone as configured by their system (influenced by the TZ env var, /etc/localtime, etc). But, it will be a naive datetime without any timezone information . The dateparser code will try to interpret that datetime object as if it was in the settings.TIMEZONE timezone, which is incorrect.

The fix is to use datetime.fromtimestamp(timestamp, timezone_that_matches_settings_TIMEZONE) to get a timezone aware datetime object with a timezone that matches dateparser's settings.TIMEZONE (or use the local timezone if that is unset). Then strip the tzinfo from that datetime object, because apply_timezone_from_settings(date_obj, settings) requires date_obj to be a naive datetime without timezone information. So when apply_timezone_from_settings interprets that object, it does so correctly.

Jul 27 '21 22:07 onlynone

Codecov Report

Merging #954 (6934648) into master (0ed979e) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #954   +/-   ##
=======================================
  Coverage   98.29%   98.29%           
=======================================
  Files         234      234           
  Lines        2702     2705    +3     
=======================================
+ Hits         2656     2659    +3     
  Misses         46       46

Impacted Files	Coverage Δ
dateparser/date.py	`99.27% <100.00%> (+<0.01%)`	:arrow_up:
dateparser/utils/__init__.py	`95.27% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Jul 27 '21 23:07 codecov[bot]

By the way, when I was adding tests, I tried to follow the pattern of the other tests in TestTimestampParser by calling date.get_date_from_timestamp directly, and passing in a custom value for settings. I also saw a pattern for creating a new Settings object, in another test in tests/test_date.py of doing something like:

        settings_mod = copy(settings)
        settings_mod.RETURN_AS_TIMEZONE_AWARE = True
        settings_mod.TIMEZONE = timezone
        settings_mod.TO_TIMEZONE = to_timezone

        date.get_date_from_timestamp(timestamp, settings_mod)

But, when I do that, it seems like I end up modifying the global settings object imported from dateparser.conf and that caused other tests to fail because they weren't expecting RETURN_AS_TIMEZONE_AWARE to be True. Is dateparser.conf.Settings a singleton? Should the other test that tries to do a copy be fixed? Should there be a more standard way to create a custom/one-off settings object for use in tests?

Jul 28 '21 15:07 onlynone

@Gallaecio Is this good to merge?

Aug 26 '22 14:08 onlynone

TBH, I am not able to understand the issue here. Maybe an example may help.

But if we are applying the timezone and stripping it off, to eventually apply the timezone from settings through apply_timezone_from_settings , how does the behavior change?

Also the statement in the initial explanation

The dateparser code will try to interpret that datetime object as if it was in the settings.TIMEZONE timezone, which is incorrect

Where do we see that behavior, since the apply_tiemzone_from_settings would apply the timezone from the settings, I believe?

Aug 31 '22 14:08 gutsytechster

The examples in the added tests should show the issue. You could try running those tests on the main line code without my changes.

But if we are applying the timezone and stripping it off, to eventually apply the timezone from settings through apply_timezone_from_settings , how does the behavior change?

The problem is that there's not just one time zone that's being stripped and reapplied. The date is initially constructed with the time zone of the system (regardless of any dateparser settings), but then that time zone is not kept on the datetime. Then later in the code, the datetime is interpreted as if it was in dateparser's settings.TIMEZONE. If the system's time zone happens to be the same as the one used for dateparser's settings, then I think everything is fine. But if my system is set to 'America/New_York', and I set the dateparser settings time zone to 'UTC' and ask it to parse '1661996156', I believe dateparser will currently give me 'August 31, 2022 9:35:56 PM UTC' when it should give me 'September 1, 2022 1:35:56 AM UTC'.

Sorry if there are any errors in the above comment, I'm on mobile and it's also been a while since I first made this branch. Please test it the examples and see if what I'm saying makes sense.

Sep 01 '22 01:09 onlynone

Where do we see that behavior, since the apply_tiemzone_from_settings would apply the timezone from the settings, I believe?

Yes, but with the new code, the naive datetime object is created in the same time zone as settings.TIMEZONE. so that later, when it's interpreted as if it was in settings.TIMEZONE, that's now a correct assumption. With the old code, the naive datetime is created relative to the system's time zone.

Sep 01 '22 01:09 onlynone

Here's an example. With the current code:

>>> dateparser.parse('1661996156', settings={'TIMEZONE': 'UTC', 'TO_TIMEZONE': 'UTC', 'RETURN_AS_TIMEZONE_AWARE': True}).timestamp()
1661981756.0

The returned unix timestamp doesn't match the original timestamp. And with my branch:

>>> dateparser.parse('1661996156', settings={'TIMEZONE': 'UTC', 'TO_TIMEZONE': 'UTC', 'RETURN_AS_TIMEZONE_AWARE': True}).timestamp()
1661996156.0

It does match.

Sep 01 '22 13:09 onlynone

Thanks @onlynone!

It's clear now. I understand where the difference comes from.

As far as I understood, previous to, when the date goes to the apply_timezone_from_settings, it got converted into a date object with a local timezone(without timezone included), and that date is further used for applying timezone from the mentioned method.

Sep 02 '22 14:09 gutsytechster

@gutsytechster @Gallaecio I think everything has been resolved, can it be merged?

Sep 16 '22 14:09 onlynone

Thank you!

Sep 20 '22 06:09 Gallaecio

Using an old version of tzlocal, 2.1 made dateparser return Local Mean Time instead of the correct timezone as the result of this change. After they switched away from pytz to zoneinfo that problem went away. You might want to bump your minimum tzlocal version dependency.

Oct 22 '22 21:10 ghazel

dateparser dateparser copied to clipboard

Parse unix timestamps consistently regardless of timezones

Codecov Report

dateparser
dateparser copied to clipboard