dst() occasionally returns a possibly too-high value

Open pganssle opened this issue 5 years ago • 2 comments

Running this script with pytz==2020.1 and backports.zoneinfo==0.2.0, I find 34 examples of zones where pytz reports DST of > 1 hour, while backports.zoneinfo reports DST of 1 hour:

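import pytz
from backports import zoneinfo
from datetime import datetime, timedelta

# Maps zone key -> years in which pytz reports |dst()| > 1 hour
# and disagrees with backports.zoneinfo.
discrepancy_keys = {}
for key in pytz.all_timezones:
    pz = pytz.timezone(key)
    zz = zoneinfo.ZoneInfo(key)
    for year in range(1900, 2040):
        # Sample dst() on January 1st in both libraries.
        pd_winter = pz.localize(datetime(year, 1, 1)).dst()
        zd_winter = datetime(year, 1, 1, tzinfo=zz).dst()

        # Sample dst() on July 1st in both libraries.
        pd_summer = pz.localize(datetime(year, 7, 1)).dst()
        zd_summer = datetime(year, 7, 1, tzinfo=zz).dst()

        # The two libraries agree: nothing to record.
        if pd_winter == zd_winter and pd_summer == zd_summer:
            continue

        # They disagree and pytz reports more than 1 hour of DST.
        if abs(pd_winter) > timedelta(hours=1) or abs(pd_summer) > timedelta(hours=1):
            discrepancy_keys.setdefault(key, []).append(year)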

Here's the list:

{'America/Argentina/Catamarca': [1992],
 'America/Argentina/ComodRivadavia': [1992],
 'America/Argentina/Cordoba': [1992],
 'America/Argentina/Jujuy': [1992],
 'America/Argentina/Mendoza': [1993],
 'America/Argentina/Salta': [1992],
 'America/Argentina/Tucuman': [1992],
 'America/Catamarca': [1992],
 'America/Cordoba': [1992],
 'America/Indiana/Winamac': [2007],
 'America/Jujuy': [1992],
 'America/Mendoza': [1993],
 'America/Rosario': [1992],
 'Atlantic/Azores': [1942, 1943, 1944, 1945],
 'Atlantic/Madeira': [1942, 1943, 1944, 1945],
 'Europe/Belfast': [1941, 1942, 1943, 1944, 1945, 1947],
 'Europe/Berlin': [1945],
 'Europe/Brussels': [1940, 1941, 1942],
 'Europe/Gibraltar': [1941, 1942, 1943, 1944, 1945, 1947],
 'Europe/Guernsey': [1941, 1942, 1943, 1944, 1945, 1947],
 'Europe/Isle_of_Man': [1941, 1942, 1943, 1944, 1945, 1947],
 'Europe/Jersey': [1941, 1942, 1943, 1944, 1945, 1947],
 'Europe/Lisbon': [1942, 1943, 1944, 1945],
 'Europe/London': [1941, 1942, 1943, 1944, 1945, 1947],
 'Europe/Luxembourg': [1940, 1941, 1942],
 'Europe/Madrid': [1938],
 'Europe/Monaco': [1941, 1942, 1943, 1944, 1945],
 'Europe/Moscow': [1921],
 'Europe/Paris': [1940, 1941, 1942],
 'Europe/Simferopol': [1994],
 'GB': [1941, 1942, 1943, 1944, 1945, 1947],
 'GB-Eire': [1941, 1942, 1943, 1944, 1945, 1947],
 'Portugal': [1942, 1943, 1944, 1945],
 'W-SU': [1921]}

Looking at one example, America/Cordoba in 1991-1992: it looks like they changed their base offset during the 1991 transition¹, so they jumped ahead 2 hours, then jumped back only 1 hour at the next transition. After that they had one more DST period, ending at -03, and then stopped doing DST:

America/Cordoba  Sun Mar  3 01:59:59 1991 UT = Sat Mar  2 23:59:59 1991 -02 isdst=1 gmtoff=-7200
America/Cordoba  Sun Mar  3 02:00:00 1991 UT = Sat Mar  2 22:00:00 1991 -04 isdst=0 gmtoff=-14400
America/Cordoba  Sun Oct 20 03:59:59 1991 UT = Sat Oct 19 23:59:59 1991 -04 isdst=0 gmtoff=-14400
America/Cordoba  Sun Oct 20 04:00:00 1991 UT = Sun Oct 20 02:00:00 1991 -02 isdst=1 gmtoff=-7200
America/Cordoba  Sun Mar  1 01:59:59 1992 UT = Sat Feb 29 23:59:59 1992 -02 isdst=1 gmtoff=-7200
America/Cordoba  Sun Mar  1 02:00:00 1992 UT = Sat Feb 29 23:00:00 1992 -03 isdst=0 gmtoff=-10800
America/Cordoba  Sun Oct 18 02:59:59 1992 UT = Sat Oct 17 23:59:59 1992 -03 isdst=0 gmtoff=-10800
America/Cordoba  Sun Oct 18 03:00:00 1992 UT = Sun Oct 18 01:00:00 1992 -02 isdst=1 gmtoff=-7200
America/Cordoba  Sun Mar  7 01:59:59 1993 UT = Sat Mar  6 23:59:59 1993 -02 isdst=1 gmtoff=-7200
America/Cordoba  Sun Mar  7 02:00:00 1993 UT = Sat Mar  6 23:00:00 1993 -03 isdst=0 gmtoff=-10800
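
The asymmetry is easier to see as arithmetic. Here is a minimal sketch that hard-codes the gmtoff and isdst values from the zdump output above and computes the size of each jump. The names `transitions` and `jump` are made up for illustration and are not part of pytz or zoneinfo.

```python
from datetime import timedelta

# (date, (gmtoff, isdst) before, (gmtoff, isdst) after), copied by hand
# from the zdump lines above. gmtoff is in seconds.
transitions = [
    ("1991-03-03", (-7200, 1), (-14400, 0)),
    ("1991-10-20", (-14400, 0), (-7200, 1)),
    ("1992-03-01", (-7200, 1), (-10800, 0)),
    ("1992-10-18", (-10800, 0), (-7200, 1)),
    ("1993-03-07", (-7200, 1), (-10800, 0)),
]


def jump(before, after):
    """Signed change in UTC offset across a single transition."""
    return timedelta(seconds=after[0] - before[0])


jumps = {date: jump(before, after) for date, before, after in transitions}

for date, delta in jumps.items():
    # Express each jump in hours for readability.
    print(date, delta / timedelta(hours=1))
```

The October 1991 STD → DST jump is +2 hours, while the March 1992 DST → STD jump is only -1 hour, so the two ends of the same DST period suggest different DST amounts.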

Looking at the way this is done in zoneinfo, it seems like it might be just lucky that the algorithm assigns the DST offset based on the DST → STD transition and not on the STD → DST transition.

It may be better to check both the DST → STD and STD → DST transitions and see if there's disagreement, and if one of them assigns 1 hour, choose that one. America/Indiana/Winamac had a similar transition in 2007 as well. What I find interesting with America/Indiana/Winamac is that pytz gets dst() wrong in 2007, but not in 2008, when it uses EST again. Not sure what that's about.
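
A rough sketch of the "check both transitions" idea, under some loud assumptions: the input is a hand-built chronological list of `(utcoff, isdst)` periods, `guess_dst` is a hypothetical helper (not code from pytz or zoneinfo), and the tie-break when neither neighbour gives exactly 1 hour (prefer the DST → STD value) is an arbitrary choice made only for this example.

```python
from datetime import timedelta

ONE_HOUR = timedelta(hours=1)


def guess_dst(periods):
    """Guess a DST offset for each (utcoff, isdst) period.

    Returns one entry per period: timedelta(0) for standard time, a
    timedelta for DST periods, or None if no standard neighbour exists.
    """
    guesses = []
    for i, (utcoff, isdst) in enumerate(periods):
        if not isdst:
            guesses.append(timedelta(0))
            continue

        before = after = None
        # STD -> DST: compare with the preceding period if it is standard.
        if i > 0 and not periods[i - 1][1]:
            before = utcoff - periods[i - 1][0]
        # DST -> STD: compare with the following period if it is standard.
        if i + 1 < len(periods) and not periods[i + 1][1]:
            after = utcoff - periods[i + 1][0]

        if ONE_HOUR in (before, after):
            # The two sides may disagree; if either says 1 hour, take it.
            guesses.append(ONE_HOUR)
        else:
            # Assumed tie-break: prefer DST -> STD, then STD -> DST.
            guesses.append(after if after is not None else before)
    return guesses


def h(n):
    return timedelta(hours=n)


# America/Cordoba, 1991-1993, offsets taken from the zdump output above.
cordoba = [(h(-4), False), (h(-2), True), (h(-3), False),
           (h(-2), True), (h(-3), False)]
print(guess_dst(cordoba))
```

For the first DST period the STD → DST side suggests 2 hours and the DST → STD side suggests 1 hour, so this sketch picks 1 hour. It says nothing about cases where neither neighbour is standard time.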

Looks like Europe/Simferopol is a similar situation, except it goes EET (+2, STD) → EEST (+3, DST) → MSD (+4, DST) → MSK (+3, STD), with those last two repeating afterwards.

The Europe/Madrid discrepancy is a bug in zoneinfo: the zone goes STD → DST → Double DST → DST → STD, and zoneinfo doesn't pick up on that.

¹Awesomely, it seems that America/Cordoba also underwent their transition at the amazing edge case of February 29th at midnight.

pganssle • Jun 09 '20 19:06

The DST offset is somewhat ill defined, as it is the difference between the wallclock time and the 'standard' offset. The 'standard' offset can be debatable. The offset before the dst period? The offset after the dst period? And what about if there are multiple different DST periods without 'standard' periods, or the reverse?

pytz never actually cared, and the values from utcoffset and dst boiled down to whatever truth or lies the datetime library needed to be told to get arithmetic and timezone conversions correct. The new zoneinfo library should not need to make that compromise, and hopefully give saner results (as far as possible in this insane domain).

(I have not looked closely at the examples yet, and do not know if the discrepancies are because 'it depends', or if pytz or zoneinfo are wrong)

stub42 • Jun 10 '20 06:06

> The DST offset is somewhat ill defined, as it is the difference between the wallclock time and the 'standard' offset. The 'standard' offset can be debatable. The offset before the dst period? The offset after the dst period? And what about if there are multiple different DST periods without 'standard' periods, or the reverse?

I tend to agree that DST is a very leaky abstraction, and also I don't find it useful for anything except for compiling lists of weird time zone edge cases. That said, there is a canonical answer here, because the amount of DST is encoded in the source material (just not the zic-compiled binaries). The rule for Europe/Simferopol is here, for example.
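
To illustrate what "encoded in the source material" means: in the tzdata source files, each `Rule` line carries a SAVE column giving the amount of DST. Below is a minimal sketch that pulls that column out of a Rule line. The helpers `parse_save` and `rule_save` are hypothetical, the parser only handles plain `[-]h[:mm[:ss]]` values, and the two sample lines are written in the style of the US rules, so treat their exact text as illustrative.

```python
from datetime import timedelta


def parse_save(field):
    """Parse a SAVE value such as "0", "1:00" or "-1:00" into a timedelta.

    Simplification: only plain [-]h[:mm[:ss]] forms are handled.
    """
    sign = -1 if field.startswith("-") else 1
    parts = [int(p) for p in field.lstrip("-").split(":")]
    parts += [0] * (3 - len(parts))  # pad missing minutes/seconds
    hours, minutes, seconds = parts
    return sign * timedelta(hours=hours, minutes=minutes, seconds=seconds)


def rule_save(line):
    """Return the SAVE amount from one Rule line.

    Assumed column layout:
    Rule NAME FROM TO - IN ON AT SAVE LETTER/S
    """
    fields = line.split()
    if len(fields) < 10 or fields[0] != "Rule":
        raise ValueError("not a Rule line: %r" % line)
    return parse_save(fields[8])


# Illustrative lines in the tzdata Rule format.
print(rule_save("Rule US 2007 max - Mar Sun>=8 2:00 1:00 D"))
print(rule_save("Rule US 2007 max - Nov Sun>=1 2:00 0 S"))
```

The compiled TZif files keep only the total UTC offset and the isdst flag for each period, which is why the libraries have to guess at this value.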

In the process of creating some shims to deprecate localize and normalize, I ran some property tests that (because of the nature of the shims) end up testing dst() results for zoneinfo.ZoneInfo, pytz and dateutil.tz.gettz, within the range in which the V1 and V2 TZif files are identical. I found a bunch of edge cases where one or more of pytz, dateutil and zoneinfo get it wrong. The nature of the tests is such that I wouldn't detect any cases where all three get it wrong, so there may be more 😅. I'm hoping to come up with a simple set of heuristics that will cover at least all the existing rules.

That said, even if we give up on the goal of having dst() work as specified in the documentation, I still think it would make sense to aim for dst() to return a truthy answer when isdst=1 and a falsy answer when isdst=0. For example, in 1999 Argentina decided to switch to using DST but didn't change their UTC offset; then they decided not to do DST anymore, so at the next transition they just went from considering it DST to considering it STD. This feels like a very weird thing to include in the time zone database, and it does feel like an "it depends" kind of situation, but on the other hand it's in there, and it shows up in zdump:

$ zdump -V -c1998,2001 'America/Buenos_Aires'
America/Buenos_Aires  Sun Oct  3 02:59:59 1999 UT = Sat Oct  2 23:59:59 1999 -03 isdst=0 gmtoff=-10800
America/Buenos_Aires  Sun Oct  3 03:00:00 1999 UT = Sun Oct  3 00:00:00 1999 -03 isdst=1 gmtoff=-10800
America/Buenos_Aires  Fri Mar  3 02:59:59 2000 UT = Thu Mar  2 23:59:59 2000 -03 isdst=1 gmtoff=-10800
America/Buenos_Aires  Fri Mar  3 03:00:00 2000 UT = Fri Mar  3 00:00:00 2000 -03 isdst=0 gmtoff=-10800

In zoneinfo.ZoneInfo, the heuristic I used is that if something has isdst=1 and I can't find a non-zero DST offset, I go with 1 hour, which is usually correct.
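
As a sketch of that fallback (this is not the actual zoneinfo code; `dst_with_fallback` and its arguments are names invented for this example), the idea is that an isdst=1 period never ends up with a falsy dst():

```python
from datetime import timedelta


def dst_with_fallback(isdst, computed_dst):
    """Apply the "assume 1 hour" fallback to a computed DST offset.

    computed_dst is whatever offset was inferred from neighbouring
    transitions; it may be None or timedelta(0) if nothing was found.
    """
    if not isdst:
        return timedelta(0)
    if computed_dst:
        # A non-zero offset was found, so keep it.
        return computed_dst
    # isdst=1 but no non-zero offset could be found: assume 1 hour.
    return timedelta(hours=1)


# America/Buenos_Aires, October 1999: the UTC offset stays at -03 while
# isdst flips to 1, so comparing neighbouring offsets yields zero.
before = timedelta(seconds=-10800)
during = timedelta(seconds=-10800)
result = dst_with_fallback(True, during - before)
print(result, bool(result))
```

With this fallback the 1999-2000 Buenos Aires period gets a truthy dst() even though its UTC offset never changed.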

> The new zoneinfo library should not need to make that compromise, and hopefully give saner results (as far as possible in this insane domain).

I'm not sure what you mean by "should not need to make that compromise". The new zoneinfo module and pytz both use the same underlying data (in fact in some respects zoneinfo has less information, because it updates independently of the time zone data, and as a result I would only have the option to look at tzdata.zi or one of the input sources if it happens to be deployed).

My rough plan right now is to try to get access to the time zone data in a format that preserves the amount of DST offset (I might be able to achieve that with dateutil.tz.tzical and a suitable ical compiler) and use that to run exhaustive tests on the full TZ database while I work out what combination of heuristics gives us the right answer the largest fraction of the time.

pganssle • Jun 10 '20 12:06