pandas icon indicating copy to clipboard operation
pandas copied to clipboard

ENH: timezone-aware `Period`

Open rwijtvliet opened this issue 3 years ago • 30 comments

Is your feature request related to a problem?

A Timestamp with a freq attribute can be used as a replacement/workaround for a Period, if time-zone awareness is needed. Below is an example.

With the deprecation of the .freq attribute for individual timestamps, this workaround will stop working in a future release.

Describe the solution you'd like

Implementation of timezone-awareness in the Period class.

API breaking implications

Period and PeriodIndex initializers take an additional tz parameter.

Describe alternatives you've considered

The alternative is what I'm currently using (Timestamp.freq) but will be deprecated.

Additional context

Current implementation of this functionality:

import pandas as pd


def ts_right(ts_left: pd.Timestamp) -> pd.Timestamp:
    """Right-bound timestamp associated with a (left-bound) timestamp."""
    if ts_left.freq == "D":
        kwargs = {"days": 1}
    elif ts_left.freq == "MS":
        kwargs = {"months": 1}
    elif ts_left.freq == "QS":
        kwargs = {"months": 3}
    elif ts_left.freq == "AS":
        kwargs = {"years": 1}
    else:
        raise ValueError(f"Invalid frequency: {ts_left.freq}.")
    return ts_left + pd.DateOffset(**kwargs)


def duration(ts_left: pd.Timestamp) -> float:
    """Duration [h] associated with a timestamp."""
    return (ts_right(ts_left) - ts_left).total_seconds() / 3600


ts1 = pd.Timestamp("2022-03-01", tz="Europe/Berlin", freq="D")  # "Timestamp.freq is deprecated and will be removed in a future version"
ts_right(ts1)  # Timestamp('2022-03-02 00:00:00+0100', tz='Europe/Berlin')
duration(ts1)  # 24.0

ts2 = pd.Timestamp("2022-03-27", tz="Europe/Berlin", freq="D")  # "Timestamp.freq is deprecated and will be removed in a future version"
ts_right(ts2)  # Timestamp('2022-03-28 00:00:00+0200', tz='Europe/Berlin')
duration(ts2)  # 23.0

rwijtvliet avatar Jan 31 '22 11:01 rwijtvliet

Is there a reason you can't pass freq as a separate argument to ts_right?

jbrockmendel avatar Jan 31 '22 18:01 jbrockmendel

Thanks @jbrockmendel. The Timestamp instances are used throughout my code to indicate periods of various natural lengths - an hour, a day, a month, etc. Without the freq attribute, I'd need to start passing two pieces of information around, or think of a replacement objects, e.g. (Timestamp, freq)-tuples, or custom dataclass instances.

In addition, the integration with the pandas.DatetimeIndex would be lost. Currently, if a DatetimeIndex has the .freq attribute set, each element is a pandas.Timestamp instance with the same .freq attribute. I can't imagine how I'll deal with the situations where I currently take a single element from such an index, which will in the future lose that .freq attribute.

(Including where this is done under the hood, e.g. when selecting a single Series entry or DataFrame row with .loc or iloc. etc.)

Don't get me wrong, I agree that timestamps denote infinitely short moments in time and have no 'duration' associated with them. But it's a workaround until Periods become timezone-aware - and the mentioned deprecation puts me in a bit of a bind.

rwijtvliet avatar Feb 01 '22 13:02 rwijtvliet

Semantically Periods are spans of time (akin to timedeltas) which are supposed to be agnostic to timezone arithmetic in the first place, so I don't think it makes much sense to add timezone to a period

mroeschke avatar Feb 10 '22 23:02 mroeschke

Thanks for your reply @mroeschke. Could you please expand on that? I've seen similar comments before, but I never understand the reasoning behind them.

The analogy to timestamps is very clear to me:

  • A timestamp can be agnostic regarding UTC-offset, but, in order to unambiguously specify the moment that is meant, a UTC-offset must be included. 2022-02-11T15:00:00 (i.e., UTC-offset-agnostic) only works if the use-case implies a single non-changing timezone. If you tell me: "meet me (online) at 2022-02-11T15:00:00", it will not work unless you also tell me the UTC-offset. I think we all agree on this.
  • Analogously: a time period can be timezone-agnostic, but, in order to unambiguously specify the time span that is meant, a timezone must be included. The calendar day 2022-02-11 (i.e., timezone-agnostic) only works if the use-case implies a single non-changing timezone. If you tell me: "send me a message sometime during the calendar day 2022-02-11 (and not on the day before or the day after)" I can't tell when that period is exactly, unless you also tell me its timezone.

The natural relation between periods and timestamps is that some timestamps fall inside the period, and other do not. If the timestamp is UTC-offset-agnostic, I can verify if it falls inside a timezone-agnostic period. Likewise, I should be able to verify if a timestamp with a UTC-offset falls inside a timezone-aware period. I.e., the timestamp 2022-02-11T01:00:00+04:00 falls within the calendar day 2022-02-11 of the Aukland timezone, but not of the New York timezone.

rwijtvliet avatar Feb 11 '22 10:02 rwijtvliet

Now, that is not to say that I don't see that timezone-awareness in periods is more complex than in timestamps. Here are some of the difficulties:

  • I can convert a timestamp from one UTC-offset to another. The same is not possible for periods. The calendar day 2022-03-27 in the Aukland timezone cannot be converted to an equally natural period in the New York timezone, simply because it cuts across 2 calendar days in that timezone. However, this is not a bug, but a feature: it's simply the reality of the situation.

  • A timestamp belongs to exactly one timezone, expressed as offset from UTC, like +04:00 in my previous post. Periods can start with a timestamp having one offset, and end with a timestamp having another. For example, the 2022-03-27 in the Europe/Berlin timezone starts at 2022-03-27T00:00:00+01:00 and ends at 2022-03-28T00:00:00+02:00, and therefore only spans 23 * 3600 seconds, rather than the usual 24 * 3600 seconds. This too, however, is the reality of the situation, and not a bug but a feature.

NB: The latter example is what I'm currently using timezone-aware timestamps with frequency for. start = pd.Timestamp('2022-03-27 00:00', tz='Europe/Berlin', freq='D') gives me the correct start timestamp (2022-03-27 00:00:00+0100') and start + 1 * start.freq gives the correct end timestamp (2022-03-28 01:00:00+0200), note the difference in UTC offsets.

I hope I've been able to make my case :) But please share your perspectives and/or try to convince me otherwise.

rwijtvliet avatar Feb 11 '22 10:02 rwijtvliet

One final thought. I don't want to go into possible implementation details here, but I think it's important to realize that a timezone can be expressed in two ways: as an offset from UTC (e.g., '+01:00'), or as tied to a geographic location (e.g., 'Europe/Berlin').

  • The only way that makes sense for timezone-aware timestamps is the offset from UTC. Using the location can be ambiguous: 2022-10-30 02:30 (Europe/Berlin) can be either +01:00 or +02:00 due to the DST changeover

  • The only way that makes sense for timezone-aware periods is the geographic location, as the start and end timestamps might have distinct UTC-offsets, as in the example in my previous post.

rwijtvliet avatar Feb 11 '22 11:02 rwijtvliet

a period could have time zone aware end points but i don't see a good reason for a time zone aware duration (not to mention a massive increase in complexity)

same reason we are killing the freq attribute in a Timestamp which is an instant

jreback avatar Feb 11 '22 11:02 jreback

If the end points are timezone-aware, the duration (which I understand to be the timedelta duration = end-start) is already taken care of.

rwijtvliet avatar Feb 11 '22 11:02 rwijtvliet

exactly so why would you then need a time zone if a Period? or rather what does it mean at all? what if u have naive endpoints? what if it had time zone aware endpoints?

this is such a huge ambiguity that it is unworkable

jreback avatar Feb 11 '22 11:02 jreback

Let me try to make it more clear:

  • Currently implemented: Timezone-agnostic periods which have UTC-offset-agnostic start and end timestamps. E.g. the calendar day 2022-02-11 which is the right-open interval between timestamp 2022-02-11T00:00:00 (incl) and timestamp 2022-02-12T00:00:00 (excl).

    • Calculating the duration as the difference between the end and the start gives 24 * 3600 seconds.
    • This calculation always gives 24 * 3600 seconds, whichever day is taken.
  • My proposal: timezone-aware periods which have UTC-offset-aware start and end timestamps. E.g., the calendar day 2022-02-11 Europe/Berlin is the right-open interval between timestamp 2022-02-11T00:00:00+01:00 (incl) and timestamp 2022-02-12T00:00:00+01:00 (excl). Note that the period contains the geographic location and the timestamps contain the UTC offset.

    • Calculating the duration as the difference between the end and the start also gives 24 * 3600 seconds.
    • But using the example given before: the calendar day 2022-03-27 Europe/Berlin is from midnight in UTC +01:00 to midnight in UTC +02:00, and therefore gives a duration of 23 * 3600 seconds.

Could you please elaborate, where you think the ambiguity comes in?

If it's not clear, why such a timezone-aware period might be needed: it is needed whenever the actual duration is relevant in order to aggregate values. I will give some use-cases that I can think of:

  • A power plant produces a constant 100 MW of power throughout the year. We need to calculate, how much energy that is for each calendar day (or worse, calendar month) by multiplying the 100 MW with the number of seconds in each period.
  • Renting a device at a per-hour price of 10 USD/h. Renting the device for a calendar day costs me 240 USD on most days, but 230 USD or 250 USD if there is a summertime/wintertime changeover.
  • A worker has flexible working hours, and is paid a bonus based on the average number of gadget he produces per hour. On a day that has only 23 hours, he should not be expected to have the same output as on a day that has 24 hours.

Note that this is not much different from the case where months have variable lengths. Just that the variation comes from another source, namely that a timezone (Europe/Berlin) switches UTC offset. But these moments are well-known.

rwijtvliet avatar Feb 11 '22 11:02 rwijtvliet

@rwijtvliet you are missing the point

having start/end points as tz aware is possible ; these describe instants of time.

but have a tz attribute on a period itself doesn't make sense; it is literally a duration / time delta, which doesn't have a notion of a time zone

geographic location is also not relevant (well it maybe relevant to you but not to a generic concept; and out of scope)

jreback avatar Feb 11 '22 12:02 jreback

having start/end points as tz aware is possible ; these describe instants of time

@jreback can you show me an example of such a period with a tz-aware start/end point?

rwijtvliet avatar Feb 11 '22 13:02 rwijtvliet

having start/end points as tz aware is possible ; these describe instants of time

@jreback can you show me an example of such a period with a tz-aware start/end point?

These dont' exist but I suppose could. These would do almost exactly what you want and not break the idiom

jreback avatar Feb 11 '22 14:02 jreback

Do you have thoughts on how they would they be constructed or used? I can currently do something like p = pd.Period('2022-03-27'), and pandas correctly assumes I want a period with a duration of a day. I don't see (yet) how I could then set the timezone of each end point.

But if there is indeed a way to implement that for Period and PeriodIndex instances, while retaining their freq attribute, that would solve my problems and I could move away from the Timestamp + freq workaround I currently have.

rwijtvliet avatar Feb 11 '22 14:02 rwijtvliet

If the timezone is important for your application, since it sounds you are really interested in a start date w/ timezone + a duration as a "period", I think you may want to use Interval / IntervalIndex instead

mroeschke avatar Feb 11 '22 19:02 mroeschke

@rwijtvliet in the not-too-distant future we'll have Timestamps that have resolution other than nanosecond, e.g. second or millisecond, and those will continue to have timezone support. That is a bit like what you are suggesting, but won't go all the way up to e.g. MonthStart.

For your use case, could you just pin the freq to the Timestamp object under a new name? ts.old_freq = whatever

jbrockmendel avatar Feb 11 '22 21:02 jbrockmendel

As reference, R's lubridate package (modern date/times) has a interval object which is somewhat their equivalent of Period (they also have a period, but that's more like a timedelta, or rather dateutil's relativedelta), see https://lubridate.tidyverse.org/articles/lubridate.html#time-intervals. This works with tz-aware start / end timestamps (and the same is true for example for Java's jodatime). However, lubridate's interval is actually stored as start/end timestamps, rather than the freq+ordinal as our Periods. And thus storage-wise is like our pd.Interval. As @mroeschke mentions, that might also something to check if that could work for your application, and intervals already supports tz-aware left/right values (although it has for the rest no period/timestamp specific functionality).


I think it's very natural to want to have Periods that are (in some way) timezone-aware. Thanks @rwijtvliet for making the case.

I only don't know if the Period class as it is currently implemented is suited to add something like that (or whether it would be worth the complexity).

jorisvandenbossche avatar Feb 11 '22 21:02 jorisvandenbossche

Ok thanks guys for all your replies.

My current takeaway is

  • Timestamp and DatetimeIndex can handle UTC-offsets, but describe instants in time. They shouldn't be used to describe intervals in time (no discussion there), and I don't see a way to keep using them for that once freq is deprecated. (Appending old_freq is not an option when indexing from a DatetimeIndex.)

  • Period and PeriodIndex are intervals in time, but cannot handle timezones, and are therefore not the correct class for my use-case either. Timezone-awareness is beyond their scope and will not be implemented.

rwijtvliet avatar Feb 17 '22 10:02 rwijtvliet

I had a quick look at Interval and IntervalIndex to see if they might be the thing I should use:

>>> i = pd.date_range('2022', freq='MS', periods=5, tz='Europe/Berlin')
>>> idx = pd.IntervalIndex.from_arrays(i, i + i.freq, closed='left')
>>> idx[2].left, idx[2].right # so far, so good
Timestamp('2022-03-01 00:00:00+0100', tz='Europe/Berlin', freq='MS'), Timestamp('2022-04-01 00:00:00+0200', tz='Europe/Berlin') # (the 'freq' attribute will deprecate shortly)
>>> s = pd.Series([1,2,30,400,-500], idx)

Some quick questions for me to judge their usefulness:

  • Can I resample s to daily or quarterly values?

  • Is there a way to get, from idx and any of its Interval elements, the information that each interval is a calendar month? Other than passing it through a long if elif elif block, like so:

def freq(i:pd.Interval):
    if pd.Timedelta(hours=23) <= i.length <= pd.Timedelta(hours=25): # got to consider variable day length due to DST-changeover
        return pd.offsets.Day()
    elif pd.Timedelta(days=27, hours=23) <= i.length <= pd.Timedelta(days=31, hours=1): # got to consider variable number of days and their variable length due to DST-changeover
        if i.left.day == 1 and i.left == i.left.floor('D'):
           return pd.offsets.MonthBegin()
        # ... and many other checks for various edge cases and particulars of time intervals based on human calendars.

If I'm honest: even if I get this to work, and I find a way to use Interval and IntervalIndex to represent timezone-aware time periods... i'm afraid it will be just as bad a workaround (to Period and PeriodIndex not being timezone-aware) as using Timestamp and DatetimeIndex is.

rwijtvliet avatar Feb 17 '22 10:02 rwijtvliet

(Appending old_freq is not an option when indexing from a DatetimeInterval.)

Can you clarify what you mean here? There is no DatetimeInterval.

jbrockmendel avatar Feb 18 '22 00:02 jbrockmendel

(Appending old_freq is not an option when indexing from a DatetimeInterval.)

Can you clarify what you mean here? There is no DatetimeInterval.

Ah, @jbrockmendel , I meant DatetimeIndex; I edited the original post.

Here's a concrete example of one of the difficulties I see with the "add .old_freq to a timestamp" suggestion. I find it hard to be concise, sorry about that.

import pandas as pd

# Example: running costs of a machine that is always on.
idx = pd.date_range("2022", freq="MS", periods=12, tz="Europe/Berlin")
costs = pd.DataFrame({"USD_per_h": 1.0}, idx)  # fixed running costs for sake of example

# The total costs can be calculated with this function.
def costs_USD(ts, costs_USD_per_h):
    timedelta = (ts + ts.freq) - ts
    hours = timedelta.total_seconds() / 3600
    return costs_USD_per_h * hours


# (A) As long as I'm working with entire series/dataframes, this works, as the `freq` attribute of the index is available, also in the future:
costs["USD"] = costs_USD(costs.index, costs.USD_per_h)

# Note this is correct, with odd number in March and Oct which have DST-change
# costs
# 	                    USD_per_h    USD
# 2022-01-01 00:00:00+01:00       1.0  744.0
# 2022-02-01 00:00:00+01:00       1.0  672.0
# 2022-03-01 00:00:00+01:00       1.0  743.0
# 2022-04-01 00:00:00+02:00       1.0  720.0
# 2022-05-01 00:00:00+02:00       1.0  744.0
# 2022-06-01 00:00:00+02:00       1.0  720.0
# 2022-07-01 00:00:00+02:00       1.0  744.0
# 2022-08-01 00:00:00+02:00       1.0  744.0
# 2022-09-01 00:00:00+02:00       1.0  720.0
# 2022-10-01 00:00:00+02:00       1.0  745.0
# 2022-11-01 00:00:00+01:00       1.0  720.0
# 2022-12-01 00:00:00+01:00       1.0  744.0

# (B) But, I might deal with a single timestamp and single value; and a 'lone timestamp' will be passed to the function:
costs_in_january = costs_USD(costs.index[0], costs.USD_per_h[0]) # FutureWarning

# (C) Or, I (for whatever reason) might want/have to apply a function row-wise; here too a 'lone timestamp' is passed:
costs['USD'] = costs.apply(lambda row: costs_USD(row.name, row.USD_per_h), axis=1)

Surely, there are workarounds once the freq attribute is lost, but they are all cumbersome and remove flexibility.

My point concerning your suggestion in particular is that, yes, I could add old_freq as an attribute to a timestamp. I could then change the costs_USD function to something like this:

def costs_USD2(ts, costs_USD_per_h):
    if hasattr(ts, "old_freq"):  # or catch AttributeError
        freq = ts.old_freq
    else:
        freq = ts.freq
    timedelta = (ts + freq) - ts
    hours = timedelta.total_seconds() / 3600
    return costs_USD_per_h * hours

And then (A) would still work, and for (B) I could do some prep-work before calling:

ts = costs.index[0]
ts.old_freq = costs.index.freq
costs_in_january = costs_USD2(ts, costs.USD_per_h[0])  # No FutureWarning

However, (C) would not work anymore, and in general, I'd need to remember to add the old_freq attribute whenever I store an index value to a separate variable or deal with individual timestamps.

rwijtvliet avatar Feb 18 '22 12:02 rwijtvliet

@rwijtvliet i appreciate the discussion but freq is being removed partially because of the added complexity

i don't know what to tell you except try interval index

jreback avatar Feb 18 '22 12:02 jreback

Yes, I understand and appreciate that, @jreback. Having a freq attribute for an individual timestamp makes no sense, so it makes sense to deprecate it. And as I wrote, I am aware that I'm using timestamps for something they are not meant for - namely, for time periods - and I'd be more than willing to refactor and use something that is better suited.

However, there is no alternative available that is better suited.

  • The Period class does not (and will not) have a notion of a timezone. It has the incorrect timedelta in some months (like March and Oct in my example), and resampling a Period object representing a year, into hours, returns a PeriodIndex that contains a non-existing hour (the 3rd hour of the final Sunday in March) and misses an existing hour (the 4th hour of the final Sunday in October). This was the entire point of me opening this thread.

  • The Interval object has no notion of freq (or any way to answer the question "would we call this a month, a year, a day, a quarter, or an hour"), and cannot be resampled, just to name a major issue.

I guess what surprises me the most is that, from your comments, I'm apparently the only person that needs this functionality. But if that's how it is, and you also see no options for me, I'll have to get creative and add workarounds. I just wanted to make sure that that is really how it is, before I start - because I really doubt it.

Thanks for your time and input; if you have any questions about my use case or if I can help clarify things, feel free to contact me directly or in this thread.

rwijtvliet avatar Feb 18 '22 13:02 rwijtvliet

If pinning an old_freq attr isn't viable, can you make the timezones unnecessary? Could tz_convert to UTC, then do everything with Periods.

jbrockmendel avatar Feb 18 '22 21:02 jbrockmendel

Sorry didn't mean to close

rwijtvliet avatar Feb 19 '22 08:02 rwijtvliet

(damn I shouldn't be doing this on my phone 😬)

Thanks @jbrockmendel for your continued constructive input.

I thought about that as well, but I see many problems there, too. E.g.

  • can't aggregate (downsample) because incorrect periods are grouped;
  • local time is important. Something that happens 12:00-13:00 every day is shifted by one hour during half of the year when viewed in UTC;
  • also, duration is important. Just like it's an unacceptable simplification to treat every month as having 30 days, so is it (in my use case) to treat every day as having 24 hours.

rwijtvliet avatar Feb 19 '22 08:02 rwijtvliet

If I were interested in trying to write up this functionality for a timezone-aware Period (let's call it a Zeriod unless someone has a better idea), and the associated Index (ZeriodIndex) myself, how would I go about it?

I'm thinking about wrappers around the Timestamp and DatetimeIndex classes, with some restrictions on the allowed frequencies and custom attributes.

The ZeriodIndex should be usable as index for Series and DataFrame instances, and these objects should be .resampleable and .groupbyable, just as when they have e g. a DatetimeIndex.

It's probably a lot more involved than I'm naively thinking, and it'll be a while before I could actually start on it. But that gives us some time to discuss if it's possible and worth doing - I would appreciate some transparent opinions and open discussion on this.

rwijtvliet avatar Feb 20 '22 00:02 rwijtvliet

discuss if it's possible and worth doing

Surely possible, but daunting.

Going back to the old_freq idea, what if you just patched DatetimeIndex's __getitem__ and __iter__ to do:

if not hasattr(ts, "freq"):  # i.e. deprecation has been enforced
    ts.freq = self.freq
return ts 

jbrockmendel avatar Feb 20 '22 21:02 jbrockmendel

That's an interesting suggestion that I hadn't thought about. This does indeed work:

import pandas as pd

def apply_wrapper(Index):
    if hasattr(Index, '_iswrapped'):
        return # Avoid wrapping multiple times
    
    getitem_original = Index.__getitem__
    def getitem_wrapped(self, item):
        ts = getitem_original(self, item)
        if not hasattr(ts, 'freq'):
            ts.freq = self.freq
        return ts    
    Index.__getitem__ = getitem_wrapped

    # ToDo: add similar thing for `__iter__`

    Index._iswrapped = True

apply_wrapper(pd.DatetimeIndex)

I still want to make that Zeriod class, but I'm glad that there is a work-around that makes this less urgent. Many thanks @jbrockmendel !

rwijtvliet avatar Feb 23 '22 12:02 rwijtvliet

If we ever decide to implement a timezone-aware period, I think the way to go is to inherit from pandas.Interval and add implementations for all the various named 'freq' durations. Perhaps it even makes sense to have separate inherited classes for each 'freq', making the new periods open to extension.

Relying on pandas.Interval has the advantage that this class already supports calculating duration (i.e. as a timedelta), checking if a time falls inside, as well as various ways to deal with open/closed intervals.

joooeey avatar Aug 05 '22 12:08 joooeey