pandas
pandas copied to clipboard
ENH: timezone-aware `Period`
Is your feature request related to a problem?
A Timestamp
with a freq
attribute can be used as a replacement/workaround for a Period
, if time-zone awareness is needed. Below is an example.
With the deprecation of the .freq
attribute for individual timestamps, this workaround will stop working in a future release.
Describe the solution you'd like
Implementation of timezone-awareness in the Period
class.
API breaking implications
Period
and PeriodIndex
initializers take an additional tz
parameter.
Describe alternatives you've considered
The alternative is what I'm currently using (Timestamp.freq
) but will be deprecated.
Additional context
Current implementation of this functionality:
import pandas as pd
def ts_right(ts_left: pd.Timestamp) -> pd.Timestamp:
"""Right-bound timestamp associated with a (left-bound) timestamp."""
if ts_left.freq == "D":
kwargs = {"days": 1}
elif ts_left.freq == "MS":
kwargs = {"months": 1}
elif ts_left.freq == "QS":
kwargs = {"months": 3}
elif ts_left.freq == "AS":
kwargs = {"years": 1}
else:
raise ValueError(f"Invalid frequency: {ts_left.freq}.")
return ts_left + pd.DateOffset(**kwargs)
def duration(ts_left: pd.Timestamp) -> float:
"""Duration [h] associated with a timestamp."""
return (ts_right(ts_left) - ts_left).total_seconds() / 3600
ts1 = pd.Timestamp("2022-03-01", tz="Europe/Berlin", freq="D") # "Timestamp.freq is deprecated and will be removed in a future version"
ts_right(ts1) # Timestamp('2022-03-02 00:00:00+0100', tz='Europe/Berlin')
duration(ts1) # 24.0
ts2 = pd.Timestamp("2022-03-27", tz="Europe/Berlin", freq="D") # "Timestamp.freq is deprecated and will be removed in a future version"
ts_right(ts2) # Timestamp('2022-03-28 00:00:00+0200', tz='Europe/Berlin')
duration(ts2) # 23.0
Is there a reason you can't pass freq as a separate argument to ts_right?
Thanks @jbrockmendel. The Timestamp
instances are used throughout my code to indicate periods of various natural lengths - an hour, a day, a month, etc. Without the freq
attribute, I'd need to start passing two pieces of information around, or think of a replacement objects, e.g. (Timestamp
, freq
)-tuples, or custom dataclass
instances.
In addition, the integration with the pandas.DatetimeIndex
would be lost. Currently, if a DatetimeIndex
has the .freq
attribute set, each element is a pandas.Timestamp
instance with the same .freq
attribute. I can't imagine how I'll deal with the situations where I currently take a single element from such an index, which will in the future lose that .freq
attribute.
(Including where this is done under the hood, e.g. when selecting a single Series
entry or DataFrame
row with .loc
or iloc
. etc.)
Don't get me wrong, I agree that timestamps denote infinitely short moments in time and have no 'duration' associated with them. But it's a workaround until Period
s become timezone-aware - and the mentioned deprecation puts me in a bit of a bind.
Semantically Periods are spans of time (akin to timedeltas) which are supposed to be agnostic to timezone arithmetic in the first place, so I don't think it makes much sense to add timezone to a period
Thanks for your reply @mroeschke. Could you please expand on that? I've seen similar comments before, but I never understand the reasoning behind them.
The analogy to timestamps is very clear to me:
- A timestamp can be agnostic regarding UTC-offset, but, in order to unambiguously specify the moment that is meant, a UTC-offset must be included.
2022-02-11T15:00:00
(i.e., UTC-offset-agnostic) only works if the use-case implies a single non-changing timezone. If you tell me: "meet me (online) at2022-02-11T15:00:00
", it will not work unless you also tell me the UTC-offset. I think we all agree on this. - Analogously: a time period can be timezone-agnostic, but, in order to unambiguously specify the time span that is meant, a timezone must be included. The calendar day
2022-02-11
(i.e., timezone-agnostic) only works if the use-case implies a single non-changing timezone. If you tell me: "send me a message sometime during the calendar day 2022-02-11 (and not on the day before or the day after)" I can't tell when that period is exactly, unless you also tell me its timezone.
The natural relation between periods and timestamps is that some timestamps fall inside the period, and other do not. If the timestamp is UTC-offset-agnostic, I can verify if it falls inside a timezone-agnostic period. Likewise, I should be able to verify if a timestamp with a UTC-offset falls inside a timezone-aware period. I.e., the timestamp 2022-02-11T01:00:00+04:00
falls within the calendar day 2022-02-11
of the Aukland timezone, but not of the New York timezone.
Now, that is not to say that I don't see that timezone-awareness in periods is more complex than in timestamps. Here are some of the difficulties:
-
I can convert a timestamp from one UTC-offset to another. The same is not possible for periods. The calendar day
2022-03-27
in the Aukland timezone cannot be converted to an equally natural period in the New York timezone, simply because it cuts across 2 calendar days in that timezone. However, this is not a bug, but a feature: it's simply the reality of the situation. -
A timestamp belongs to exactly one timezone, expressed as offset from UTC, like
+04:00
in my previous post. Periods can start with a timestamp having one offset, and end with a timestamp having another. For example, the2022-03-27
in theEurope/Berlin
timezone starts at2022-03-27T00:00:00+01:00
and ends at2022-03-28T00:00:00+02:00
, and therefore only spans 23 * 3600 seconds, rather than the usual 24 * 3600 seconds. This too, however, is the reality of the situation, and not a bug but a feature.
NB: The latter example is what I'm currently using timezone-aware timestamps with frequency for. start = pd.Timestamp('2022-03-27 00:00', tz='Europe/Berlin', freq='D')
gives me the correct start timestamp (2022-03-27 00:00:00+0100'
) and start + 1 * start.freq
gives the correct end timestamp (2022-03-28 01:00:00+0200
), note the difference in UTC offsets.
I hope I've been able to make my case :) But please share your perspectives and/or try to convince me otherwise.
One final thought. I don't want to go into possible implementation details here, but I think it's important to realize that a timezone can be expressed in two ways: as an offset from UTC (e.g., '+01:00'), or as tied to a geographic location (e.g., 'Europe/Berlin').
-
The only way that makes sense for timezone-aware timestamps is the offset from UTC. Using the location can be ambiguous:
2022-10-30 02:30 (Europe/Berlin)
can be either+01:00
or+02:00
due to the DST changeover -
The only way that makes sense for timezone-aware periods is the geographic location, as the start and end timestamps might have distinct UTC-offsets, as in the example in my previous post.
a period could have time zone aware end points but i don't see a good reason for a time zone aware duration (not to mention a massive increase in complexity)
same reason we are killing the freq attribute in a Timestamp which is an instant
If the end points are timezone-aware, the duration (which I understand to be the timedelta
duration = end-start
) is already taken care of.
exactly so why would you then need a time zone if a Period? or rather what does it mean at all? what if u have naive endpoints? what if it had time zone aware endpoints?
this is such a huge ambiguity that it is unworkable
Let me try to make it more clear:
-
Currently implemented: Timezone-agnostic periods which have UTC-offset-agnostic start and end timestamps. E.g. the calendar day
2022-02-11
which is the right-open interval between timestamp2022-02-11T00:00:00
(incl) and timestamp2022-02-12T00:00:00
(excl).- Calculating the duration as the difference between the end and the start gives 24 * 3600 seconds.
- This calculation always gives 24 * 3600 seconds, whichever day is taken.
-
My proposal: timezone-aware periods which have UTC-offset-aware start and end timestamps. E.g., the calendar day
2022-02-11 Europe/Berlin
is the right-open interval between timestamp2022-02-11T00:00:00+01:00
(incl) and timestamp2022-02-12T00:00:00+01:00
(excl). Note that the period contains the geographic location and the timestamps contain the UTC offset.- Calculating the duration as the difference between the end and the start also gives 24 * 3600 seconds.
- But using the example given before: the calendar day
2022-03-27 Europe/Berlin
is from midnight in UTC+01:00
to midnight in UTC+02:00
, and therefore gives a duration of 23 * 3600 seconds.
Could you please elaborate, where you think the ambiguity comes in?
If it's not clear, why such a timezone-aware period might be needed: it is needed whenever the actual duration is relevant in order to aggregate values. I will give some use-cases that I can think of:
- A power plant produces a constant 100 MW of power throughout the year. We need to calculate, how much energy that is for each calendar day (or worse, calendar month) by multiplying the 100 MW with the number of seconds in each period.
- Renting a device at a per-hour price of 10 USD/h. Renting the device for a calendar day costs me 240 USD on most days, but 230 USD or 250 USD if there is a summertime/wintertime changeover.
- A worker has flexible working hours, and is paid a bonus based on the average number of gadget he produces per hour. On a day that has only 23 hours, he should not be expected to have the same output as on a day that has 24 hours.
Note that this is not much different from the case where months have variable lengths. Just that the variation comes from another source, namely that a timezone (Europe/Berlin
) switches UTC offset. But these moments are well-known.
@rwijtvliet you are missing the point
having start/end points as tz aware is possible ; these describe instants of time.
but have a tz attribute on a period itself doesn't make sense; it is literally a duration / time delta, which doesn't have a notion of a time zone
geographic location is also not relevant (well it maybe relevant to you but not to a generic concept; and out of scope)
having start/end points as tz aware is possible ; these describe instants of time
@jreback can you show me an example of such a period with a tz-aware start/end point?
having start/end points as tz aware is possible ; these describe instants of time
@jreback can you show me an example of such a period with a tz-aware start/end point?
These dont' exist but I suppose could. These would do almost exactly what you want and not break the idiom
Do you have thoughts on how they would they be constructed or used? I can currently do something like p = pd.Period('2022-03-27')
, and pandas
correctly assumes I want a period with a duration of a day. I don't see (yet) how I could then set the timezone of each end point.
But if there is indeed a way to implement that for Period
and PeriodIndex
instances, while retaining their freq
attribute, that would solve my problems and I could move away from the Timestamp + freq
workaround I currently have.
If the timezone is important for your application, since it sounds you are really interested in a start date w/ timezone + a duration as a "period", I think you may want to use Interval / IntervalIndex instead
@rwijtvliet in the not-too-distant future we'll have Timestamps that have resolution other than nanosecond, e.g. second or millisecond, and those will continue to have timezone support. That is a bit like what you are suggesting, but won't go all the way up to e.g. MonthStart.
For your use case, could you just pin the freq to the Timestamp object under a new name? ts.old_freq = whatever
As reference, R's lubridate
package (modern date/times) has a interval
object which is somewhat their equivalent of Period (they also have a period
, but that's more like a timedelta, or rather dateutil's relativedelta), see https://lubridate.tidyverse.org/articles/lubridate.html#time-intervals. This works with tz-aware start / end timestamps (and the same is true for example for Java's jodatime).
However, lubridate's interval
is actually stored as start/end timestamps, rather than the freq+ordinal as our Periods. And thus storage-wise is like our pd.Interval. As @mroeschke mentions, that might also something to check if that could work for your application, and intervals already supports tz-aware left/right values (although it has for the rest no period/timestamp specific functionality).
I think it's very natural to want to have Periods that are (in some way) timezone-aware. Thanks @rwijtvliet for making the case.
I only don't know if the Period
class as it is currently implemented is suited to add something like that (or whether it would be worth the complexity).
Ok thanks guys for all your replies.
My current takeaway is
-
Timestamp
andDatetimeIndex
can handle UTC-offsets, but describe instants in time. They shouldn't be used to describe intervals in time (no discussion there), and I don't see a way to keep using them for that oncefreq
is deprecated. (Appendingold_freq
is not an option when indexing from aDatetimeIndex
.) -
Period
andPeriodIndex
are intervals in time, but cannot handle timezones, and are therefore not the correct class for my use-case either. Timezone-awareness is beyond their scope and will not be implemented.
I had a quick look at Interval
and IntervalIndex
to see if they might be the thing I should use:
>>> i = pd.date_range('2022', freq='MS', periods=5, tz='Europe/Berlin')
>>> idx = pd.IntervalIndex.from_arrays(i, i + i.freq, closed='left')
>>> idx[2].left, idx[2].right # so far, so good
Timestamp('2022-03-01 00:00:00+0100', tz='Europe/Berlin', freq='MS'), Timestamp('2022-04-01 00:00:00+0200', tz='Europe/Berlin') # (the 'freq' attribute will deprecate shortly)
>>> s = pd.Series([1,2,30,400,-500], idx)
Some quick questions for me to judge their usefulness:
-
Can I resample
s
to daily or quarterly values? -
Is there a way to get, from
idx
and any of itsInterval
elements, the information that each interval is a calendar month? Other than passing it through a longif elif elif
block, like so:
def freq(i:pd.Interval):
if pd.Timedelta(hours=23) <= i.length <= pd.Timedelta(hours=25): # got to consider variable day length due to DST-changeover
return pd.offsets.Day()
elif pd.Timedelta(days=27, hours=23) <= i.length <= pd.Timedelta(days=31, hours=1): # got to consider variable number of days and their variable length due to DST-changeover
if i.left.day == 1 and i.left == i.left.floor('D'):
return pd.offsets.MonthBegin()
# ... and many other checks for various edge cases and particulars of time intervals based on human calendars.
If I'm honest: even if I get this to work, and I find a way to use Interval
and IntervalIndex
to represent timezone-aware time periods... i'm afraid it will be just as bad a workaround (to Period
and PeriodIndex
not being timezone-aware) as using Timestamp
and DatetimeIndex
is.
(Appending old_freq is not an option when indexing from a DatetimeInterval.)
Can you clarify what you mean here? There is no DatetimeInterval.
(Appending old_freq is not an option when indexing from a DatetimeInterval.)
Can you clarify what you mean here? There is no DatetimeInterval.
Ah, @jbrockmendel , I meant DatetimeIndex
; I edited the original post.
Here's a concrete example of one of the difficulties I see with the "add .old_freq
to a timestamp" suggestion. I find it hard to be concise, sorry about that.
import pandas as pd
# Example: running costs of a machine that is always on.
idx = pd.date_range("2022", freq="MS", periods=12, tz="Europe/Berlin")
costs = pd.DataFrame({"USD_per_h": 1.0}, idx) # fixed running costs for sake of example
# The total costs can be calculated with this function.
def costs_USD(ts, costs_USD_per_h):
timedelta = (ts + ts.freq) - ts
hours = timedelta.total_seconds() / 3600
return costs_USD_per_h * hours
# (A) As long as I'm working with entire series/dataframes, this works, as the `freq` attribute of the index is available, also in the future:
costs["USD"] = costs_USD(costs.index, costs.USD_per_h)
# Note this is correct, with odd number in March and Oct which have DST-change
# costs
# USD_per_h USD
# 2022-01-01 00:00:00+01:00 1.0 744.0
# 2022-02-01 00:00:00+01:00 1.0 672.0
# 2022-03-01 00:00:00+01:00 1.0 743.0
# 2022-04-01 00:00:00+02:00 1.0 720.0
# 2022-05-01 00:00:00+02:00 1.0 744.0
# 2022-06-01 00:00:00+02:00 1.0 720.0
# 2022-07-01 00:00:00+02:00 1.0 744.0
# 2022-08-01 00:00:00+02:00 1.0 744.0
# 2022-09-01 00:00:00+02:00 1.0 720.0
# 2022-10-01 00:00:00+02:00 1.0 745.0
# 2022-11-01 00:00:00+01:00 1.0 720.0
# 2022-12-01 00:00:00+01:00 1.0 744.0
# (B) But, I might deal with a single timestamp and single value; and a 'lone timestamp' will be passed to the function:
costs_in_january = costs_USD(costs.index[0], costs.USD_per_h[0]) # FutureWarning
# (C) Or, I (for whatever reason) might want/have to apply a function row-wise; here too a 'lone timestamp' is passed:
costs['USD'] = costs.apply(lambda row: costs_USD(row.name, row.USD_per_h), axis=1)
Surely, there are workarounds once the freq
attribute is lost, but they are all cumbersome and remove flexibility.
My point concerning your suggestion in particular is that, yes, I could add old_freq
as an attribute to a timestamp. I could then change the costs_USD
function to something like this:
def costs_USD2(ts, costs_USD_per_h):
if hasattr(ts, "old_freq"): # or catch AttributeError
freq = ts.old_freq
else:
freq = ts.freq
timedelta = (ts + freq) - ts
hours = timedelta.total_seconds() / 3600
return costs_USD_per_h * hours
And then (A) would still work, and for (B) I could do some prep-work before calling:
ts = costs.index[0]
ts.old_freq = costs.index.freq
costs_in_january = costs_USD2(ts, costs.USD_per_h[0]) # No FutureWarning
However, (C) would not work anymore, and in general, I'd need to remember to add the old_freq
attribute whenever I store an index value to a separate variable or deal with individual timestamps.
@rwijtvliet i appreciate the discussion but freq is being removed partially because of the added complexity
i don't know what to tell you except try interval index
Yes, I understand and appreciate that, @jreback. Having a freq
attribute for an individual timestamp makes no sense, so it makes sense to deprecate it. And as I wrote, I am aware that I'm using timestamps for something they are not meant for - namely, for time periods - and I'd be more than willing to refactor and use something that is better suited.
However, there is no alternative available that is better suited.
-
The
Period
class does not (and will not) have a notion of a timezone. It has the incorrect timedelta in some months (like March and Oct in my example), and resampling aPeriod
object representing a year, into hours, returns aPeriodIndex
that contains a non-existing hour (the 3rd hour of the final Sunday in March) and misses an existing hour (the 4th hour of the final Sunday in October). This was the entire point of me opening this thread. -
The
Interval
object has no notion offreq
(or any way to answer the question "would we call this a month, a year, a day, a quarter, or an hour"), and cannot be resampled, just to name a major issue.
I guess what surprises me the most is that, from your comments, I'm apparently the only person that needs this functionality. But if that's how it is, and you also see no options for me, I'll have to get creative and add workarounds. I just wanted to make sure that that is really how it is, before I start - because I really doubt it.
Thanks for your time and input; if you have any questions about my use case or if I can help clarify things, feel free to contact me directly or in this thread.
If pinning an old_freq attr isn't viable, can you make the timezones unnecessary? Could tz_convert to UTC, then do everything with Periods.
Sorry didn't mean to close
(damn I shouldn't be doing this on my phone 😬)
Thanks @jbrockmendel for your continued constructive input.
I thought about that as well, but I see many problems there, too. E.g.
- can't aggregate (downsample) because incorrect periods are grouped;
- local time is important. Something that happens 12:00-13:00 every day is shifted by one hour during half of the year when viewed in UTC;
- also, duration is important. Just like it's an unacceptable simplification to treat every month as having 30 days, so is it (in my use case) to treat every day as having 24 hours.
If I were interested in trying to write up this functionality for a timezone-aware Period
(let's call it a Zeriod
unless someone has a better idea), and the associated Index (ZeriodIndex
) myself, how would I go about it?
I'm thinking about wrappers around the Timestamp
and DatetimeIndex
classes, with some restrictions on the allowed frequencies and custom attributes.
The ZeriodIndex
should be usable as index for Series
and DataFrame
instances, and these objects should be .resample
able and .groupby
able, just as when they have e g. a DatetimeIndex
.
It's probably a lot more involved than I'm naively thinking, and it'll be a while before I could actually start on it. But that gives us some time to discuss if it's possible and worth doing - I would appreciate some transparent opinions and open discussion on this.
discuss if it's possible and worth doing
Surely possible, but daunting.
Going back to the old_freq idea, what if you just patched DatetimeIndex's __getitem__
and __iter__
to do:
if not hasattr(ts, "freq"): # i.e. deprecation has been enforced
ts.freq = self.freq
return ts
That's an interesting suggestion that I hadn't thought about. This does indeed work:
import pandas as pd
def apply_wrapper(Index):
if hasattr(Index, '_iswrapped'):
return # Avoid wrapping multiple times
getitem_original = Index.__getitem__
def getitem_wrapped(self, item):
ts = getitem_original(self, item)
if not hasattr(ts, 'freq'):
ts.freq = self.freq
return ts
Index.__getitem__ = getitem_wrapped
# ToDo: add similar thing for `__iter__`
Index._iswrapped = True
apply_wrapper(pd.DatetimeIndex)
I still want to make that Zeriod
class, but I'm glad that there is a work-around that makes this less urgent. Many thanks @jbrockmendel !
If we ever decide to implement a timezone-aware period, I think the way to go is to inherit from pandas.Interval
and add implementations for all the various named 'freq' durations. Perhaps it even makes sense to have separate inherited classes for each 'freq', making the new periods open to extension.
Relying on pandas.Interval
has the advantage that this class already supports calculating duration (i.e. as a timedelta), checking if a time falls inside, as well as various ways to deal with open/closed intervals.