polars
polars copied to clipboard
Business date/calendar logic
Problem description
It would nice to natively support certain business date operations. In particular, I envision polars expression counterparts versions of the vectorized numpy busday_offset, is_busday, busday_count functions. The numpy versions of these functions accept weekmask and holidays arguments (or a busdaycalendar object, which just stores a weekmask and list of holidays).
Given a dataframe with a date column, you could do something like
weekmask = "1111100"
holidays = [datetime.date(2000, 1, 1)]
df.with_columns(
[
pl("date").busday_offset(3, weekmask=weekmask, holidays=holidays),
pl("date").is_busday(weekmask=weekmask, holidays=holidays),
pl.busday_count(
pl.col("date"),
datetime.date(2000, 1, 2),
weekmask=weekmask,
holidays=holidays
)
]
)
It would also be nice to have a bdate_range function/expression so you could get all business dates between two columns by
df.select(
pl.bdate_range(pl.col("start_date"), pl.col("end_date"), weekmask=weekmask, holidays=holidays)
)
Related issue, but specific for upsampling: #5516.
+1 to have these features.
Also to note that this answer in this stackoverflow answer does not seem to do the right thing either for np.busday_count.
import polars as pl
import numpy as np
df = pl.DataFrame(
{
"Day1": [
"2022-01-02",
"2022-01-03",
"2022-01-04",
],
"Day2": [
"2022-01-03",
"2022-01-04",
"2022-01-05",
],
}
).with_columns(pl.col(["Day1", "Day2"]).str.strptime(pl.Date, "%Y-%m-%d"))
print(
(
df.with_columns(
pl.struct([pl.col("Day1"), pl.col("Day2")])
.map(
lambda x: np.busday_count(
x.struct["Day1"], x.struct["Day2"], weekmask="1110000"
)
)
.alias("Result")
)
)
)
shape: (6, 5)
┌────────┬──────────┬────────────┬────────────┬─────────────┐
│ Market ┆ Service ┆ Day1 ┆ Day2 ┆ Result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ date ┆ date ┆ list[i64] │
╞════════╪══════════╪════════════╪════════════╪═════════════╡
│ AT ┆ Standard ┆ 2022-01-02 ┆ 2022-01-03 ┆ [0, 1, … 0] │
│ DE ┆ Express ┆ 2022-01-03 ┆ 2022-01-04 ┆ [0, 1, … 0] │
│ AT ┆ Standard ┆ 2022-01-04 ┆ 2022-01-05 ┆ [0, 1, … 0] │
│ CZ ┆ Standard ┆ 2022-01-05 ┆ 2022-01-06 ┆ [0, 1, … 0] │
│ GB ┆ Standard ┆ 2022-01-06 ┆ 2022-01-07 ┆ [0, 1, … 0] │
│ CZ ┆ Standard ┆ 2022-01-07 ┆ 2022-01-08 ┆ [0, 1, … 0] │
└────────┴──────────┴────────────┴────────────┴─────────────┘
And passing the expressions directly into the numpy functions (e.g. np.busday_count) does not work either.
df.select(np.busday_count(pl.col("Day1"), pl.col("Day2"), weekmask="1110000"))
ValueError: Could not convert object to NumPy datetime
The error you are seeing is a Numpy limitation, busday_count is not a ufunc, see the ufunc list of functions.
Also, right now we do not support ufunc's with more than one expression. The recommendation is to use pl.reduce instead, although the naive way generates a single row list[i64] dataframe (despite the docs claiming it works horizontally):
>>> df.select(pl.reduce(lambda dt1, dt2: np.busday_count(dt1, dt2, weekmask="1110000"), [pl.col('Day1'), pl.col('Day2')]))
shape: (1, 1)
┌───────────┐
│ Day1 │
│ --- │
│ list[i64] │
╞═══════════╡
│ [0, 1, 1] │
└───────────┘
Wrapping in a Series solves this:
>>> df.select(pl.reduce(lambda dt1, dt2: pl.Series(np.busday_count(dt1, dt2, weekmask="1110000")), [pl.col('Day1'), pl.col('Day2')]))
shape: (3, 1)
┌──────┐
│ Day1 │
│ --- │
│ i64 │
╞══════╡
│ 0 │
│ 1 │
│ 1 │
└──────┘
@zundertj Thanks for the suggestion! That's a good workaround until there is native support for business date operations.
I would also expect some support with aliases
In Expr.dt.offset_by we could then use 1b for 1 business day offset or 1bh for a business hour. Such as in pandas. https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
Starting with adding '1bd' and '1bh' as offset aliases sounds good! weekmask and holidays can come after
Anyone here interested in giving this one a go?
EDIT: this may require more discussion https://github.com/pola-rs/polars/issues/11568#issuecomment-1750969803
taking this forwards in https://github.com/MarcoGorelli/polars-business , let's take requests and ideas over there
@erinov1 just FYI, all the requests from your example are now available:
import polars as pl
import polars_business as plb
from datetime import date
weekend = ["Sat", "Sun"]
holidays = [date(2000, 1, 1)]
df = pl.DataFrame(
{
"start_date": [date(2000, 3, 1), date(2000, 4, 3)],
"end_date": [date(2000, 3, 3), date(2000, 4, 19)],
}
)
print(
df.with_columns(
start_plus_3bd=plb.col("start_date").bdt.offset_by(
"3bd", weekend=weekend, holidays=holidays
),
start_is_workday=plb.col("start_date").bdt.is_workday(
weekend=weekend, holidays=holidays
),
workday_count=plb.workday_count(
"start_date",
"end_date",
weekend=weekend,
holidays=holidays,
),
)
)
shape: (2, 5)
┌────────────┬────────────┬────────────────┬──────────────────┬───────────────┐
│ start_date ┆ end_date ┆ start_plus_3bd ┆ start_is_workday ┆ workday_count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ date ┆ date ┆ bool ┆ i32 │
╞════════════╪════════════╪════════════════╪══════════════════╪═══════════════╡
│ 2000-03-01 ┆ 2000-03-03 ┆ 2000-03-06 ┆ true ┆ 2 │
│ 2000-04-03 ┆ 2000-04-19 ┆ 2000-04-06 ┆ true ┆ 12 │
└────────────┴────────────┴────────────────┴──────────────────┴───────────────┘