polars
polars copied to clipboard
Different output between Polars and Pandas using date_range with interval="1mo"
Polars version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Hi, I am trying to create a dataframe with date_range and an interval "1mo" but after the month of February, the last day for all the months is the same, the 29th, instead of the one that would correspond to each month. I have tried the same with Pandas and here the result is correct.

Reproducible example
import polars as pl
from datetime import date
start = date(2020,1,31)
stop = date(2021,1,31)
df_pl = (
pl.DataFrame(
{
"date":pl.date_range(start,stop,interval="1mo")
}
)
)
Expected behavior
date
2020-01-31
2020-02-29
2020-03-31
2020-04-30
2020-05-31
...
This Pandas code generate the correct output:
import pandas as pd
from datetime import date
start = date(2020,1,31)
stop = date(2021,1,31)
df_pd =(
pd.DataFrame(
{
"date":pd.date_range(start, stop, freq='M')
}
)
)
Installed versions
---Version info---
Polars: 0.14.26
Index type: UInt32
Platform: macOS-12.4-x86_64-i386-64bit
Python: 3.9.7 (default, Sep 1 2021, 18:01:29)
[Clang 12.0.0 (clang-1200.0.32.2)]
---Optional dependencies---
pyarrow: 10.0.0
pandas: 1.5.1
numpy: 1.23.4
fsspec: <not installed>
connectorx: 0.3.1
xlsx2csv: 0.8
matplotlib: <not installed>
This is where dates get annoying/weird. Why oh why did they give each month a different number of days? What is one month after February 28th? March 28th or March 31st? Is it the same thing as one month after February 29th? If the former, what is one month before March 31st?
We have ill-defined concepts of months in our heads that don't have a great transformation. I think the idea that's in most peoples' heads is covered by these two rules:
- If the start day of a sequence is the last day of the month, then the user intends for
1moto respect the "last day of the month" rule, regardless of how many days are in the current month. So if the start date is2020-02-29, then the next item would be2020-03-31. If the start date is2021-02-28, the next item would be2021-03-31. - If the start day of a sequence is any other number, then
1momeans "the same day of the month next month, unless that number doesn't fit into the month, in which case the end of the month" So for example, if your start date is2022-01-30, then we'd get2022-02-29,2022-03-30, etc.
I agree. But all the references that I have found to "1mo" refer to calendar month. For example: https://docs.rs/polars/latest/polars/prelude/trait.PolarsUpsample.html
@tomaslucas the point is that "calendar month" is not well-defined duration. The same is true for calendar years due to leap years/seconds but it's generally less of an issue.
Still, I agree that the behaviour of Polars is peculiar. It's not what I would expect. I think we can do better here, although as you say there might not be an obvious best solution.
What would make sense to me is to simply increment the month, and keep the day the same, bounded by the maximum day available in that month. Then you'll get the pandas output.
I think a special eom function would work. Alternatively, you could do a series of day-1 dates by month, and then subtract one day, to get the end of month.
I think it's important to point out that pd.date_range(date(2020,1,15), date(2020,6,15), freq="M") returns
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30', '2020-05-31'], dtype='datetime64[ns]', freq='M')
so while that output is great if your input parameters are the last days of the month, it, in my opinion, loses its crown for intuitive output quite easily.
If what you want is the last day of every month then probably better to just start big and then filter down.
ie.
df_pl = (
pl.DataFrame(
{
"date":pl.date_range(date(2020,1,31),date(2021,1,31),interval="1d")
}
)
).filter((pl.col('date')+pl.duration(days=1)).dt.month()!=pl.col('date').dt.month())
First off, in pandas, 'M' (confusingly) means 'month end ', so that's why it always goes to the last day of the month.
If you had
In [3]: import pandas as pd
...: from datetime import date
...:
...: start = date(2020,1,15)
...: stop = date(2021,1,15)
...: df_pd =(
...: pd.DataFrame(
...: {
...: "date":pd.date_range(start, stop, freq='M')
...: }
...: )
...: )
then you'd also get
In [4]: df_pd
Out[4]:
date
0 2020-01-31
1 2020-02-29
2 2020-03-31
3 2020-04-30
4 2020-05-31
5 2020-06-30
6 2020-07-31
7 2020-08-31
8 2020-09-30
9 2020-10-31
10 2020-11-30
11 2020-12-31
, which I'd bet is unexpected to most readers. There is an issue in pandas about renaming 'M' to 'ME', but I'm just explaining why this snippet appears to magically just work in pandas (when in reality 'M' doesn't mean what people think it does).
Back to the problem at hand. In this case, you can achieve your desired output with just a single line of code:
pl.date_range(date(2020, 1, 1), date(2021, 2, 1), '1mo', eager=True).dt.month_end()
To avoid surprises, how about raising if adding a month offset can't keep the same day-of-the-month? This way:
- if someone wanted the same day of the month, then that's impossible, so better to have the code explode in their face straight away
- if someone wanted the last day of the month, then this wouldn't give it to them anyway, and it would again be better for the code to explode in their face. After all, it's a very easy workaround if you do want the last day of each month, as shown above
Reopening as this has come up again in https://github.com/pola-rs/polars/pull/12301
Maybe the date_range algorithm could be
while t < end:
t = start + i*offset;
ts.push(t)
i += 1;
rather than
t = start
while t < end:
t = t + offset;
ts.push(t)
Will think about this, maybe it'd be OK