polars icon indicating copy to clipboard operation
polars copied to clipboard

Different output between Polars and Pandas using date_range with interval="1mo"

Open tomaslucas opened this issue 3 years ago • 6 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

Hi, I am trying to create a dataframe with date_range and an interval "1mo" but after the month of February, the last day for all the months is the same, the 29th, instead of the one that would correspond to each month. I have tried the same with Pandas and here the result is correct.

Screenshot 2022-11-07 at 17 35 08

Reproducible example

import polars as pl
from datetime import date

start = date(2020,1,31)
stop = date(2021,1,31)
df_pl = (
    pl.DataFrame(
        {
            "date":pl.date_range(start,stop,interval="1mo")
        }
    )
)

Expected behavior

date
2020-01-31
2020-02-29
2020-03-31
2020-04-30
2020-05-31

...

This Pandas code generate the correct output:

import pandas as pd
from datetime import date

start = date(2020,1,31)
stop = date(2021,1,31)
df_pd =(
    pd.DataFrame(
        {
            "date":pd.date_range(start, stop, freq='M')
        }
    )
)

Installed versions

---Version info---
Polars: 0.14.26
Index type: UInt32
Platform: macOS-12.4-x86_64-i386-64bit
Python: 3.9.7 (default, Sep  1 2021, 18:01:29) 
[Clang 12.0.0 (clang-1200.0.32.2)]
---Optional dependencies---
pyarrow: 10.0.0
pandas: 1.5.1
numpy: 1.23.4
fsspec: <not installed>
connectorx: 0.3.1
xlsx2csv: 0.8
matplotlib: <not installed>

tomaslucas avatar Nov 07 '22 16:11 tomaslucas

This is where dates get annoying/weird. Why oh why did they give each month a different number of days? What is one month after February 28th? March 28th or March 31st? Is it the same thing as one month after February 29th? If the former, what is one month before March 31st?

We have ill-defined concepts of months in our heads that don't have a great transformation. I think the idea that's in most peoples' heads is covered by these two rules:

  1. If the start day of a sequence is the last day of the month, then the user intends for 1mo to respect the "last day of the month" rule, regardless of how many days are in the current month. So if the start date is 2020-02-29, then the next item would be 2020-03-31. If the start date is 2021-02-28, the next item would be 2021-03-31.
  2. If the start day of a sequence is any other number, then 1mo means "the same day of the month next month, unless that number doesn't fit into the month, in which case the end of the month" So for example, if your start date is 2022-01-30, then we'd get 2022-02-29, 2022-03-30, etc.

mcrumiller avatar Nov 07 '22 21:11 mcrumiller

I agree. But all the references that I have found to "1mo" refer to calendar month. For example: https://docs.rs/polars/latest/polars/prelude/trait.PolarsUpsample.html

tomaslucas avatar Nov 08 '22 06:11 tomaslucas

@tomaslucas the point is that "calendar month" is not well-defined duration. The same is true for calendar years due to leap years/seconds but it's generally less of an issue.

mcrumiller avatar Nov 08 '22 14:11 mcrumiller

Still, I agree that the behaviour of Polars is peculiar. It's not what I would expect. I think we can do better here, although as you say there might not be an obvious best solution.

What would make sense to me is to simply increment the month, and keep the day the same, bounded by the maximum day available in that month. Then you'll get the pandas output.

stinodego avatar Nov 11 '22 15:11 stinodego

I think a special eom function would work. Alternatively, you could do a series of day-1 dates by month, and then subtract one day, to get the end of month.

mcrumiller avatar Nov 11 '22 15:11 mcrumiller

I think it's important to point out that pd.date_range(date(2020,1,15), date(2020,6,15), freq="M") returns

DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30', '2020-05-31'], dtype='datetime64[ns]', freq='M')

so while that output is great if your input parameters are the last days of the month, it, in my opinion, loses its crown for intuitive output quite easily.

If what you want is the last day of every month then probably better to just start big and then filter down.

ie.


    df_pl = (
    pl.DataFrame(
        {
            "date":pl.date_range(date(2020,1,31),date(2021,1,31),interval="1d")
        }
    )
).filter((pl.col('date')+pl.duration(days=1)).dt.month()!=pl.col('date').dt.month())

deanm0000 avatar Nov 14 '22 20:11 deanm0000

First off, in pandas, 'M' (confusingly) means 'month end ', so that's why it always goes to the last day of the month.

If you had

In [3]: import pandas as pd
   ...: from datetime import date
   ...:
   ...: start = date(2020,1,15)
   ...: stop = date(2021,1,15)
   ...: df_pd =(
   ...:     pd.DataFrame(
   ...:         {
   ...:             "date":pd.date_range(start, stop, freq='M')
   ...:         }
   ...:     )
   ...: )

then you'd also get

In [4]: df_pd
Out[4]:
         date
0  2020-01-31
1  2020-02-29
2  2020-03-31
3  2020-04-30
4  2020-05-31
5  2020-06-30
6  2020-07-31
7  2020-08-31
8  2020-09-30
9  2020-10-31
10 2020-11-30
11 2020-12-31

, which I'd bet is unexpected to most readers. There is an issue in pandas about renaming 'M' to 'ME', but I'm just explaining why this snippet appears to magically just work in pandas (when in reality 'M' doesn't mean what people think it does).


Back to the problem at hand. In this case, you can achieve your desired output with just a single line of code:

pl.date_range(date(2020, 1, 1), date(2021, 2, 1), '1mo', eager=True).dt.month_end()

To avoid surprises, how about raising if adding a month offset can't keep the same day-of-the-month? This way:

  • if someone wanted the same day of the month, then that's impossible, so better to have the code explode in their face straight away
  • if someone wanted the last day of the month, then this wouldn't give it to them anyway, and it would again be better for the code to explode in their face. After all, it's a very easy workaround if you do want the last day of each month, as shown above

MarcoGorelli avatar Apr 08 '23 15:04 MarcoGorelli

Reopening as this has come up again in https://github.com/pola-rs/polars/pull/12301

Maybe the date_range algorithm could be

while t < end:
    t = start + i*offset;
    ts.push(t)
    i += 1;

rather than

t = start
while t < end:
    t = t + offset;
    ts.push(t)

Will think about this, maybe it'd be OK

MarcoGorelli avatar Nov 08 '23 09:11 MarcoGorelli