polars icon indicating copy to clipboard operation
polars copied to clipboard

`dt.truncate` issue

Open westandskif opened this issue 2 years ago • 2 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

It feels like there's a problem with the different bases of modulo operations for months and seconds (link). I stumbled across it because I was introducing similar functionality into my own lib.

Truncating to months uses, let's say, 0000-12-31 (Sun) date as a base, while truncating to days uses 1970-01-01 as a base (Thu). So, e.g. truncating days of any multiple of 7 yields Thursdays just because of the beginning of Linux epoch.

Please, find the most generic example I managed to come up with, to illustrate the problem of different bases:

Reproducible example

import polars as pl
from datetime import date

dt = date(1970, 1, 4)

assert (
    pl.DataFrame({"a": [dt]})
    .with_column(pl.col("a").dt.truncate(f"{dt.toordinal()}d"))
    .to_dicts()
) == [{'a': date(1970, 1, 1)}]
# the date shouldn't have changed here as we used the exact number of days as a base

Expected behavior

import polars as pl
from datetime import date

dt = date(1970, 1, 4)

assert (
    pl.DataFrame({"a": [dt]})
    .with_column(pl.col("a").dt.truncate(f"{dt.toordinal()}d"))
    .to_dicts()
) == [{'a': date(1970, 1, 4)}]

Installed versions

---Version info---
Polars: 0.15.9
Index type: UInt32
Platform: macOS-12.6-arm64-arm-64bit
Python: 3.9.12 (main, Oct 17 2022, 16:06:02)
[Clang 14.0.0 (clang-1400.0.29.201)]
---Optional dependencies---
pyarrow: 10.0.1
pandas: <not installed>
numpy: 1.24.1
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: <not installed>

westandskif avatar Dec 30 '22 21:12 westandskif

I don't know what should be done here. What do you propose?

ritchie46 avatar Dec 31 '22 13:12 ritchie46

I don't have specific suggestions at the moment, I'm still thinking about what to introduce into my library, so I have only thoughts:

  • weeks are confusing, so probably I'll replace them with sun/mon/.../sat. This way the first day of the week is obvious.
  • everything should use same base: Sunday, 0000-12-31 ("zero" of Monday, 0001-01-01)
  • nanoseconds don't seem to be convenient to use since these don't fit into i64
In [32]: math.log2(1e9 * 86400 * 365 * 3000)
Out[32]: 66.35859598507508
  • fortunately python's datetime doesn't support nanoseconds, only microseconds which fit into i64 -- this can work as the smallest unit:
In [34]: math.log2(1e6 * 86400 * 365 * 3000)
Out[34]: 56.39281170041299

westandskif avatar Jan 02 '23 15:01 westandskif

hi @westandskif , thanks for the report

Truncating to months uses, let's say, 0000-12-31 (Sun) date as a base

I'm not really sure what you mean, but truncating by month goes to the beginning of the month:

In [4]: pl.Series([datetime(2020, 10, 25)]).dt.truncate('1mo')
Out[4]:
shape: (1,)
Series: '' [datetime[μs]]
[
        2020-10-01 00:00:00
]

You can use offset if the default beginning of the window doesn't suit your use case

MarcoGorelli avatar Jun 25 '23 20:06 MarcoGorelli

hi @MarcoGorelli , let's probably close this one as I don't really need anything from it.

I just pointed at different beginnings of month and day truncate operations, which breaks the code it shouldn't, as far as I understand (see the one with toordinal, if needed). To me it is like broken maths to have to different zeros

westandskif avatar Jun 25 '23 21:06 westandskif

sure, closing then

I don't have specific suggestions at the moment

feel free to ping if you do have specific suggestions, and this can be reopened

MarcoGorelli avatar Jun 26 '23 10:06 MarcoGorelli