polars
polars copied to clipboard
`dt.truncate` issue
Polars version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of Polars.
Issue description
It feels like there's a problem with the different bases of modulo operations for months and seconds (link). I stumbled across it because I was introducing similar functionality into my own lib.
Truncating to months uses, let's say, 0000-12-31 (Sun) date as a base, while truncating to days uses 1970-01-01 as a base (Thu). So, e.g. truncating days of any multiple of 7 yields Thursdays just because of the beginning of Linux epoch.
Please, find the most generic example I managed to come up with, to illustrate the problem of different bases:
Reproducible example
import polars as pl
from datetime import date
dt = date(1970, 1, 4)
assert (
pl.DataFrame({"a": [dt]})
.with_column(pl.col("a").dt.truncate(f"{dt.toordinal()}d"))
.to_dicts()
) == [{'a': date(1970, 1, 1)}]
# the date shouldn't have changed here as we used the exact number of days as a base
Expected behavior
import polars as pl
from datetime import date
dt = date(1970, 1, 4)
assert (
pl.DataFrame({"a": [dt]})
.with_column(pl.col("a").dt.truncate(f"{dt.toordinal()}d"))
.to_dicts()
) == [{'a': date(1970, 1, 4)}]
Installed versions
---Version info---
Polars: 0.15.9
Index type: UInt32
Platform: macOS-12.6-arm64-arm-64bit
Python: 3.9.12 (main, Oct 17 2022, 16:06:02)
[Clang 14.0.0 (clang-1400.0.29.201)]
---Optional dependencies---
pyarrow: 10.0.1
pandas: <not installed>
numpy: 1.24.1
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: <not installed>
I don't know what should be done here. What do you propose?
I don't have specific suggestions at the moment, I'm still thinking about what to introduce into my library, so I have only thoughts:
- weeks are confusing, so probably I'll replace them with sun/mon/.../sat. This way the first day of the week is obvious.
- everything should use same base: Sunday, 0000-12-31 ("zero" of Monday, 0001-01-01)
- nanoseconds don't seem to be convenient to use since these don't fit into i64
In [32]: math.log2(1e9 * 86400 * 365 * 3000)
Out[32]: 66.35859598507508
- fortunately python's datetime doesn't support nanoseconds, only microseconds which fit into i64 -- this can work as the smallest unit:
In [34]: math.log2(1e6 * 86400 * 365 * 3000)
Out[34]: 56.39281170041299
hi @westandskif , thanks for the report
Truncating to months uses, let's say, 0000-12-31 (Sun) date as a base
I'm not really sure what you mean, but truncating by month goes to the beginning of the month:
In [4]: pl.Series([datetime(2020, 10, 25)]).dt.truncate('1mo')
Out[4]:
shape: (1,)
Series: '' [datetime[μs]]
[
2020-10-01 00:00:00
]
You can use offset if the default beginning of the window doesn't suit your use case
hi @MarcoGorelli , let's probably close this one as I don't really need anything from it.
I just pointed at different beginnings of month and day truncate operations, which breaks the code it shouldn't, as far as I understand (see the one with toordinal, if needed). To me it is like broken maths to have to different zeros
sure, closing then
I don't have specific suggestions at the moment
feel free to ping if you do have specific suggestions, and this can be reopened