polars
polars copied to clipboard
Upsampling unexpectedly discards higher frequency data
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
from datetime import datetime as dt, timedelta
import io
import polars as pl
import polars.testing
def test_upsample() -> None:
df = pl.from_records(
[
(dt(2021, 1, 1) + t, val)
for (t, val) in [
(timedelta(minutes=0), 1),
(timedelta(minutes=1), 1),
]
],
schema={"t": pl.Datetime("ms", "UTC"), "val": pl.Int32()},
).sort("t")
upsampled = df.upsample("t", every="120s")
print(upsampled)
# shape: (1, 2)
# ┌─────────────────────────┬─────┐
# │ t ┆ val │
# │ --- ┆ --- │
# │ datetime[ms, UTC] ┆ i32 │
# ╞═════════════════════════╪═════╡
# │ 2021-01-01 00:00:00 UTC ┆ 1 │
# └─────────────────────────┴─────┘
print(df)
# shape: (2, 2)
# ┌─────────────────────────┬─────┐
# │ t ┆ val │
# │ --- ┆ --- │
# │ datetime[ms, UTC] ┆ i32 │
# ╞═════════════════════════╪═════╡
# │ 2021-01-01 00:00:00 UTC ┆ 1 │
# │ 2021-01-01 00:01:00 UTC ┆ 1 │
# └─────────────────────────┴─────┘
pl.testing.assert_frame_equal(upsampled, df)
if __name__ == "__main__":
test_upsample()
Log output
> env POLARS_VERBOSE=1 poetry run python demo.py
shape: (1, 2)
┌─────────────────────────┬─────┐
│ t ┆ val │
│ --- ┆ --- │
│ datetime[ms, UTC] ┆ i32 │
╞═════════════════════════╪═════╡
│ 2021-01-01 00:00:00 UTC ┆ 1 │
└─────────────────────────┴─────┘
shape: (2, 2)
┌─────────────────────────┬─────┐
│ t ┆ val │
│ --- ┆ --- │
│ datetime[ms, UTC] ┆ i32 │
╞═════════════════════════╪═════╡
│ 2021-01-01 00:00:00 UTC ┆ 1 │
│ 2021-01-01 00:01:00 UTC ┆ 1 │
└─────────────────────────┴─────┘
Traceback (most recent call last):
File "/home/jarrad/src/sandpit/pl-upsample/demo.py", line 46, in <module>
test_upsample()
File "/home/jarrad/src/sandpit/pl-upsample/demo.py", line 42, in test_upsample
pl.testing.assert_frame_equal(upsampled, df)
File "/home/jarrad/src/sandpit/pl-upsample/.venv/lib/python3.10/site-packages/polars/testing/asserts/frame.py", line 104, in assert_frame_equal
raise_assertion_error(
File "/home/jarrad/src/sandpit/pl-upsample/.venv/lib/python3.10/site-packages/polars/testing/asserts/utils.py", line 12, in raise_assertion_error
raise AssertionError(msg) from cause
AssertionError: DataFrames are different (number of rows does not match)
[left]: 1
[right]: 2
Issue description
When running upsample, existing data at a higher frequency than the upsample target seems to be discarded.
E.g. in the example, my dataframe has two rows at 00:00 and 00:01, and I upsample to 2m (same result with 1h, 61s, ...).
After upsampling, the second row has been discarded.
Expected behavior
Maybe naïvely, I expected existing higher-frequency data to be preserved.
If this behaviour from polars is intentional, then I'd at least have expected to see something in the upsample docs like
Data in
dfat a higher frequency thaneverywill be discarded.
or at a minimum
Upsample's behaviour on data at a higher frequency thaneveryis undefined and subject to change in future versions of Polars.
Thanks for your work maintaining Polars!
Why are you upsampling to a lower frequency than your data, are you stupid?
For context, my situation here was that I have data that in a mix of daily and hourly. There are missing records in both parts. I was attempting to upsample to daily first, because I wanted to treat missing records here differently to missing hourly records. However, this got me to hit this bug/unexpected behaviour, because upsampling to daily chucked away some of the hourly records.
Installed versions
--------Version info---------
Polars: 0.20.7
Index type: UInt32
Platform: Linux-6.6.10-76060610-generic-x86_64-with-glibc2.35
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
gevent: <not installed>
hvplot: <not installed>
matplotlib: <not installed>
numpy: <not installed>
openpyxl: <not installed>
pandas: <not installed>
pyarrow: <not installed>
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
I don't know for sure but I suspect it is intended behavior. This way, after running upsample you can rely on the spacing from row to row being consistent. I think you need to do something like this:
df.join(
df.upsample("t", every="120s"),
on='t', how='outer_coalesce'
)
Not sure on the intended behaviour, but this is what is happening:
upsample generates the range and performs a left join.
https://github.com/pola-rs/polars/blob/6ab6550ad476227cca550cd24a692d1a00a8c937/crates/polars-time/src/upsample.rs#L177-L193
The range in this case just returns the start value (as the interval exceeds the range)
df.select(
range = pl.date_range(pl.col("t").min(), pl.col("t").max(), interval="120s")
)
# shape: (1, 1)
# ┌─────────────────────────┐
# │ range │
# │ --- │
# │ datetime[ms, UTC] │
# ╞═════════════════════════╡
# │ 2021-01-01 00:00:00 UTC │
# └─────────────────────────┘
I guess there are two schools of thought.
- Add a parameter to upsample to do an outer join instead of left
- Shrug and suggest people do the full join themselves.
Doing #2 is certainly better than what I said above which does 2 joins when you only need 1.
I have the same issue.
One correction: it is not discarding only existing higher frequency, in fact it discards any data, which are not matching the target sampling frequency.
It is surprising and I would invite:
- either the documentation
- explicitly explaining this behaviour
- and providing an example how to succeed
- or the upsample method behave more as expected by many
My attempt to rewrite my code from pandas to polars failed on this and shows, that in this regard (resampling) pandas is providing better service. (I am aware pandas is very long with us and I really love polars)
I have the same issue.
One correction: it is not discarding only existing higher frequency, in fact it discards any data, which are not matching the target sampling frequency.
It is surprising and I would invite:
either the documentation
- explicitly explaining this behaviour
- and providing an example how to succeed
or the upsample method behave more as expected by many
My attempt to rewrite my code from pandas to polars failed on this and shows, that in this regard (resampling) pandas is providing better service. (I am aware pandas is very long with us and I really love polars)
Yeah in my case, i have low frequency data, but because the points do not perfectly align it strait up discards data. Very misleading function!