polars-xdt
Add `ignore_nulls` (and `ignore_nan`) option to `xdt.ewma_by_time`
Currently, if the value column passed to `xdt.ewma_by_time` contains any NaN values, all subsequent values in the output are NaN (see the reproducible snippet below). It would be great to have an `ignore_nulls` flag, similar to the one in the builtin `ewma`, so that NaN or null values are ignored during the calculation. With that flag set, the presence or absence of a row containing null or NaN should have no effect on subsequent rows; i.e. the `ewma`-ed output of the final row of the two following tables should be identical.
shape: (2, 2)
┌───────────┬────────────────────────────┐
│ values    ┆ time                       │
│ ---       ┆ ---                        │
│ f64       ┆ datetime[ns]               │
╞═══════════╪════════════════════════════╡
│ -0.042898 ┆ 2000-01-01 00:00:00        │
│ 0.186466  ┆ 2000-01-01 00:00:00.000002 │
└───────────┴────────────────────────────┘
shape: (3, 2)
┌───────────┬────────────────────────────┐
│ values    ┆ time                       │
│ ---       ┆ ---                        │
│ f64       ┆ datetime[ns]               │
╞═══════════╪════════════════════════════╡
│ -0.042898 ┆ 2000-01-01 00:00:00        │
│ NaN       ┆ 2000-01-01 00:00:00.000001 │
│ 0.186466  ┆ 2000-01-01 00:00:00.000002 │
└───────────┴────────────────────────────┘
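To make the requested behaviour concrete, here is a minimal pure-Python sketch of a time-weighted EWMA that skips missing values. It is not the polars-xdt implementation. It assumes the recursion `y_i = alpha_i * x_i + (1 - alpha_i) * y_{i-1}` with `alpha_i = 1 - exp(-ln(2) * dt_i / half_life)`, and it assumes that `dt_i` is measured from the last valid observation and that a skipped row outputs `None`. The function name `ewma_by_time_skip_missing` is hypothetical.

```python
import math
from datetime import datetime, timedelta


def ewma_by_time_skip_missing(values, times, half_life):
    """Time-weighted EWMA that skips None/NaN observations (illustrative only)."""
    out = []
    prev_y = None  # last valid smoothed value
    prev_t = None  # timestamp of the last valid observation
    for x, t in zip(values, times):
        if x is None or math.isnan(x):
            # Missing observation: emit None and leave the running state untouched.
            out.append(None)
            continue
        if prev_y is None:
            # First valid observation seeds the average.
            y = x
        else:
            # Elapsed time since the last valid row, in units of the half-life.
            dt = (t - prev_t) / half_life
            alpha = 1.0 - math.exp(-math.log(2) * dt)
            y = alpha * x + (1.0 - alpha) * prev_y
        out.append(y)
        prev_y, prev_t = y, t
    return out


# The two tables above, with a half-life of one microsecond.
t0 = datetime(2000, 1, 1)
us = timedelta(microseconds=1)
without_nan = ewma_by_time_skip_missing(
    [-0.042898, 0.186466], [t0, t0 + 2 * us], us
)
with_nan = ewma_by_time_skip_missing(
    [-0.042898, float("nan"), 0.186466], [t0, t0 + us, t0 + 2 * us], us
)
# The NaN row does not change the final smoothed value.
assert without_nan[-1] == with_nan[-1]
```

Because the state is only updated on valid rows, the final value depends solely on the valid observations and their timestamps, which is the invariant described above.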
Reproducible snippet
from datetime import timedelta
import numpy as np
import polars as pl
import polars_xdt as xdt
n = 100
df = pl.DataFrame({
    "values": np.linspace(0, 10, n) + 0.1 * np.random.normal(size=n),
    "time": np.datetime64("2000-01-01 00:00:00")
    + np.asarray([i * np.timedelta64(1000, "ns") for i in range(n)]),
})
new = df.with_columns(xdt.ewma_by_time("values", times="time", half_life=timedelta(microseconds=1)).alias("ewma"))
# True
print(new["ewma"].is_finite().all())
new_with_nan = df.with_columns(xdt.ewma_by_time(
    pl.when(pl.col("values") > 5).then(np.nan).otherwise(pl.col("values")),
    times="time",
    half_life=timedelta(microseconds=1),
).alias("ewma"))
# False
print(new_with_nan["ewma"].is_finite().all())