
Add `ignore_nulls` (and `ignore_nan`) option to `xdt.ewma_by_time`

Open wbeardall opened this issue 11 months ago • 7 comments

Currently, if the value column passed to `xdt.ewma_by_time` contains any NaN values, all subsequent values in the output are NaN (see the reproducible snippet below). It would be great to have an `ignore_nulls` flag, similar to the one in the built-in ewma, so that NaN or null values are ignored during the calculation. With the flag set, the presence or absence of a row containing null or NaN should have no effect on subsequent rows; i.e. the ewma output for the final row of the following two tables should be identical.

shape: (2, 2)
┌───────────┬────────────────────────────┐
│ values    ┆ time                       │
│ ---       ┆ ---                        │
│ f64       ┆ datetime[ns]               │
╞═══════════╪════════════════════════════╡
│ -0.042898 ┆ 2000-01-01 00:00:00        │
│ 0.186466  ┆ 2000-01-01 00:00:00.000002 │
└───────────┴────────────────────────────┘
shape: (3, 2)
┌───────────┬────────────────────────────┐
│ values    ┆ time                       │
│ ---       ┆ ---                        │
│ f64       ┆ datetime[ns]               │
╞═══════════╪════════════════════════════╡
│ -0.042898 ┆ 2000-01-01 00:00:00        │
│ NaN       ┆ 2000-01-01 00:00:00.000001 │
│ 0.186466  ┆ 2000-01-01 00:00:00.000002 │
└───────────┴────────────────────────────┘
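To make the requested semantics concrete, here is a minimal pure-Python sketch of a time-weighted EWMA that skips missing rows. It is not the polars-xdt implementation: the function name `ewma_by_time_skip_missing` is hypothetical, the update rule (`alpha = 1 - exp(-ln(2) * dt / half_life)`) is an assumed form of half-life decay, and emitting `None` for skipped rows is a choice made for this sketch only. The key point is that the elapsed time is measured from the last valid row, so a skipped row leaves later outputs unchanged.

```python
import math
from datetime import datetime, timedelta


def ewma_by_time_skip_missing(values, times, half_life):
    """Hypothetical reference: time-weighted EWMA that skips None/NaN inputs."""
    out = []
    prev_value = None  # last EWMA output computed from a valid observation
    prev_time = None   # timestamp of that observation
    for value, time in zip(values, times):
        if value is None or math.isnan(value):
            # Missing row: emit None and leave the running state untouched.
            out.append(None)
            continue
        if prev_value is None:
            # The first valid observation seeds the average.
            prev_value = value
        else:
            # Elapsed time since the last *valid* row, in units of half_life.
            dt = (time - prev_time) / half_life
            # Assumed update rule: the old average's weight halves every half_life.
            alpha = 1.0 - math.exp(-math.log(2.0) * dt)
            prev_value = alpha * value + (1.0 - alpha) * prev_value
        prev_time = time
        out.append(prev_value)
    return out


# The two tables above, with and without the NaN row.
t0 = datetime(2000, 1, 1)
us = timedelta(microseconds=1)
without_nan = ewma_by_time_skip_missing(
    [-0.042898, 0.186466], [t0, t0 + 2 * us], half_life=us
)
with_nan = ewma_by_time_skip_missing(
    [-0.042898, float("nan"), 0.186466], [t0, t0 + us, t0 + 2 * us], half_life=us
)
print(without_nan[-1] == with_nan[-1])  # True
```

Under these assumptions the final row of both tables gets the same value, which is the behaviour an `ignore_nulls` option would be expected to provide.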

Reproducible snippet

from datetime import timedelta

import numpy as np
import polars as pl
import polars_xdt as xdt


n = 100

df = pl.DataFrame({
    "values": np.linspace(0, 10, n) + 0.1 * np.random.normal(size=n),
    "time": np.datetime64("2000-01-01 00:00:00") + np.asarray([i*np.timedelta64(1000, "ns") for i in range(n)])
})


new = df.with_columns(xdt.ewma_by_time("values", times="time", half_life=timedelta(microseconds=1)).alias("ewma"))

# True: with no NaNs in the input, every ewma value is finite
print(new["ewma"].is_finite().all())

new_with_nan = df.with_columns(xdt.ewma_by_time(
    pl.when(pl.col("values") > 5).then(np.nan).otherwise(pl.col("values")), times="time", half_life=timedelta(microseconds=1)
).alias("ewma"))

# False: after the first NaN in the input, all following ewma values are NaN
print(new_with_nan["ewma"].is_finite().all())

wbeardall, Mar 22 '24 18:03