Support an `ignore_nulls` param for EWM calculations (so we can match pandas default behaviour)
Problem description
We are hitting a few speed bumps migrating some pandas code to polars due to a missing param in the EWM functionality, specifically `ignore_na`.
Request
An additional `ignore_nulls` param (eg: on `ewm_mean`, etc) that allows the pandas default EWM calculations to be matched; no change to the default polars behaviour, just the capability to natively calculate the same values as pandas.
Example
import pandas as pd
import polars as pl

df = pl.DataFrame({"x": [1, None, 2, 3, None, 4, 5, 6]})

# polars (default)
df.select(pl.col("x").ewm_mean(com=0.5))

# pandas (default)
df.to_pandas().ewm(com=0.5).mean()

# pandas (ignore nulls)
df.to_pandas().ewm(com=0.5, ignore_na=True).mean()
Comparison
The default pandas and polars behaviours do not match; you can set ignore_na=True to make pandas match polars, but there is no equivalent param in polars to match the other way round.
# polars pandas (default) pandas(ignore_na)
# ┌──────────┐
# │ x │
# │ --- │
# │ f64 │
# ╞══════════╡ x x
# │ 1.0 │ 0 1.000000 0 1.000000
# │ 1.0 │ 1 1.000000 1 1.000000
# │ 1.75 │ 2 1.900000 2 1.750000
# │ 2.615385 │ 3 2.702703 3 2.615385
# │ 2.615385 │ 4 2.702703 4 2.615385
# │ 3.55 │ 5 3.828571 5 3.550000
# │ 4.520661 │ 6 4.674926 6 4.520661
# │ 5.508242 │ 7 5.581665 7 5.508242
# └──────────┘
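For anyone wondering why the two defaults diverge: with adjust=True the weight on an observation is (1 - alpha) raised to its "age", and the two modes disagree on whether a null increases that age. A plain-Python sketch (illustration only, not library code; the 1.75 / 1.9 values are the third row of the table above):

```python
alpha = 1 / (1 + 0.5)  # com=0.5 -> alpha = 2/3
x = [1.0, None, 2.0]

# ignore_na=True (the polars behaviour): weights decay per *observed* value
obs = [v for v in x if v is not None]
w_obs = [(1 - alpha) ** (len(obs) - 1 - i) for i in range(len(obs))]
mean_ignore = sum(w * v for w, v in zip(w_obs, obs)) / sum(w_obs)

# ignore_na=False (the pandas default): weights decay per *absolute* position,
# so a null contributes weight 0 but still ages the earlier observations
w_abs = [
    0.0 if v is None else (1 - alpha) ** (len(x) - 1 - i)
    for i, v in enumerate(x)
]
mean_keep = sum(
    w * (0.0 if v is None else v) for w, v in zip(w_abs, x)
) / sum(w_abs)

print(mean_ignore)  # ≈ 1.75 (the polars column)
print(mean_keep)    # ≈ 1.9  (the pandas default column)
```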
Workaround
It is possible to simulate the desired behaviour (we have put it in a custom expression namespace), but this is obviously less efficient and isn't portable (you'd need the same extension available everywhere, and it only works from Python - not Rust).
from polars.internals.expr.expr import _prepare_alpha


@pl.api.register_expr_namespace("pandas_ewm")
class PandasEWM:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def mean(
        self,
        com: float | None = None,
        span: float | None = None,
        half_life: float | None = None,
        alpha: float | None = None,
        adjust: bool = True,
        min_periods: int = 1,
        ignore_nulls: bool = True,
    ) -> pl.Expr:
        if ignore_nulls:
            # default polars behaviour
            return self._expr.ewm_mean(
                com, span, half_life, alpha, adjust, min_periods
            )
        else:
            # pandas default (ignore_na=False): weight by absolute position,
            # giving nulls weight 0 but still decaying earlier observations
            alpha = _prepare_alpha(com, span, half_life, alpha)
            e = self._expr
            n = e.len()
            w = (
                pl.when(e.is_null())
                .then(0.0)
                .otherwise((1.0 - alpha) ** (n - pl.arange(1, n + 1)))
            )
            return (e.fill_null(0) * w).cumsum() / w.cumsum()
With this extension, we can now replicate the default pandas result...
df.select(
pl.col('x').pandas_ewm.mean(com=0.5, ignore_nulls=False)
)
# ┌──────────┐
# │ x │
# │ --- │
# │ f64 │
# ╞══════════╡
# │ 1.0 │
# │ 1.0 │
# │ 1.75 │
# │ 2.615385 │
# │ 2.615385 │
# │ 3.55 │
# │ 4.520661 │
# │ 5.508242 │
# └──────────┘
...but ideally we'd have a native ignore_nulls param instead.
@matteosantama - as the resident EWM maestro, would you be interested in looking at this one? :)
Sorry mate, school's in full swing and I'm a bit underwater.
I remember briefly looking at this, and I believe the key was adding an else arm to this
> Sorry mate, school's in full swing and I'm a bit underwater.
Understandable prioritisation ;) Thanks for the hint/suggestion!
@alexander-beedie @ritchie46 @matteosantama Would you mind if I took a stab at this issue? I think I understand what needs to be done.
> @alexander-beedie @ritchie46 @matteosantama Would you mind if I took a stab at this issue? I think I understand what needs to be done.
Please 🙂
I am working on a PR (https://github.com/yuntai/polars/blob/ewm_ignore_nulls/polars/polars-arrow/src/kernels/ewm/variance.rs) with support for an ignore_nulls flag. I basically ported the numba code from pandas (https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/window/aggregations.pyx#L1833), so the results match pandas' with or without the ignore_nulls (ignore_na) flag. Currently, polars differs in behaviour from pandas for var & std when min_periods=1 or min_periods=0.
[ins] In [2]: pl.Series([1.,2,3]).ewm_var(alpha=0.5)
Out[2]:
shape: (3,)
Series: '' [f64]
[
0.0
0.5
0.928571
]
[ins] In [3]: import pandas as pd
[ins] In [4]: pd.Series([1.,2,3]).ewm(alpha=0.5).var()
Out[4]:
0 NaN
1 0.500000
2 0.928571
dtype: float64
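For reference, the pandas NaN above appears to come from the unbiased correction being undefined for a single observation. A direct (O(n²), non-incremental) sketch of the adjusted, unbiased EWM variance using the standard weighted-variance formula, which reproduces the pandas column:

```python
import math


def ewm_var_unbiased(xs, alpha):
    """Adjusted, unbiased EWM variance computed directly from the weights.

    A reference sketch only -- not the pandas/polars implementation.
    """
    out = []
    for t in range(len(xs)):
        w = [(1 - alpha) ** (t - i) for i in range(t + 1)]
        sw = sum(w)
        sw2 = sum(wi * wi for wi in w)
        mean = sum(wi * x for wi, x in zip(w, xs)) / sw
        num = sum(wi * (x - mean) ** 2 for wi, x in zip(w, xs))
        denom = sw * sw - sw2  # zero when t == 0: only one weight
        out.append(sw * num / denom if denom > 0 else math.nan)
    return out


print(ewm_var_unbiased([1.0, 2.0, 3.0], alpha=0.5))
# [nan, 0.5, 0.928571...] -- matches the pandas output above;
# polars instead returns 0.0 for the first element
```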
Are we planning to maintain this difference?
Yes, if I recall correctly pandas is actually a bit inconsistent w.r.t. the bias parameter:
In [2]: s = pd.Series(range(4))
In [7]: s.ewm(alpha=0.5).var(bias=True)
Out[7]:
0 0.000000
1 0.222222
2 0.530612
3 0.862222
dtype: float64
In [8]: s.ewm(alpha=0.5).var(bias=False)
Out[8]:
0 NaN
1 0.500000
2 0.928571
3 1.385714
dtype: float64
I think returning 0.0 in both cases is the correct behavior.
It's specifically handled in pandas (https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/window/aggregations.pyx#L1891); I'm not sure what the rationale behind it is. Perhaps when bias=False, the bias correction term is deemed not well defined? https://github.com/pandas-dev/pandas/pull/7926 https://github.com/pandas-dev/pandas/issues/7900
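That reading checks out arithmetically: with a single observation the debiasing denominator vanishes, so the unbiased estimate is 0/0 while the biased one is a well-defined 0.0. A minimal sketch (my interpretation of the linked pandas code, not a quote of it):

```python
# Weighted-variance algebra for a single observation (one weight, w0 = 1):
sum_w, sum_w2 = 1.0, 1.0

# biased estimate divides by sum_w, so 0 / 1 = 0.0 is well defined
biased = 0.0 / sum_w

# the unbiased correction multiplies by sum_w**2 / (sum_w**2 - sum_w2),
# and with one observation that denominator is 1 - 1 = 0, i.e. 0/0
denom = sum_w**2 - sum_w2
print(biased, denom)  # 0.0 0.0 -> undefined unbiased estimate, hence NaN
```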
Could be. I'd still advocate for preserving our current behavior (return 0.0 in both cases).
Closed by #6742 🏆