Support an `ignore_nulls` param for EWM calculations (so we can match pandas default behaviour)
Problem description
We are hitting a few speed bumps migrating some pandas code to polars due to a missing param in the EWM functionality, specifically `ignore_na`.
Request
An additional `ignore_nulls` param (eg: on `ewm_mean`, etc) that allows the pandas default EWM calculations to be matched; no change to the default polars behaviour, just the capability to natively calculate the same values as pandas.
Example
import pandas as pd
import polars as pl

df = pl.DataFrame({"x": [1, None, 2, 3, None, 4, 5, 6]})

# polars (default)
df.select(pl.col("x").ewm_mean(com=0.5))

# pandas (default)
df.to_pandas().ewm(com=0.5).mean()

# pandas (ignore nulls)
df.to_pandas().ewm(com=0.5, ignore_na=True).mean()
Comparison
The default pandas and polars behaviours do not match; you can set ignore_na=True to make pandas match polars, but there is no equivalent param in polars to match the other way round.
# polars pandas (default) pandas(ignore_na)
# ┌──────────┐
# │ x │
# │ --- │
# │ f64 │
# ╞══════════╡ x x
# │ 1.0 │ 0 1.000000 0 1.000000
# │ 1.0 │ 1 1.000000 1 1.000000
# │ 1.75 │ 2 1.900000 2 1.750000
# │ 2.615385 │ 3 2.702703 3 2.615385
# │ 2.615385 │ 4 2.702703 4 2.615385
# │ 3.55 │ 5 3.828571 5 3.550000
# │ 4.520661 │ 6 4.674926 6 4.520661
# │ 5.508242 │ 7 5.581665 7 5.508242
# └──────────┘
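For anyone wondering why the two defaults diverge: with adjust=True the weight on an observation is (1 - alpha) raised to its "age", and the two modes disagree on whether a null increases that age. A plain-Python sketch (illustration only, not library code; the 1.75 / 1.9 values are the third row of the table above):

```python
alpha = 1 / (1 + 0.5)  # com=0.5 -> alpha = 2/3
x = [1.0, None, 2.0]

# ignore_na=True (the polars behaviour): weights decay per *observed* value
obs = [v for v in x if v is not None]
w_obs = [(1 - alpha) ** (len(obs) - 1 - i) for i in range(len(obs))]
mean_ignore = sum(w * v for w, v in zip(w_obs, obs)) / sum(w_obs)

# ignore_na=False (the pandas default): weights decay per *absolute* position,
# so a null contributes weight 0 but still ages the earlier observations
w_abs = [
    0.0 if v is None else (1 - alpha) ** (len(x) - 1 - i)
    for i, v in enumerate(x)
]
mean_keep = sum(
    w * (0.0 if v is None else v) for w, v in zip(w_abs, x)
) / sum(w_abs)

print(mean_ignore)  # ≈ 1.75 (the polars column)
print(mean_keep)    # ≈ 1.9  (the pandas default column)
```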
Workaround
It is possible to simulate the desired behaviour (we have put it in a custom expression namespace), but this is obviously less efficient and isn't portable (you'd need the same extension available everywhere, and it only works from Python - not Rust).
from polars.internals.expr.expr import _prepare_alpha


@pl.api.register_expr_namespace("pandas_ewm")
class PandasEWM:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def mean(
        self,
        com: float | None = None,
        span: float | None = None,
        half_life: float | None = None,
        alpha: float | None = None,
        adjust: bool = True,
        min_periods: int = 1,
        ignore_nulls: bool = True,
    ) -> pl.Expr:
        if ignore_nulls:
            # default polars behaviour
            return self._expr.ewm_mean(
                com, span, half_life, alpha, adjust, min_periods
            )
        else:
            # pandas default (ignore_na=False): weight by absolute position,
            # giving nulls weight 0 but still decaying earlier observations
            alpha = _prepare_alpha(com, span, half_life, alpha)
            e = self._expr
            n = e.len()
            w = (
                pl.when(e.is_null())
                .then(0.0)
                .otherwise((1.0 - alpha) ** (n - pl.arange(1, n + 1)))
            )
            return (e.fill_null(0) * w).cumsum() / w.cumsum()
With this extension, we can now replicate the default pandas result...
df.select(
pl.col('x').pandas_ewm.mean(com=0.5, ignore_nulls=False)
)
# ┌──────────┐
# │ x │
# │ --- │
# │ f64 │
# ╞══════════╡
# │ 1.0 │
# │ 1.0 │
# │ 1.75 │
# │ 2.615385 │
# │ 2.615385 │
# │ 3.55 │
# │ 4.520661 │
# │ 5.508242 │
# └──────────┘
...but ideally we'd have a native ignore_nulls param instead.
@matteosantama - as the resident EWM maestro, would you be interested in looking at this one? :)
Sorry mate, school's in full swing and I'm a bit underwater.
I remember briefly looking at this, and I believe the key was adding an else arm to this
> Sorry mate, school's in full swing and I'm a bit underwater.
Understandable prioritisation ;) Thanks for the hint/suggestion!
@alexander-beedie @ritchie46 @matteosantama Would you mind if I took a stab at this issue? I think I understand what needs to be done.
> @alexander-beedie @ritchie46 @matteosantama Would you mind if I took a stab at this issue? I think I understand what needs to be done.
Please 🙂
I am working on a PR (https://github.com/yuntai/polars/blob/ewm_ignore_nulls/polars/polars-arrow/src/kernels/ewm/variance.rs) with support for an ignore_nulls flag. I basically ported the numba code from pandas (https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/window/aggregations.pyx#L1833), so the results match pandas' with or without the ignore_nulls (ignore_na) flag. Currently, polars differs in behaviour from pandas for var & std when min_periods=1 or min_periods=0.
[ins] In [2]: pl.Series([1.,2,3]).ewm_var(alpha=0.5)
Out[2]:
shape: (3,)
Series: '' [f64]
[
0.0
0.5
0.928571
]
[ins] In [3]: import pandas as pd
[ins] In [4]: pd.Series([1.,2,3]).ewm(alpha=0.5).var()
Out[4]:
0 NaN
1 0.500000
2 0.928571
dtype: float64
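For reference, the pandas NaN above appears to come from the unbiased correction being undefined for a single observation. A direct (O(n²), non-incremental) sketch of the adjusted, unbiased EWM variance using the standard weighted-variance formula, which reproduces the pandas column:

```python
import math


def ewm_var_unbiased(xs, alpha):
    """Adjusted, unbiased EWM variance computed directly from the weights.

    A reference sketch only -- not the pandas/polars implementation.
    """
    out = []
    for t in range(len(xs)):
        w = [(1 - alpha) ** (t - i) for i in range(t + 1)]
        sw = sum(w)
        sw2 = sum(wi * wi for wi in w)
        mean = sum(wi * x for wi, x in zip(w, xs)) / sw
        num = sum(wi * (x - mean) ** 2 for wi, x in zip(w, xs))
        denom = sw * sw - sw2  # zero when t == 0: only one weight
        out.append(sw * num / denom if denom > 0 else math.nan)
    return out


print(ewm_var_unbiased([1.0, 2.0, 3.0], alpha=0.5))
# [nan, 0.5, 0.928571...] -- matches the pandas output above;
# polars instead returns 0.0 for the first element
```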
Are we planning to maintain this difference?
Yes, if I recall correctly pandas is actually a bit inconsistent w.r.t. the bias parameter:
In [2]: s = pd.Series(range(4))
In [7]: s.ewm(alpha=0.5).var(bias=True)
Out[7]:
0 0.000000
1 0.222222
2 0.530612
3 0.862222
dtype: float64
In [8]: s.ewm(alpha=0.5).var(bias=False)
Out[8]:
0 NaN
1 0.500000
2 0.928571
3 1.385714
dtype: float64
I think returning 0.0 in both cases is the correct behavior.
It's specifically handled in pandas (https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/window/aggregations.pyx#L1891); I'm not sure what the rationale behind it is. Perhaps when bias=False, the bias correction term is deemed not well defined? https://github.com/pandas-dev/pandas/pull/7926 https://github.com/pandas-dev/pandas/issues/7900
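That reading checks out arithmetically: with a single observation the debiasing denominator vanishes, so the unbiased estimate is 0/0 while the biased one is a well-defined 0.0. A minimal sketch (my interpretation of the linked pandas code, not a quote of it):

```python
# Weighted-variance algebra for a single observation (one weight, w0 = 1):
sum_w, sum_w2 = 1.0, 1.0

# biased estimate divides by sum_w, so 0 / 1 = 0.0 is well defined
biased = 0.0 / sum_w

# the unbiased correction multiplies by sum_w**2 / (sum_w**2 - sum_w2),
# and with one observation that denominator is 1 - 1 = 0, i.e. 0/0
denom = sum_w**2 - sum_w2
print(biased, denom)  # 0.0 0.0 -> undefined unbiased estimate, hence NaN
```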
Could be. I'd still advocate for preserving our current behavior (return 0.0 in both cases).
Closed by #6742 🏆