ewm_std disagrees with pandas ewm().std() and with rms approximation
What language are you using?
Python
Have you tried latest version of polars?
Yes
What version of polars are you using?
0.13.55
What operating system are you using polars on?
Windows 10
What language version are you using?
Python 3.8
Describe your bug.
pandas series.ewm(N).std() and polars pl.col(x).ewm_std(N) disagree
What are the steps to reproduce the behavior?
import polars as pl
import pandas as pd
l = pl.DataFrame({"a":list(range(-100, 101))}).with_columns([pl.col("a").ewm_std(201).alias("b")])
r = pd.DataFrame({"a":list(range(-100, 101))}).assign(b = lambda df: df["a"].ewm(201).std().fillna(0))
l.to_pandas()["b"] - r["b"]
What is the actual behavior?
l.to_pandas()["b"] - r["b"]
0 0.000000
1 -0.636921
2 -0.873638
3 -1.102503
4 -1.324247
...
196 -11.715175
197 -11.706087
198 -11.696753
199 -11.687175
200 -11.677356
Name: b, Length: 201, dtype: float64
(l.to_pandas()["b"] - r["b"]).min()
-11.856254921469628
What is the expected behavior?
In the above example pandas and polars differ by around 25%. In the original case where I encountered this (which I can't share), the pandas EWM stdev is approximately 3x the polars EWM stdev. I'd expect them to agree much more closely. Using root-mean-square as an approximation gets me pretty close agreement with the pandas result (see the sketch after the output below), which makes me inclined to think that the polars implementation is at fault here.
Perhaps a better indicator of the differences is the ratio:
(l.to_pandas()["b"] / r["b"])
0 NaN
1 0.099257
2 0.126360
3 0.146000
4 0.162464
...
196 0.789660
197 0.790833
198 0.791998
199 0.793156
200 0.794306
Name: b, Length: 201, dtype: float64
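For reference, a rough reconstruction of the RMS comparison mentioned above (hypothetical: it takes the square root of an EWM of squared values, which only stands in for the std because this series has mean close to zero):

import pandas as pd
import polars as pl

df = pl.DataFrame({"a": list(range(-100, 101))})
# hypothetical RMS proxy for the EWM std: sqrt(EWM(a^2))
rms = df.select(pl.col("a").pow(2).ewm_mean(com=201).sqrt().alias("rms"))
pd_std = pd.Series(range(-100, 101)).ewm(201).std()  # pandas result to compare against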
For reference, this is what we have implemented: https://stats.stackexchange.com/a/111912/147321
I haven't worked it through to satisfy myself that it's the problem, but the fact that the impl doesn't look at the adjust flag looks suspicious.
If someone can find the proper formula, this should be pretty doable as a first issue for anyone with Rust experience. Anyone willing to help?
> I haven't worked it through to satisfy myself that it's the problem, but the fact that the impl doesn't look at the adjust flag looks suspicious
It does. We first compute ewm_mean which corrects for adjust.
This is what is implemented: https://web.archive.org/web/20181222175223/http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
Pandas implementation: https://stackoverflow.com/questions/40754262/pandas-ewm-std-calculation
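In Python terms, the update in that paper looks roughly like the following (a minimal sketch, assuming a fixed alpha and adjust=False-style smoothing):

def ewm_mean_var(xs, alpha):
    # incremental exponentially weighted mean and (biased) variance,
    # following the update rule in the Finch paper linked above
    mean, var = xs[0], 0.0
    out = [(mean, var)]
    for x in xs[1:]:
        diff = x - mean
        incr = alpha * diff
        mean += incr
        var = (1 - alpha) * (var + diff * incr)
        out.append((mean, var))
    return out

With ewm_mean_var(list(range(10)), 0.5) the variances come out as 0.0, 0.25, 0.6875, ..., matching the polars ewm_var output shown further down.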
Ok, I've narrowed down the issue a bit. It appears that pandas uses (what they claim to be) an unbiased variance calculation (see here). This parameter does not work through the .expanding().var() API, though; the keyword arguments are silently ignored:
In [5]: pd.Series(range(10)).expanding().var(alpha=0.5, adjust=False, bias=True)
Out[5]:
0 NaN
1 0.500000
2 1.000000
3 1.666667
4 2.500000
5 3.500000
6 4.666667
7 6.000000
8 7.500000
9 9.166667
dtype: float64
In [6]: pd.Series(range(10)).expanding().var(alpha=0.5, adjust=False, bias=False)
Out[6]:
0 NaN
1 0.500000
2 1.000000
3 1.666667
4 2.500000
5 3.500000
6 4.666667
7 6.000000
8 7.500000
9 9.166667
dtype: float64
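As a sanity check (my own, not from the pandas docs), plain .expanding().var() with no keyword arguments at all returns the identical series, so the kwargs above appear to be silently swallowed rather than applied:

pd.Series(range(10)).expanding().var()  # same values as Out[5] and Out[6]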
However, if we manually construct the ExponentialMovingWindow ourselves, we see the difference:
In [7]: from pandas.core.window.ewm import ExponentialMovingWindow
In [8]: ExponentialMovingWindow(obj=pd.Series(range(10)), alpha=0.5, adjust=False).var(bias=True)
Out[8]:
0 0.000000
1 0.250000
2 0.687500
3 1.109375
4 1.433594
5 1.655273
6 1.796631
7 1.882751
8 1.933578
9 1.962887
dtype: float64
In [9]: ExponentialMovingWindow(obj=pd.Series(range(10)), alpha=0.5, adjust=False).var(bias=False)
Out[9]:
0 NaN
1 0.500000
2 1.100000
3 1.690476
4 2.158824
5 2.485337
6 2.695604
7 2.824300
8 2.900412
9 2.944341
dtype: float64
Note that this matches the polars result:
In [11]: pl.Series(range(10)).ewm_var(alpha=0.5, adjust=False)
Out[11]:
shape: (10,)
Series: '' [f64]
[
0.0
0.25
0.6875
1.109375
1.433594
1.655273
1.796631
1.882751
1.933578
1.962887
]
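For what it's worth, the same comparison should also be reachable through the public API, without constructing the window by hand (assuming a pandas version where ewm().var() exposes the bias keyword):

pd.Series(range(10)).ewm(alpha=0.5, adjust=False).var(bias=True)   # matches Out[8]
pd.Series(range(10)).ewm(alpha=0.5, adjust=False).var(bias=False)  # matches Out[9]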
For reference, here is some information on constructing unbiased estimates.
I was able to replicate the bias=False output:
import numpy as np
import pandas as pd

mean = {}
biased_var = {}
unbiased_var = {}
x = np.arange(-10, 0)  # the table below was produced with this input
alpha = 0.5
for n, x_n in enumerate(x):
    prev_mean = mean.get(n - 1, x_n)
    mean[n] = (1 - alpha) * prev_mean + alpha * x_n
    prev_biased_var = biased_var.get(n - 1, 0)
    biased_var[n] = (1 - alpha) * (prev_biased_var + alpha * np.square(x_n - prev_mean))
    # adjust=False weights; they sum to 1
    weights = np.power(1 - alpha, np.arange(n + 1))
    weights[:-1] *= alpha
    correction = (weights.sum() ** 2) / ((weights.sum() ** 2) - (weights ** 2).sum())
    unbiased_var[n] = biased_var[n] * correction
>>> pd.DataFrame(
...     {
...         "mean": pd.Series(mean),
...         "biased_var": pd.Series(biased_var),
...         "unbiased_var": pd.Series(unbiased_var),
...     }
... )
mean biased_var unbiased_var
0 -10.000000 0.000000 NaN
1 -9.500000 0.250000 0.500000
2 -8.750000 0.687500 1.100000
3 -7.875000 1.109375 1.690476
4 -6.937500 1.433594 2.158824
5 -5.968750 1.655273 2.485337
6 -4.984375 1.796631 2.695604
7 -3.992188 1.882751 2.824300
8 -2.996094 1.933578 2.900412
9 -1.998047 1.962887 2.944341
This StackOverflow comment was helpful.
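In other words, the unbiased estimate is the biased one scaled by (sum(w))^2 / ((sum(w))^2 - sum(w^2)); since the adjust=False weights sum to 1, that is just 1 / (1 - sum(w^2)). For example, at n = 1 with alpha = 0.5 the weights are [0.5, 0.5], so the correction is 1 / (1 - 0.5) = 2, and 0.25 * 2 = 0.5, matching the table above.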
Would a PR for supporting the bias flag be welcome?
Thanks for dissecting this @matteosantama, and great to have some insight into what we do differently than pandas. :+1:
A PR supporting the bias flag would be very welcome. :raised_hands:
I think we should also document why we differ from pandas.
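For anyone picking this up, here is a minimal Python sketch of what the flag could compute (illustrative only: the real change would live in the Rust kernels behind ewm_var/ewm_std, and the weight bookkeeping assumes adjust=False):

import numpy as np

def ewm_var(xs, alpha, bias=True):
    # EWM variance with an optional pandas-style bias correction (adjust=False only)
    xs = np.asarray(list(xs), dtype=float)
    mean, var = xs[0], 0.0
    out = np.zeros(len(xs))
    for n, x in enumerate(xs[1:], start=1):
        diff = x - mean
        incr = alpha * diff
        mean += incr
        var = (1 - alpha) * (var + diff * incr)  # biased estimate
        out[n] = var
    if not bias:
        for n in range(len(xs)):
            # adjust=False weights sum to 1, so the correction is 1 / (1 - sum(w^2))
            w = np.power(1 - alpha, np.arange(n + 1))
            w[:-1] *= alpha
            ssq = np.sum(w ** 2)
            out[n] = out[n] / (1 - ssq) if ssq < 1 else np.nan
    return out

ewm_var(range(10), 0.5, bias=True) reproduces the current polars output above, and bias=False reproduces pandas' default unbiased result.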