ewm_std disagrees with pandas ewm().std() and with rms approximation
What language are you using?
Python
Have you tried latest version of polars?
Yes
What version of polars are you using?
0.13.55
What operating system are you using polars on?
Windows 10
What language version are you using?
Python 3.8
Describe your bug.
pandas series.ewm(N).std() and polars pl.col(x).ewm_std(N) disagree
What are the steps to reproduce the behavior?
import polars as pl
import pandas as pd
l = pl.DataFrame({"a":list(range(-100, 101))}).with_columns([pl.col("a").ewm_std(201).alias("b")])
r = pd.DataFrame({"a":list(range(-100, 101))}).assign(b = lambda df: df["a"].ewm(201).std().fillna(0))
l.to_pandas()["b"] - r["b"]
What is the actual behavior?
l.to_pandas()["b"] - r["b"]
0 0.000000
1 -0.636921
2 -0.873638
3 -1.102503
4 -1.324247
...
196 -11.715175
197 -11.706087
198 -11.696753
199 -11.687175
200 -11.677356
Name: b, Length: 201, dtype: float64
(l.to_pandas()["b"] - r["b"]).min()
-11.856254921469628
What is the expected behavior?
In the above example pandas and polars differ by around 25%. In the original case where I encountered this (which I can't share), the pandas EWM stdev is approximately 3x the polars EWM stdev. I'd expect them to agree much more closely. Using root-mean-square as an approximation gets me pretty close agreement with the pandas result (see the sketch after the output below), which makes me inclined to think that the polars implementation is at fault here.
Perhaps a better indicator of the differences is the ratio:
(l.to_pandas()["b"] / r["b"])
0 NaN
1 0.099257
2 0.126360
3 0.146000
4 0.162464
...
196 0.789660
197 0.790833
198 0.791998
199 0.793156
200 0.794306
Name: b, Length: 201, dtype: float64
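For reference, a rough reconstruction of the RMS comparison mentioned above (hypothetical: it takes the square root of an EWM of squared values, which only stands in for the std because this series has mean close to zero):

import pandas as pd
import polars as pl

df = pl.DataFrame({"a": list(range(-100, 101))})
# hypothetical RMS proxy for the EWM std: sqrt(EWM(a^2))
rms = df.select(pl.col("a").pow(2).ewm_mean(com=201).sqrt().alias("rms"))
pd_std = pd.Series(range(-100, 101)).ewm(201).std()  # pandas result to compare against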
For reference, this is what we have implemented: https://stats.stackexchange.com/a/111912/147321
I haven't worked it through to satisfy myself that it's the problem, but the fact that the impl doesn't look at the adjust flag looks suspicious.
If someone can find the proper formula, this should be pretty doable as a first issue for anyone with Rust experience. Anyone willing to help?
> I haven't worked it through to satisfy myself that it's the problem, but the fact that the impl doesn't look at the adjust flag looks suspicious
It does. We first compute ewm_mean which corrects for adjust.
This is what is implemented: https://web.archive.org/web/20181222175223/http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
Pandas implementation: https://stackoverflow.com/questions/40754262/pandas-ewm-std-calculation
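In Python terms, the update in that paper looks roughly like the following (a minimal sketch, assuming a fixed alpha and adjust=False-style smoothing):

def ewm_mean_var(xs, alpha):
    # incremental exponentially weighted mean and (biased) variance,
    # following the update rule in the Finch paper linked above
    mean, var = xs[0], 0.0
    out = [(mean, var)]
    for x in xs[1:]:
        diff = x - mean
        incr = alpha * diff
        mean += incr
        var = (1 - alpha) * (var + diff * incr)
        out.append((mean, var))
    return out

With ewm_mean_var(list(range(10)), 0.5) the variances come out as 0.0, 0.25, 0.6875, ..., matching the polars ewm_var output shown further down.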
Ok, I've narrowed down the issue a bit. It appears that pandas uses (what they claim to be) an unbiased variance calculation (see here). This parameter does not work through the .expanding().var() API, though; the keyword arguments are silently ignored:
In [5]: pd.Series(range(10)).expanding().var(alpha=0.5, adjust=False, bias=True)
Out[5]:
0 NaN
1 0.500000
2 1.000000
3 1.666667
4 2.500000
5 3.500000
6 4.666667
7 6.000000
8 7.500000
9 9.166667
dtype: float64
In [6]: pd.Series(range(10)).expanding().var(alpha=0.5, adjust=False, bias=False)
Out[6]:
0 NaN
1 0.500000
2 1.000000
3 1.666667
4 2.500000
5 3.500000
6 4.666667
7 6.000000
8 7.500000
9 9.166667
dtype: float64
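As a sanity check (my own, not from the pandas docs), plain .expanding().var() with no keyword arguments at all returns the identical series, so the kwargs above appear to be silently swallowed rather than applied:

pd.Series(range(10)).expanding().var()  # same values as Out[5] and Out[6]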
However, if we manually construct the ExponentialMovingWindow ourselves, we see the difference:
In [7]: from pandas.core.window.ewm import ExponentialMovingWindow
In [8]: ExponentialMovingWindow(obj=pd.Series(range(10)), alpha=0.5, adjust=False).var(bias=True)
Out[8]:
0 0.000000
1 0.250000
2 0.687500
3 1.109375
4 1.433594
5 1.655273
6 1.796631
7 1.882751
8 1.933578
9 1.962887
dtype: float64
In [9]: ExponentialMovingWindow(obj=pd.Series(range(10)), alpha=0.5, adjust=False).var(bias=False)
Out[9]:
0 NaN
1 0.500000
2 1.100000
3 1.690476
4 2.158824
5 2.485337
6 2.695604
7 2.824300
8 2.900412
9 2.944341
dtype: float64
Note that this matches the polars result:
In [11]: pl.Series(range(10)).ewm_var(alpha=0.5, adjust=False)
Out[11]:
shape: (10,)
Series: '' [f64]
[
0.0
0.25
0.6875
1.109375
1.433594
1.655273
1.796631
1.882751
1.933578
1.962887
]
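For what it's worth, the same comparison should also be reachable through the public API, without constructing the window by hand (assuming a pandas version where ewm().var() exposes the bias keyword):

pd.Series(range(10)).ewm(alpha=0.5, adjust=False).var(bias=True)   # matches Out[8]
pd.Series(range(10)).ewm(alpha=0.5, adjust=False).var(bias=False)  # matches Out[9]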
For reference, here is some information on constructing unbiased estimates.
I was able to replicate the bias=False output:
import numpy as np
import pandas as pd

mean = {}
biased_var = {}
unbiased_var = {}
x = np.arange(-10, 0)  # the table below was produced with this input
alpha = 0.5
for n, x_n in enumerate(x):
    prev_mean = mean.get(n - 1, x_n)
    mean[n] = (1 - alpha) * prev_mean + alpha * x_n
    prev_biased_var = biased_var.get(n - 1, 0)
    biased_var[n] = (1 - alpha) * (prev_biased_var + alpha * np.square(x_n - prev_mean))
    # adjust=False weights; they sum to 1
    weights = np.power(1 - alpha, np.arange(n + 1))
    weights[:-1] *= alpha
    correction = (weights.sum() ** 2) / ((weights.sum() ** 2) - (weights ** 2).sum())
    unbiased_var[n] = biased_var[n] * correction
>>> pd.DataFrame(
...     {
...         "mean": pd.Series(mean),
...         "biased_var": pd.Series(biased_var),
...         "unbiased_var": pd.Series(unbiased_var),
...     }
... )
mean biased_var unbiased_var
0 -10.000000 0.000000 NaN
1 -9.500000 0.250000 0.500000
2 -8.750000 0.687500 1.100000
3 -7.875000 1.109375 1.690476
4 -6.937500 1.433594 2.158824
5 -5.968750 1.655273 2.485337
6 -4.984375 1.796631 2.695604
7 -3.992188 1.882751 2.824300
8 -2.996094 1.933578 2.900412
9 -1.998047 1.962887 2.944341
This StackOverflow comment was helpful.
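In other words, the unbiased estimate is the biased one scaled by (sum(w))^2 / ((sum(w))^2 - sum(w^2)); since the adjust=False weights sum to 1, that is just 1 / (1 - sum(w^2)). For example, at n = 1 with alpha = 0.5 the weights are [0.5, 0.5], so the correction is 1 / (1 - 0.5) = 2, and 0.25 * 2 = 0.5, matching the table above.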
Would a PR for supporting the bias flag be welcome?
Thanks for dissecting this @matteosantama, and great to have some insight into what we do differently than pandas. :+1:
A PR supporting the bias flag would be very welcome. :raised_hands:
I think we should also document why we differ from pandas.
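For anyone picking this up, here is a minimal Python sketch of what the flag could compute (illustrative only: the real change would live in the Rust kernels behind ewm_var/ewm_std, and the weight bookkeeping assumes adjust=False):

import numpy as np

def ewm_var(xs, alpha, bias=True):
    # EWM variance with an optional pandas-style bias correction (adjust=False only)
    xs = np.asarray(list(xs), dtype=float)
    mean, var = xs[0], 0.0
    out = np.zeros(len(xs))
    for n, x in enumerate(xs[1:], start=1):
        diff = x - mean
        incr = alpha * diff
        mean += incr
        var = (1 - alpha) * (var + diff * incr)  # biased estimate
        out[n] = var
    if not bias:
        for n in range(len(xs)):
            # adjust=False weights sum to 1, so the correction is 1 / (1 - sum(w^2))
            w = np.power(1 - alpha, np.arange(n + 1))
            w[:-1] *= alpha
            ssq = np.sum(w ** 2)
            out[n] = out[n] / (1 - ssq) if ssq < 1 else np.nan
    return out

ewm_var(range(10), 0.5, bias=True) reproduces the current polars output above, and bias=False reproduces pandas' default unbiased result.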