OnlineStats.jl icon indicating copy to clipboard operation
OnlineStats.jl copied to clipboard

What algorithms are being used for the mean and variance?

Open jj-du-plessis opened this issue 11 months ago • 3 comments

As far as I can tell from the documentation of Mean() and Variance(), the algorithms on which your implementations are based are not documented. As an academic user, it is problematic for me to use your implementation which claims to be an online algorithm without any references. I ask that you please add information/references about the algorithms used to the documentation.

jj-du-plessis avatar Jan 14 '25 12:01 jj-du-plessis

The algorithm for the mean is trivial.

if:

M[n] = sum(x[1:n]) / n

then:

M[n+1] = (x[n+1] + sum(x[1:n])) / (n+1)
       = (x[n+1] + n*M[n]) / (n+1)
       = x[n+1] / (n+1) + n / (n+1) * M[n]

Let g = 1/(n+1) and then you have

M[n+1] = g * x[n+1] + (1-g) * M[n]
       = M[n] + g * (x[n+1] - M[n])
       = smooth(M[n], x[n+1], 1 / (n+1))

which is the formula being used at: https://github.com/joshday/OnlineStatsBase.jl/blob/master/src/stats.jl#L471-L473

with the smooth method found here: https://github.com/joshday/OnlineStatsBase.jl/blob/master/src/OnlineStatsBase.jl#L184

For standard weights, you have weight(n) = 1/n, so altogether you get o.μ = smooth(o.μ, x, o.weight(o.n += 1)).

For the variance, it appears to be Welford's online algorithm. You can check the wikipedia page and follow the citations there.

adknudson avatar Jan 26 '25 05:01 adknudson

A lot of the algorithms here are covered in my dissertation: https://repository.lib.ncsu.edu/server/api/core/bitstreams/1c760e12-e284-43e0-a6c4-0271cc84912b/content (PDF).

joshday avatar Jan 26 '25 14:01 joshday

Thank you both!

@joshday as a possible improvement to the package docs, please cite your dissertation somewhere prominently in the documentation so users can figure out which algorithms are used if they need to.

jj-du-plessis avatar Jan 26 '25 18:01 jj-du-plessis