StatsBase.jl

autocov incorrect?

Open tpapp opened this issue 8 years ago • 7 comments

Consider an alternating sequence of 1 and -1, which (without demeaning) has autocovariance 1 at even lags and -1 at odd lags.

using StatsBase
x = ones(20)
x[1:2:end] .= -1.0                              # alternating -1, 1, -1, ...
ac = autocov(x, 0:length(x)-1; demean = false)  # all lags 0, 1, ..., 19

But

julia> showcompact(ac)
[1.0, -0.95, 0.9, -0.85, 0.8, -0.75, 0.7, -0.65, 0.6, -0.55, 0.5, -0.45, 0.4, -0.35, 0.3, -0.25, 0.2, -0.15, 0.1, -0.05]                                             

I wonder if one should divide by the number of overlapping pairs at each lag (N-k) instead of the length of the whole sequence (N).
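For reference, here is a minimal sketch of that alternative normalization, obtained by rescaling the StatsBase result so that lag k is divided by the number of overlapping pairs N-k (the variable names are just for illustration):

using StatsBase

n = 20
x = ones(n)
x[1:2:end] .= -1.0

lags = 0:n-1
ac_N  = autocov(x, lags; demean = false)   # StatsBase: divides every lag by n
ac_Nk = ac_N .* n ./ (n .- lags)           # rescaled so lag k is divided by n - k

# ac_Nk ≈ [1.0, -1.0, 1.0, -1.0, ...] -- the alternating pattern is recovered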

tpapp avatar Jun 10 '17 06:06 tpapp

I have looked at some textbooks, and both forms are in use (i.e. with denominators N and N-k, where N is the length of the sequence and k is the lag). The form with N has a nice property: the estimated autocovariance sequence is positive semidefinite when arranged as a matrix, but the estimator is not unbiased. Some texts argue that since k << N, the difference should not matter. The variogram and related literature tend to use N-k. Perhaps the documentation could clarify this.
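For concreteness, the two estimators under discussion are, in my notation (demeaned case),

\hat\gamma_N(k) = \frac{1}{N} \sum_{t=1}^{N-k} (x_t - \bar{x})(x_{t+k} - \bar{x}),
\qquad
\hat\gamma_{N-k}(k) = \frac{1}{N-k} \sum_{t=1}^{N-k} (x_t - \bar{x})(x_{t+k} - \bar{x}).

Both use the same N-k cross products; they differ only in the denominator.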

tpapp avatar Jun 10 '17 07:06 tpapp

This is on purpose. I think it is the standard thing to do; e.g. it is the estimator in Brockwell and Davis and it is what R does. Notice that the N-k version isn't unbiased either. In general, autocovariance estimators (like most other time series estimators) aren't unbiased. PRs with improvements to the documentation are always welcome.

andreasnoack avatar Jun 10 '17 11:06 andreasnoack

I noticed this weird (to me) behavior just today. The normalization has a large effect on estimates of the spectrum derived from the estimated ACF. Compare the two ACF estimates below:

[figure: the two ACF estimates; the white line is the StatsBase behavior, the orange one is normalized by (N-k+1)]

If we estimate spectra based on these two ACFs, we obtain the following result:

[figure: spectrum estimates derived from the two ACFs]

I find the normalization by N as opposed to N-k odd, since the function is called autocov but it does not give you the covariance between the signal and its lagged version. It's also not mentioned in the docstring, so I think this issue deserves to remain open.

baggepinnen avatar Mar 04 '20 09:03 baggepinnen

What is the data generating process and what is the sample size in those examples?

andreasnoack avatar Mar 04 '20 12:03 andreasnoack

T = 100                      # number of samples
fs = 10                      # sampling frequency
F = [1, 3]                   # sinusoid frequencies
t = range(0, step = 1/fs, length = T)
# two sinusoids with random phases plus white noise
y = sum(sin.(2π*f .* t .+ 2π*rand()) for f in F) .+ 0.1 .* randn.()
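Putting this together with the ACF comparison above, a rough sketch of that kind of comparison could look as follows. It assumes the y from the snippet above and uses a plain FFT of the (truncated) ACF as a crude spectrum estimate; that is only one of several possible choices and not necessarily the code behind the original plots:

using StatsBase, FFTW

n    = length(y)
lags = 0:n-1
acf_N  = autocov(y, lags)                  # StatsBase normalization: 1/n
acf_Nk = acf_N .* n ./ (n .- lags)         # rescaled to 1/(n - k)

S_N  = abs.(rfft(acf_N))                   # crude spectrum estimate from each ACF
S_Nk = abs.(rfft(acf_Nk))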

baggepinnen avatar Mar 04 '20 14:03 baggepinnen

Brockwell and Davis do discuss this, and they propose the use of n in their covariance functions. They also mention n-k, which is also biased, but less so. For smaller datasets, this makes a significant difference. If you were doing this by hand, you would only have n-k pairs, so dividing by n wouldn't make sense. The question is whether retaining n to guarantee the positive definite covariance matrix is critical. I'm having to roll my own cov functions to allow the n-k approach.
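For reference, a minimal hand-rolled estimator with the n-k denominator could look like the sketch below (the function name and keyword are mine, not StatsBase API):

using Statistics

function autocov_nk(x::AbstractVector, lags; demean::Bool = true)
    n = length(x)
    z = demean ? x .- mean(x) : x
    # divide each lag by the number of pairs actually used, n - k
    [sum(z[1:n-k] .* z[1+k:n]) / (n - k) for k in lags]
end

# autocov_nk(x, 0:length(x)-1; demean = false) recovers the exact ±1
# pattern for the alternating sequence from the top of this issue.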

quebbs avatar Oct 04 '20 04:10 quebbs

The n-k-1 estimator is unbiased. The current implementation shrinks large lags towards 0 by dividing by n, which usually reduces mean squared error. The major problem with the n-k estimator is that it is not even consistent: as the dataset grows, the number of lags estimated increases as well, and because the second-to-last lag is always estimated from just two pairs of observations (and the last lag from one), the variance of those high-lag estimates never shrinks. By contrast, dividing by n yields a consistent estimator whenever the true autocovariance goes to 0 (the usual case). The n-k version is especially bad when trying to estimate not particular lags, but instead the variance of a time series' mean; in that case it will frequently give inconsistent estimates, e.g. implying that a series' total autocorrelation exceeds 1.
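To make the consistency point concrete, here is a small simulation sketch (my own illustration, not from the thread): for standard Gaussian white noise the true autocovariance is 0 at every nonzero lag, yet the lag n-1 estimate with the n-k denominator is just x[1]*x[n], whose variance stays near 1 no matter how long the series is, while the 1/n version shrinks towards 0.

using Statistics, StatsBase

function highest_lag_variances(n; reps = 10_000)
    est_N  = [autocov(randn(n), [n - 1]; demean = false)[1] for _ in 1:reps]
    est_Nk = est_N .* n                     # rescale 1/n -> 1/(n - k) with k = n - 1
    (var_N = var(est_N), var_Nk = var(est_Nk))
end

# highest_lag_variances(100) vs highest_lag_variances(1000): var_N shrinks
# roughly like 1/n^2, while var_Nk stays around 1.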

The only real solution is something like #774, I think.

ParadaCarleton avatar Mar 21 '22 21:03 ParadaCarleton

Closed as duplicate of #774

ParadaCarleton avatar Oct 28 '23 02:10 ParadaCarleton