StatsBase.jl Allow and Provide Different Estimators for Autocovariance

As mentioned in #273, there are many possible estimators or calculations for the autocovariance function. For example, the autocovariance can be estimated using:

A denominator of n (or n-1 with Bessel's correction) for all lags and the naive algorithm
A denominator of n-k (or n-k-1 with Bessel's correction) for all lags k and the naive algorithm
An approximate algorithm based on the FFT, which can be calculated in O(n*log(n)) time. (I will defer to @devmotion on this, as he implemented this in MCMCChains and knows more about it than me.)
Some other estimator -- for example, Geyer's initial positive sequence estimator finds the smallest lag k where γ[k] + γ[k+1] < 0, then sets γ[k]=0 for all larger k values. (The idea being that negative autocovariance is rare, so negative values reflect noise; observing a negative value implies the noise is greater than the signal.)

I think the complexity of autocovariance estimation warrants allowing different estimators for autocovariance (like we do for the covariance) and also providing at least a couple of these.

Mar 21 '22 21:03 ParadaCarleton

It would indeed make sense to support some simple variants like we do for covariance. The more complex variants could be supported via custom AbstractCovarianceEstimator types provided by other package (again like for covariance).

Mar 23 '22 09:03 nalimilan

In MCMCChains/MCMCDiagnosticTools, we use a bit special setup, due to our application and for performance reasons, so I'm not completely sure how generally useful it is.

The implementation there, and some of the algorithms, are based on Stan and https://arxiv.org/pdf/1903.08008.pdf. The We only use the initial positive sequence estimator, which also implies that generally we only need a few lags. Hence, in contrast to what is reported in literature and done in Stan, the FFT method usually is not the most performant alternative. This is also consistent with the observation in MCMCDiagnostics, and one additional reason for why it's not the default method (another one is that it requires loading additional packages). As suggested by Geyer we also don't use denominators of n - k (or n - k - 1) in the default and the FFT method, which should basically give the same results. I don't remember the normalization of the variogram estimator from BDA right now, but I'm certain we didn't add any customizations there.

Mar 23 '22 09:03 devmotion

In MCMCChains/MCMCDiagnosticTools, we use a bit special setup, due to our application and for performance reasons, so I'm not completely sure how generally useful it is.

The implementation there, and some of the algorithms, are based on Stan and https://arxiv.org/pdf/1903.08008.pdf. The We only use the initial positive sequence estimator, which also implies that generally we only need a few lags. Hence, in contrast to what is reported in literature and done in Stan, the FFT method usually is not the most performant alternative. This is also consistent with the observation in MCMCDiagnostics, and one additional reason for why it's not the default method (another one is that it requires loading additional packages). As suggested by Geyer we also don't use denominators of n - k (or n - k - 1) in the default and the FFT method, which should basically give the same results. I don't remember the normalization of the variogram estimator from BDA right now, but I'm certain we didn't add any customizations there.

The variogram estimator always uses a denominator of n-h, which makes it much less downward-biased than the sample covariance matrix with denominator of n (what we use now).

Mar 23 '22 17:03 ParadaCarleton

As said, in the literature the use of n (and not n - k) is recommended for MCMC analysis. Surely not everyone agrees - but the nice thing is that you can just choose whatever you prefer in MCMCChains and MCMCDiagnosticTools 🙂

Mar 23 '22 17:03 devmotion

As said, in the literature the use of n (and not n - k) is recommended for MCMC analysis. Surely not everyone agrees - but the nice thing is that you can just choose whatever you prefer in MCMCChains and MCMCDiagnosticTools slightly_smiling_face

For the sample autocovariance, yes, n is better than n-k; the variogram estimator combined with the Geyer stopping rule is different though, because the differencing in the variogram yields more stable estimates. From BDA3:

Mar 25 '22 01:03 ParadaCarleton

Not sure what you want to say unfortunately. We implement exactly the version from BDA.

Mar 25 '22 21:03 devmotion

Not sure what you want to say unfortunately. We implement exactly the version from BDA.

I'm saying that the version from BDA uses n-k instead of n. Actually, this might explain why (when testing ParetoSmooth) I found it gives significantly different ESS estimates than Stan (as in, noticeably different, not practically different). I'll make an issue.

Mar 26 '22 16:03 ParadaCarleton

I'm saying that the version from BDA uses n-k instead of n.

Sure, but that's what's implemented in MCMCDiagnosticTools. I don't understand where the problem is.

Mar 26 '22 16:03 devmotion

Stan uses the FFT approach, so any other comparison will be flawed.

Mar 26 '22 16:03 devmotion

StatsBase.jl StatsBase.jl copied to clipboard

Allow and Provide Different Estimators for Autocovariance

StatsBase.jl
StatsBase.jl copied to clipboard