Variance of batch processing results
Currently, the summary statistics for indicator values in batch processing are combined by nesting them. For variance, this leads to unexpected results.
Normally, you would estimate the variance as:
$Var(X) = \frac{1}{N - 1} \sum_i (x_i - \overline{x})^2$
which we can decompose into
$Var(X) = \frac{\sum_i (n_i - 1) \, s_i^2 + \sum_i n_i \, (m_i - M)^2}{N - 1}$
where
- $n_i$ are the lengths of the batched vectors.
- $s_i^2$ are the variances of the batched vectors.
- $m_i$ are the means of the batched vectors.
- $M$ is the overall mean of the original vector, which can be obtained as the weighted mean of the batched means.
- $N$ is the total size of the original vector.
This works in general, but I have not yet looked into how practical it would be to implement in the package; I just wanted to open this issue as an FYI for now.
```r
x <- runif(100)
chunk_size <- 3
x_chunk <- split(x, ceiling(seq_along(x) / chunk_size))

# per-chunk summary statistics
n_i <- sapply(x_chunk, length)   # chunk lengths
s_i2 <- sapply(x_chunk, var)     # chunk variances (NA for chunks of length 1)
m_i <- sapply(x_chunk, mean)     # chunk means

N <- sum(n_i)                    # total size of the original vector
M <- sum(n_i * m_i) / N          # overall mean as weighted mean of chunk means

# na.rm = TRUE drops the (n_i - 1) * s_i^2 term for length-1 chunks,
# which is correct since that term is zero when n_i == 1
chunked_var <- (sum((n_i - 1) * s_i2, na.rm = TRUE) +
  sum(n_i * (m_i - M)^2, na.rm = TRUE)) / (N - 1)

testthat::expect_equal(
  var(x),
  chunked_var
)
```
Thank you for looking into this! I suppose the right place to hook up a function like this to aggregate chunks is here:
https://github.com/mapme-initiative/mapme.biodiversity/blob/f92256e727d92ea816e558d8b9cb890c41e3c85f/R/chunking.R#L130-L144
Thanks, this is the place I was looking for. Before I start on this, should I wait for your performance branches to be merged? I don't think it would cause too much friction, actually.
No, please go ahead and open a PR if you are ready.