Variance of batch processing results
Currently, the summary statistics for indicator values in batch processing are combined by nesting them. For variance, this leads to unexpected results.
Normally, you would estimate the variance as:
$Var(X) = \frac{1}{N - 1} \sum_i (x_i - \overline{x})^2$
which we can decompose into
$Var(X) = \frac{\sum_i (n_i - 1) \, s_i^2 + \sum_i n_i \, (m_i - M)^2}{N - 1}$
where
- $n_i$ are the lengths of the batched vectors.
- $s_i^2$ are the variances of the batched vectors.
- $m_i$ are the means of the batched vectors.
- $M$ is the overall mean of the original vector, which can be obtained as the weighted mean of the batched means.
- $N$ is the total size of the original vector.
This works in general, but I have not yet looked into how practical it would be to implement in the package; I just wanted to open this issue as an FYI for now.
```r
x <- runif(100)
chunk_size <- 3
x_chunk <- split(x, ceiling(seq_along(x) / chunk_size))

# per-chunk summary statistics
n_i <- sapply(x_chunk, length)   # chunk lengths
s_i2 <- sapply(x_chunk, var)     # chunk variances (NA for chunks of length 1)
m_i <- sapply(x_chunk, mean)     # chunk means

N <- sum(n_i)                    # total size of the original vector
M <- sum(n_i * m_i) / N          # overall mean as weighted mean of chunk means

# na.rm = TRUE drops the (n_i - 1) * s_i^2 term for length-1 chunks,
# which is correct since that term is zero when n_i == 1
chunked_var <- (sum((n_i - 1) * s_i2, na.rm = TRUE) +
  sum(n_i * (m_i - M)^2, na.rm = TRUE)) / (N - 1)

testthat::expect_equal(
  var(x),
  chunked_var
)
```
Thank you for looking into this! I suppose the right place to hook up a function like this to aggregate chunks is here:
https://github.com/mapme-initiative/mapme.biodiversity/blob/f92256e727d92ea816e558d8b9cb890c41e3c85f/R/chunking.R#L130-L144
Thanks, this is the place I was looking for. Before I start on this, should I wait for your performance branches to be merged? I don't think it would cause too much friction, actually.
No, please go ahead and open a PR if you are ready.