Replace known-problematic variance algorithm (for `column_variance`)
The existing algorithm for `column_variance` uses the textbook formula, E[X^2] - E[X]^2, which is well established to have numerical issues. While the intention of the extant algorithm (traversal of the elements in column-major order) is apparent, we should not sacrifice precision when we do not need to -- the two-pass algorithm for variance (N.B. the existing algorithm already makes two passes anyway), using the formula E[(x - E[x])(x - E[x])], can be substituted without issue.
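To make the difference concrete, here is a minimal sketch of the two formulas on a plain `f64` slice (the function names are mine and purely illustrative; this is not nalgebra code):

```rust
/// Textbook formula E[X^2] - E[X]^2: subtracts two large, nearly equal
/// quantities when the mean is large relative to the spread, so most of the
/// significant digits can cancel away.
fn variance_textbook(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let sum: f64 = xs.iter().sum();
    let sum_sq: f64 = xs.iter().map(|x| x * x).sum();
    sum_sq / n - (sum / n) * (sum / n)
}

/// Two-pass formula E[(x - E[x])(x - E[x])]: the deviations from the mean are
/// small, so squaring and summing them does not cancel catastrophically.
fn variance_two_pass(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    xs.iter().map(|x| (x - mean) * (x - mean)).sum::<f64>() / n
}
```

Both are the population variance (division by n), matching the divide-by-n convention shown in the existing `row_variance`/`column_variance` doc examples.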
Notably, the other variance implementations in the statistics module use E[(x - E[x])(x - E[x])]. Loss of precision aside, keeping the existing implementation of `column_variance` causes the obvious absurdity:
```rust
use nalgebra::Matrix2x3;

let m = Matrix2x3::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
assert_ne!(m.column_variance().transpose(), m.transpose().row_variance());
```
We can eliminate both the loss of precision and the glaring inconsistency by switching to the implementation provided by this PR.
For a comprehensive analysis of variance algorithms, see this reference, in particular Table 2. The "two-pass" algorithm described in the paper is the implementation given in this PR. In terms of simplicity (and hence maintainability), "two-pass" is a suitable choice; in terms of runtime performance and precision, it strikes a good balance (cf. Youngs & Cramer and "textbook"). Furthermore, it is consistent with the variance algorithm used by the other `*variance` methods in the statistics module.
I considered the `compress_columns` route, but wanted to avoid unnecessary allocations with dynamically-sized data.
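For readers skimming the diff, a rough sketch of the two-pass approach for the quantity `column_variance` returns, written against the public `DMatrix` API only (this is not the PR's actual code, which works generically over scalar types and storage):

```rust
use nalgebra::{DMatrix, DVector};

// Illustrative only: per-row variance across columns (the shape of the value
// `column_variance` returns), computed with the two-pass formula.
fn column_variance_two_pass(m: &DMatrix<f64>) -> DVector<f64> {
    let (nrows, ncols) = m.shape();
    let mut out = DVector::zeros(nrows);
    if ncols == 0 {
        // Assumption for this sketch: return zeros when there are no columns;
        // the library's actual edge-case behavior may differ.
        return out;
    }
    for i in 0..nrows {
        // First pass: the mean of row i across all columns.
        let mean = (0..ncols).map(|j| m[(i, j)]).sum::<f64>() / ncols as f64;
        // Second pass: the mean squared deviation from that mean.
        out[i] = (0..ncols)
            .map(|j| {
                let d = m[(i, j)] - mean;
                d * d
            })
            .sum::<f64>()
            / ncols as f64;
    }
    out
}
```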
Stats aren't really my forte, but this looks good to me. However, if there is an improvement in precision, it would be good to verify it with a test that fails with the old implementation and passes with the new one.
I'm also concerned about the general lack of tests for this method. There appears to be only the single doc test in its documentation. This is not the fault of @andrewjradcliffe, of course, but it would be nice to have a few more tests that ensure that we're not introducing any new bugs here.
> it would be nice to have a few more tests that ensure that we're not introducing any new bugs here.

Indeed, it would be nice to have test coverage for the statistics-related methods -- though computation of variance is really the only place mistakes can occur.
I take it the proposal is for me to write a set of tests which fail under the current algorithm -- due to loss of precision -- but succeed with the new?
> it would be nice to have a few more tests that ensure that we're not introducing any new bugs here.
>
> Indeed, it would be nice to have test coverage for the statistics-related methods -- though computation of variance is really the only place mistakes can occur.

Well, I think ideally we would already have tests for:
- a square matrix
- a rectangular matrix `m x n` with `m < n`
- a rectangular matrix `m x n` with `m > n`
- edge cases: `0 x 0`, `m x 0`, `0 x n`
Unfortunately we don't have this at the moment, so we can only hope that the existing implementation works correctly for these cases. However, I'm a little reluctant to merge a new implementation without these tests, at the very least to make sure we don't regress on any of these cases. Would you be able to add these tests?
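Something along these lines might cover those shapes (the values, names, and the expected edge-case behavior are illustrative assumptions on my part, not existing tests):

```rust
use nalgebra::{DMatrix, DVector};

#[test]
fn column_variance_square() {
    let m = DMatrix::from_row_slice(2, 2, &[1.0, 3.0, 5.0, 9.0]);
    // Population variances of the rows: {1, 3} -> 1.0 and {5, 9} -> 4.0.
    assert_eq!(m.column_variance(), DVector::from_vec(vec![1.0, 4.0]));
}

#[test]
fn column_variance_agrees_with_row_variance_of_transpose() {
    // The transpose-consistency property from the PR description.
    let m = DMatrix::from_row_slice(2, 3, &[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]);
    assert_eq!(m.column_variance().transpose(), m.transpose().row_variance());
}

#[test]
fn column_variance_empty() {
    // Assumption: an empty matrix yields an empty result rather than panicking.
    let m = DMatrix::<f64>::zeros(0, 0);
    assert_eq!(m.column_variance().len(), 0);
}
```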
> I take it the proposal is for me to write a set of tests which fail under the current algorithm -- due to loss of precision -- but succeed with the new?

I think, if the claim is that the new implementation has higher precision, it would indeed be prudent to verify that it actually delivers this -- at least for one test case.
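For what it's worth, such a test might look something like the sketch below (the numbers and the tolerance are illustrative; `f32` with a large common offset makes the cancellation in the textbook formula visible):

```rust
use nalgebra::Matrix1x3;

#[test]
fn column_variance_survives_a_large_mean() {
    // The data are {1, 2, 3} shifted by 1e4; the exact population variance is 2/3.
    // With f32, the textbook formula E[X^2] - E[X]^2 loses essentially all
    // significant digits here, while the two-pass formula stays close to 2/3.
    let m = Matrix1x3::<f32>::new(1.0e4 + 1.0, 1.0e4 + 2.0, 1.0e4 + 3.0);
    let var = m.column_variance()[0];
    assert!((var - 2.0_f32 / 3.0).abs() < 1.0e-4);
}
```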