
Unit tests for precision

Open kno10 opened this issue 2 years ago • 3 comments

I would not just rely on a blog post that is based on Wikipedia... Instead, devise unit tests to check the precision and performance of the methods.

A discussion of different approaches, their performance, and their accuracy can be found in the publication:

Schubert, Erich; Gertz, Michael (9 July 2018). Numerically stable parallel computation of (co-)variance. ACM. p. 10. doi:10.1145/3221269.3223036. ISBN 9781450365055. S2CID 49665540.

It also discusses AVX parallelization to further improve performance, but that is mostly relevant for the univariate case.

Either way, it discusses the main numerical issue with the naive approach, and hence how to test accuracy against this problem.

(And, no, 1e-3 is not considered precise. If you really want high precision, you will likely even want to use Kahan summation or the Shewchuk algorithm, if you are willing to trade off run-time performance.)
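To make that concrete, here is a minimal sketch of such an accuracy test, in the spirit of the shifted-data setting discussed in the paper (the function name, offset, and tolerances are illustrative assumptions, not taken from the repository):

import numpy as np

def naive_variance(x):
    # "Algebraic" one-pass formula E[x^2] - E[x]^2: both terms are ~1e16 for
    # data centered around 1e8, so their difference cancels catastrophically.
    x = np.asarray(x, dtype=np.float64)
    return np.mean(x * x) - np.mean(x) ** 2

def test_variance_precision_under_offset():
    rng = np.random.default_rng(0)
    x = rng.standard_normal(10_000)   # variance ~1
    shifted = x + 1e8                 # same variance, but far from the origin
    reference = np.var(x)
    # numpy's two-pass computation survives the offset...
    assert abs(np.var(shifted) - reference) < 1e-5
    # ...while the algebraic formula is typically off by O(1) or worse.
    assert abs(naive_variance(shifted) - reference) > 1e-3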

kno10 avatar Jan 08 '22 16:01 kno10

Hi Erich,

Wow, thanks for your paper. Exactly what I was looking for. Interesting comments re Postgres, Spark etc.

And yes, best to live up to both meanings of "precise" I guess - though the name refers to the inverse covariance (precision) matrix. That said, I fixed, for now, the unit test I think you were referring to. The author had confused sample and population estimates, and the 1e-3 tolerance masked that :) Tolerance is 1e-12 now.
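As an aside, a minimal illustration of how a loose tolerance can hide exactly that sample-vs-population mix-up (the sample size here is an assumption for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

pop = np.var(x)            # population variance, divides by n
samp = np.var(x, ddof=1)   # sample variance, divides by n - 1

# The two differ by roughly var/n ~ 1e-4 here, so a 1e-3 tolerance lets the
# mix-up slip through, while a 1e-12 tolerance catches it immediately.
print(abs(pop - samp))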

Do you have code floating around for the weighted cov updating in Python (or another language)? Or unit tests with problematic cases?

Peter

microprediction avatar Jan 13 '22 05:01 microprediction

For this kind of code, pure Python is unsuitable - the Python interpreter is too slow and the boxing overheads are too large. You need to use at least Cython.

The main problematic cases are discussed in the paper. If the data is not centered, but say centered around 1e8 with a variance of 1, then the "algebraic" approach tends to break down because of floating point precision issues.
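For the streaming setting the paper addresses, Welford's online update is the standard numerically stable alternative to that algebraic formula; here is a minimal unweighted, single-variable sketch (not the paper's weighted or parallel variant):

import random

def welford_variance(xs):
    # Welford's online algorithm: one pass, incrementally updating the mean and
    # the sum of squared deviations (m2), so no huge cancelling terms appear.
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n  # population variance; use m2 / (n - 1) for the sample estimate

random.seed(0)
data = [random.gauss(1e8, 1.0) for _ in range(100_000)]
print(welford_variance(data))  # close to 1.0, where E[x^2] - E[x]^2 falls apart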

If you want more difficult cases - even computing the mean of 1e100,3,-1e100 is surprisingly difficult:

>>> sum([1e100,3,-1e100])/3
0.0
>>> import numpy
>>> numpy.mean([1e100,3,-1e100])
0.0

but IMHO handling this by default (e.g., using Kahan or Shewchuk) is too expensive.
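For reference, a minimal sketch of the two remedies mentioned, applied to that example (math.fsum is Python's built-in Shewchuk-style exact summation):

import math

data = [1e100, 3, -1e100]

# Shewchuk-style exact accumulation recovers the true sum.
print(math.fsum(data) / 3)   # 1.0

def kahan_sum(xs):
    # Classic Kahan compensated summation: carry the low-order bits lost in
    # each addition in a separate compensation term.
    total = 0.0
    c = 0.0
    for x in xs:
        y = x - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

# Kahan helps against gradual error accumulation, but this particular
# adversarial case still defeats it: the result is 0.0, not 1.0.
print(kahan_sum(data) / 3)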

kno10 avatar Jan 23 '22 16:01 kno10

Yes, I've encountered the somewhat unpredictable rounding issue in a different context. However, most people's problems don't seem to hit this. Certainly I have yet to, touch wood (with the possible exception of the Huber method - and there a simple rescaling solved the problem).

While it is nice to have an empirical covariance calculation that holds up over extreme ranges, that isn't necessarily all that useful for my problems. The empirical covariance isn't terribly competitive as a forecaster of future covariance, which is the primary goal here.

As for introducing Cython, I might go there eventually. However, to reference the Huber example again, I very much doubt it would lead to a performance improvement, since most numpy functions are vectorized, including the Newton method (as I discovered to my pleasant surprise).

microprediction avatar Jan 26 '22 12:01 microprediction