mathnet-numerics
Repetitive calculations in statistical distribution classes
I have noticed that many statistical distribution classes duplicate code.
For example:
- The instance version of PDF() does not call the static version with its fields passed as parameters
- Samples() does not simply call Sample() repeatedly
This violates the Don't Repeat Yourself (DRY) principle. Was this a deliberate decision, or should the distributions be updated to minimize redundancy?
Continuous Distributions with redundant instance and static implementations:
- Cauchy
- Chi
- ChiSquared
- ContinuousUniform
- Exponential
- FisherSnedecor
- InverseGamma
- Laplace
- LogNormal
- Normal
- Pareto
- Rayleigh
- Weibull
Continuous Distributions where the instance method redirects to the static implementation:
- Beta
- Erlang
- Gamma
- Stable
- StudentT
- Triangular
Thanks for pointing out this somewhat gray area. The primary reason for having two versions is that the static one has to range-check the distribution parameters on every call, while the instance one does not (its parameters were already verified when the instance was created).
However, until verified by benchmarks this is a typical case of premature optimization. Branching can be very expensive, or negligible if the CPU's branch prediction works well. I'm happy to drop the duplications though if there is no significant difference between A and B (with CDF as example):
A: loop { acc += X.CDF(a, b, z); }
B: x = new X(a,b); loop { acc += x.CumulativeDistribution(z); }
Note that we have to expect these routines to be called from within an inner loop, so being 10% faster can justify some code duplication if the duplicated code is "short" and both cases are covered by tests.
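The A/B comparison above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `ExponentialSketch` and `CdfBenchmarkSketch` are hypothetical stand-ins written for illustration, not the library's classes, and the exponential CDF is used merely because it is short. The timing uses `System.Diagnostics.Stopwatch` without warm-up or repeated runs, so the numbers would only be a first indication.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for a distribution class (assumption, not the library's code).
public sealed class ExponentialSketch
{
    private readonly double _rate;

    public ExponentialSketch(double rate)
    {
        // Parameters are validated once, at construction time.
        if (!(rate > 0.0)) throw new ArgumentOutOfRangeException(nameof(rate));
        _rate = rate;
    }

    // Case B: no parameter check, the constructor already did it.
    public double CumulativeDistribution(double x)
    {
        return x < 0.0 ? 0.0 : 1.0 - Math.Exp(-_rate * x);
    }

    // Case A: parameters are checked on every call.
    public static double CDF(double rate, double x)
    {
        if (!(rate > 0.0)) throw new ArgumentOutOfRangeException(nameof(rate));
        return x < 0.0 ? 0.0 : 1.0 - Math.Exp(-rate * x);
    }
}

public static class CdfBenchmarkSketch
{
    // Runs both loops and returns the accumulators (so the work cannot be
    // optimized away) together with the elapsed time of each loop.
    public static (double AccA, double AccB, TimeSpan TimeA, TimeSpan TimeB) Run(int iterations)
    {
        const double rate = 2.0;
        double accA = 0.0, accB = 0.0;

        // A: static call with range checking inside the loop.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            accA += ExponentialSketch.CDF(rate, i * 1e-6);
        }
        sw.Stop();
        TimeSpan timeA = sw.Elapsed;

        // B: construct once, then call the unchecked instance method.
        var dist = new ExponentialSketch(rate);
        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            accB += dist.CumulativeDistribution(i * 1e-6);
        }
        sw.Stop();

        return (accA, accB, timeA, sw.Elapsed);
    }
}
```

Calling `CdfBenchmarkSketch.Run(100_000_000)` in a release build and comparing `TimeA` with `TimeB` would give a rough answer. For a decision that matters, a dedicated harness such as BenchmarkDotNet would probably be more trustworthy than a hand-rolled loop.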
I think the classical solution to the range checking issue is this: Make a private, static method which does not range check. Then have the public instance method call the private static method, and have the public static method call the private static method after checking the range.
I can understand if you consider the resulting function bloat unacceptable, though. (I really only brought this up because I was curious myself, since I'm still learning C#)
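To make the suggestion concrete, here is a minimal sketch of that pattern. The class name `RayleighSketch` and the helper name `DensityUnchecked` are assumptions chosen for illustration; this is not the library's actual code, and the Rayleigh density is used only because the formula is short.

```csharp
using System;

// Hypothetical illustration of the pattern; all names here are assumptions.
public sealed class RayleighSketch
{
    private readonly double _scale;

    public RayleighSketch(double scale)
    {
        // Validate once; instance methods can then skip the check.
        if (!(scale > 0.0)) throw new ArgumentException("Scale must be positive.", nameof(scale));
        _scale = scale;
    }

    // Public instance method: parameters are already verified.
    public double Density(double x)
    {
        return DensityUnchecked(_scale, x);
    }

    // Public static method: check the range, then delegate.
    public static double PDF(double scale, double x)
    {
        if (!(scale > 0.0)) throw new ArgumentException("Scale must be positive.", nameof(scale));
        return DensityUnchecked(scale, x);
    }

    // Single private implementation of the formula, without range checking.
    private static double DensityUnchecked(double scale, double x)
    {
        if (x < 0.0) return 0.0;
        double s2 = scale * scale;
        return x / s2 * Math.Exp(-x * x / (2.0 * s2));
    }
}
```

The formula lives in one place, and the only duplicated piece is the parameter check itself. Whether the extra call costs anything would depend on whether the JIT inlines the small private method, which is something the benchmark would have to show.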
I'll try to do the benchmark you describe.
Indeed, we do exactly that in most distributions for the random number sampling, with the private static SampleUnchecked functions. The situation there is almost the same as with PDF/CDF, except that sampling is even more performance-sensitive (since it is almost always called in a loop).