FSharp.Stats
FSharp.Stats copied to clipboard
[Feature Request] Inverse CDF of distributions
Is your feature request related to a problem? Please describe. Inverse CDFs are useful for calculating credible intervals for a given distribution, among other things.
Describe the solution you'd like It would be great to have inverse CDFs for all distributions. But starting from normal distribution would be great.
The normal InvCDF for mean = 0
and sigma = 1
is already implemented at an inproper position:
https://github.com/fslaborg/FSharp.Stats/blob/b74ecf29651b32f66e328958931cdb2d2dd8dc0f/src/FSharp.Stats/Signal/QQPlot.fs#L91-L92
I agree, we should add quantile functions as InvCDF
for all distributions :+1:
I've added an InvCDF member to all distributions by 3d6a2201c45d792d95cc127c97c377fa4bcf496c.
I noticed the approximation of the inverse error function leads to some discrepancies when extreme values are chosen and compared to the R qnorm
procedure.
// Testing FSharp.Stats
(Distributions.Continuous.Normal.InvCDF 0. 1. 0.5) //0.
# Testing R
qnorm(0.5,0,1)
# Testing Python
from scipy.stats import norm
norm.ppf(0.5, loc=0, scale=1)
Mean | StDev | X | result FSharp.Stats | result R | result Python |
0 | 1 | 0.5 | 1.253321755e-09 | 0 | 0 |
0 | 1 | 0 | -infinity | -infinity | -infinity |
0 | 1 | 1 | infinity | infinity | infinity |
3 | 0.01 | 0.01 | 2.97673652985179 | 2.97673652126 | 2.97673652125959 |
-300000 | 5000 | 0.99 | -288368.2649258 | -288368.2606298 | -288368.2606297958 |
While the deviation is small and just occurs at extreme values, it would be worth checking if the approximation presented in Wichura, “Algorithm AS 241: The Percentage Points of the Normal Distribution.”, 1988 should be implemented.
- [x] add tests
- [x] check accuracy
References
- https://github.com/SurajGupta/r-source/blob/master/src/nmath/qnorm.c
I've implemented the quantile function of the normal distribution as described in Wichura et al.. Its accurate for 15 decimal places.
Progress of adding InvCDF/quantile functions t continuous distributions
- [x] Normal
- [ ] Gamma
- [ ] Beta
- [ ] F
- [ ] StudentT
- [ ] Studentized Range
- [x] LogNormal
- [ ] MultivariateNrmal
- [ ] Exponential
- [ ] Uniform <- should be easy
- [ ] Chi
- [ ] ChiSquared
Many distributions have no closed form of the quantile function. Besides published approximations would be beneficial to add a member for each Distribution that approximates the correct x for a given p. The CDF is continuously increasing and therefore a root finding approach should work just fine. I propose the following:
type MyDistribution =
static member PDF a b x = ...
static member CDF a b x = ...
static member InvCDF a b x = //possible no closed form exists
static member InvCDFApprox a b x accuracy =
///parameters: function (float -> float); accuracy (float); minimum (float); maximum (float); maxIterations
let tmp = Optimization.Bisection.tryFindRoot (fun x -> MyDistribution.CDF a b x - p) accuracy 0. 1. 1000
match tmp with
| Some x -> x
| None -> failwith "no InvCDF found to satisfy the given conditions"
Drawbacks
- While this should be feasible for any distribution, the optimization step may be quite slow.
- If the CDF itself is an approximation, an error propagation would inflate the InvCDF error.
To discuss:
- should the
InvCDFApprox
fail or result innan
when no root can be identified? - can the maxIterations be determined by accuracy? In my understanding, the range between min and max is divided into two sections during each iteration. Therefore, the accuracy should be coupled to the number of maximum iterations by: $$accuracy = 0.5^{maxIterations}$$ If this is correct, maxIterations could be set to $$System.Math.Log(accuracy,0.5)$$
- should the accuracy be given as float, or maybe model it as type like in Expecto.Accuracy.
- Is there a better alternative for
Optimization.Bisection
? Like e.g. NelderMead with (fun x -> abs (MyDistribution.CDF a b x - p)) as objective function. - Are there any better naming options as
InvCDFApprox
? - Are there more concerns, that I missed?