FSharp.Stats icon indicating copy to clipboard operation
FSharp.Stats copied to clipboard

[Feature Request] Inverse CDF of distributions

Open ghost opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe. Inverse CDFs are useful for calculating credible intervals for a given distribution, among other things.

Describe the solution you'd like It would be great to have inverse CDFs for all distributions. But starting from normal distribution would be great.

ghost avatar Apr 24 '23 12:04 ghost

The normal InvCDF for mean = 0 and sigma = 1 is already implemented at an inproper position:

https://github.com/fslaborg/FSharp.Stats/blob/b74ecf29651b32f66e328958931cdb2d2dd8dc0f/src/FSharp.Stats/Signal/QQPlot.fs#L91-L92

I agree, we should add quantile functions as InvCDF for all distributions :+1:

bvenn avatar Apr 24 '23 12:04 bvenn

I've added an InvCDF member to all distributions by 3d6a2201c45d792d95cc127c97c377fa4bcf496c. I noticed the approximation of the inverse error function leads to some discrepancies when extreme values are chosen and compared to the R qnorm procedure.

// Testing FSharp.Stats
(Distributions.Continuous.Normal.InvCDF 0. 1. 0.5)  //0.
# Testing R
qnorm(0.5,0,1)
# Testing Python
from scipy.stats import norm
norm.ppf(0.5, loc=0, scale=1)
Mean StDev X result FSharp.Stats result R result Python
0 1 0.5 1.253321755e-09 0 0
0 1 0 -infinity -infinity -infinity
0 1 1 infinity infinity infinity
3 0.01 0.01 2.97673652985179 2.97673652126 2.97673652125959
-300000 5000 0.99 -288368.2649258 -288368.2606298 -288368.2606297958

While the deviation is small and just occurs at extreme values, it would be worth checking if the approximation presented in Wichura, “Algorithm AS 241: The Percentage Points of the Normal Distribution.”, 1988 should be implemented.

  • [x] add tests
  • [x] check accuracy

References

  • https://github.com/SurajGupta/r-source/blob/master/src/nmath/qnorm.c

bvenn avatar Apr 25 '23 15:04 bvenn

I've implemented the quantile function of the normal distribution as described in Wichura et al.. Its accurate for 15 decimal places.

bvenn avatar May 02 '23 06:05 bvenn

Progress of adding InvCDF/quantile functions t continuous distributions

  • [x] Normal
  • [ ] Gamma
  • [ ] Beta
  • [ ] F
  • [ ] StudentT
  • [ ] Studentized Range
  • [x] LogNormal
  • [ ] MultivariateNrmal
  • [ ] Exponential
  • [ ] Uniform <- should be easy
  • [ ] Chi
  • [ ] ChiSquared

bvenn avatar Jul 04 '23 08:07 bvenn

Many distributions have no closed form of the quantile function. Besides published approximations would be beneficial to add a member for each Distribution that approximates the correct x for a given p. The CDF is continuously increasing and therefore a root finding approach should work just fine. I propose the following:

type MyDistribution =

    static member PDF a b x = ...

    static member CDF a b x = ...

    static member InvCDF a b x = //possible no closed form exists

    static member InvCDFApprox a b x accuracy = 
        ///parameters: function (float -> float); accuracy (float); minimum (float); maximum (float); maxIterations
        let tmp = Optimization.Bisection.tryFindRoot (fun x -> MyDistribution.CDF a b x - p) accuracy 0. 1. 1000
        match tmp with 
        | Some x -> x
        | None -> failwith "no InvCDF found to satisfy the given conditions"

Drawbacks

  • While this should be feasible for any distribution, the optimization step may be quite slow.
  • If the CDF itself is an approximation, an error propagation would inflate the InvCDF error.

To discuss:

  • should the InvCDFApprox fail or result in nan when no root can be identified?
  • can the maxIterations be determined by accuracy? In my understanding, the range between min and max is divided into two sections during each iteration. Therefore, the accuracy should be coupled to the number of maximum iterations by: $$accuracy = 0.5^{maxIterations}$$ If this is correct, maxIterations could be set to $$System.Math.Log(accuracy,0.5)$$
  • should the accuracy be given as float, or maybe model it as type like in Expecto.Accuracy.
  • Is there a better alternative for Optimization.Bisection? Like e.g. NelderMead with (fun x -> abs (MyDistribution.CDF a b x - p)) as objective function.
  • Are there any better naming options as InvCDFApprox?
  • Are there more concerns, that I missed?

bvenn avatar Jul 20 '23 12:07 bvenn