stdlib icon indicating copy to clipboard operation
stdlib copied to clipboard

Proposal for statistical distributions

Open Jim-215-Fisher opened this issue 4 years ago • 37 comments

Besides common descriptive statistics, we need standard modules for various continuous statistical distribution (e.g., gamma distribution) and discrete distribution (e.g., bernoulli distribution). These statistical distributions will be very useful to various computer simulation techniques. Even though these functions are available in Scipy package for python, I think it is worthwhile to have in stdlib with pure Fortran. There are plenty of source codes on the net, we just need to convert them.

Jim-215-Fisher avatar Sep 21 '20 16:09 Jim-215-Fisher

Prior art.

  • GSL has a comprehensive section of random number distributions that should be available in Fortran though the fgsl module. This should be stable and well maintained.
  • Richard Chandler and Paul Northrop from UCL have developed a nice library in ancient FORTRAN77. Latest update has been in 2003. License is very "liberal" (they only ask to be cited) but the interface is not simple. Documentation is minimal but seems sufficient. No examples or tutorials are provided.
  • A library for normal distributions only, one more generic with a large number of distributions and another one focussed on cumulative density function are provided on Jacob Burkardt site. This is Fortran 90 compliant code. The license is GNU LGPL and documentation is adequate with examples.
  • This post in a blog is the translation in Modern Fortran of a public domain, simple, Julia library about different distributions. It's simple but documentation is almost non-existent.
  • The site of Jean-Pierre Moreau has a number of files available (apparently in a somewhat modern Fortran). However, some files appear to be taken from the book "Numerical Recipes" (which has a peculiar license) and it is not clear if the distributions are included.

epagone avatar Sep 21 '20 18:09 epagone

I agree with @Jim-215-Fisher 's comment. Thank you for the suggestion.

To add to the prior art mentioned by @epagone:

  • RANLIB: Fortran90 (license: LGPL) (and its F77 version). I believe this library is also used in Octave.

jvdp1 avatar Sep 21 '20 19:09 jvdp1

Should stdlib be based on GSL through fgsl?

Jim-215-Fisher avatar Sep 21 '20 23:09 Jim-215-Fisher

@Jim-215-Fisher Unfortunately GSL is GPL licensed, which prohibits its inclusion into stdlib which is MIT licensed.

certik avatar Sep 22 '20 02:09 certik

Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html (a mirror exists at https://jblevins.org/mirror/amiller/)

ivan-pi avatar Sep 22 '20 10:09 ivan-pi

i agree with that: it is an extensive collection provided by a well-known professional in the field. It may require some reorganisation, as it is not organised in modules and the like, but it is certainly worth our while.

Op di 22 sep. 2020 om 12:59 schreef Ivan [email protected]:

Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html (a mirror exists at https://jblevins.org/mirror/amiller/)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-696649490, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRZSJTTC675SELMLNWLSHB7Q3ANCNFSM4RUXIQFQ .

arjenmarkus avatar Sep 22 '20 11:09 arjenmarkus

i agree with that: it is an extensive collection provided by a well-known professional in the field. It may require some reorganisation, as it is not organised in modules and the like, but it is certainly worth our while. Op di 22 sep. 2020 om 12:59 schreef Ivan [email protected]: Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html (a mirror exists at https://jblevins.org/mirror/amiller/) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#234 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRZSJTTC675SELMLNWLSHB7Q3ANCNFSM4RUXIQFQ .

Indeed, I agree. However, it is unclear to me what the license is for these files.

jvdp1 avatar Sep 22 '20 11:09 jvdp1

It says that the code written by Alan Miller is in the public domain, I will ask Jason what this means exactly.

Op di 22 sep. 2020 om 13:07 schreef Jeremie Vandenplas < [email protected]>:

i agree with that: it is an extensive collection provided by a well-known professional in the field. It may require some reorganisation, as it is not organised in modules and the like, but it is certainly worth our while. Op di 22 sep. 2020 om 12:59 schreef Ivan [email protected]: … <#m_-2168024105408715737_> Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html (a mirror exists at https://jblevins.org/mirror/amiller/) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#234 (comment) https://github.com/fortran-lang/stdlib/issues/234#issuecomment-696649490>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRZSJTTC675SELMLNWLSHB7Q3ANCNFSM4RUXIQFQ .

Indeed, I agree. However, it is unclear to me what the license is for these files.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-696652720, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR2XPP2KHV2B46IZE3LSHCANJANCNFSM4RUXIQFQ .

arjenmarkus avatar Sep 22 '20 11:09 arjenmarkus

If it is public domain, then there is no copyrights.

Jim-215-Fisher avatar Sep 22 '20 17:09 Jim-215-Fisher

In terms of license, as long as there are mathematic formulae, I think license issue should not be a concern.

Based on comments so far, it looks like majority agreed to have statistical distributions in stdlib. So the next question is what kind of API should be. We can either define a data type for each distribution and various type-bound procedures, or define various procedures for each type of distribution directly in the stdlib module. For example, normal distribution one could have:

module stdlib type :: norm_dist_t real :: x real :: loc real :: scale

contains generic :: pdf => norm_dist_pdf generic :: cdf => norm_dist_cdf . . . end type norm_dist_t

contains {procedure definition} . . . end module stdlib

or directly in stdlib module:

module stdlib function norm_dist_pdf(x, loc, scale) end function norm_dist_pdf . . . end module stdlib

The first one is object oriented, data and procedure are encapsulated. The second one is more traditional like intrinsic function call familiar to users. Which one is better?

Jim-215-Fisher avatar Sep 22 '20 22:09 Jim-215-Fisher

In these matters it is not likely that there is a "best" solution. It is a matter of taste, I would say. But I am leaning towards a procedural style in this. Typical use of this functionality:

  • I have a time series and I want to see if it can be fitted to a normal or log-normal distribution. In that case, I would pass the time series to some function that uses a relevant statistical test (say Lillifors) to determine whether the fit is good enough. Rather than setting up an object that takes the confidence level and perhaps a few other data, why not use a simple function?
  • I have determined the mean and standard deviation of my data set and new data come in. Do they follow the same distribution? Again a function seems more appropriate.

I do see the attractiveness of an object-oriented interface, especially if you want to examine several different data sets against the same distribution, but it feels indirect in many cases. So I would prefer a functional/procedural interface, at least for the moment. An OO interface can be added later.

Op wo 23 sep. 2020 om 00:56 schreef Jing [email protected]:

In terms of license, as long as there are mathematic formulae, I think license issue should not be a concern.

Based on comments so far, it looks like majority agreed to have statistical distributions in stdlib. So the next question is what kind of API should be. We can either define a data type for each distribution and various type-bound procedures, or define various procedures for each type of distribution directly in the stdlib module. For example, normal distribution one could have:

module stdlib type :: norm_dist_t real :: x real :: loc real :: scale

contains generic :: pdf => norm_dist_pdf generic :: cdf => norm_dist_cdf . . . end type norm_dist_t

contains {procedure definition} . . . end module stdlib

or directly in stdlib module:

module stdlib function norm_dist_pdf(x, loc, scale) end function norm_dist_pdf . . . end module stdlib

The first one is object oriented, data and procedure are encapsulated. The second one is more traditional like intrinsic function call familiar to users. Which one is better?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-697025016, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR74DUUVPIAE2UBA4WDSHETTHANCNFSM4RUXIQFQ .

arjenmarkus avatar Sep 23 '20 08:09 arjenmarkus

My preference is a procedural style. This choice "procedural" vs "OOP" style has been already discussed in previous issues (I can't find them back), and, if I remember wel, procedural style should be prefered over OOP. Of course similar functionality in a OOP style could be always proposed on top of the procedural ones.

jvdp1 avatar Sep 23 '20 18:09 jvdp1

Yeh, I have checked the style guide and can't find any recommendation. Anyway, procedure style is the Fortran way. I will go ahead to implement a small module.

Jim-215-Fisher avatar Sep 23 '20 22:09 Jim-215-Fisher

@Jim-215-Fisher we should put it into the style guide, great idea. Here are some links where it was discussed in the past:

  • https://github.com/fortran-lang/stdlib/issues/220#issuecomment-663111227
  • https://github.com/fortran-lang/stdlib/issues/14#issuecomment-570442716
  • https://github.com/fortran-lang/stdlib/issues/229#issuecomment-693447968
  • https://github.com/fortran-lang/stdlib/issues/135#issuecomment-597139437

certik avatar Sep 23 '20 23:09 certik

@certik Thanks for the links. BTW, is there a table/list/link showing status of each stdlib module/proposal?

Jim-215-Fisher avatar Sep 24 '20 01:09 Jim-215-Fisher

The latest status is in the open issue and PR for a given proposal. We don't have a nice table summarizing it.

On Wed, Sep 23, 2020, at 7:19 PM, Jing wrote:

@certik https://github.com/certik Thanks for the links. BTW, is there a table/list/link showing status of each stdlib module/proposal?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-698055250, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAFAWGOUJQG5L5OALIM5ITSHKNDHANCNFSM4RUXIQFQ.

certik avatar Sep 24 '20 03:09 certik

It says that the code written by Alan Miller is in the public domain, I will ask Jason what this means exactly.

Arjen, thanks for writing to ask about the licensing. When I took over hosting Alan Miller's files no license was stated. So I asked him about that and he told me he intended his code to be public domain. So, his work can be incorporated into libraries, such as this one, with other licenses.

jrblevin avatar Sep 25 '20 14:09 jrblevin

Hi Jason,

great, this work deserves widespread use. And having it available via the Fortran Wiki and hopefully at some point via the standard library (or something similar) will make it much easier.

Op vr 25 sep. 2020 om 16:09 schreef Jason Blevins <[email protected]

:

It says that the code written by Alan Miller is in the public domain, I will ask Jason what this means exactly.

Arjen, thanks for writing to ask about the licensing. When I took over hosting Alan Miller's files no license was stated. So I asked him about that and he told me he intended his code to be public domain. So, his work can be incorporated into libraries, such as this one, with other licenses.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-698950724, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR5H3ALI5OV5IHOSLOLSHSP73ANCNFSM4RUXIQFQ .

arjenmarkus avatar Sep 25 '20 14:09 arjenmarkus

Thank oyu @jrblevin and @arjenmarkus for these explanations. These codes would be a good start for this proposal IMO.

jvdp1 avatar Sep 25 '20 16:09 jvdp1

In terms of Alan Miller's code, should we use his code/module directly, or reorganize it according to current style?

Jim-215-Fisher avatar Sep 27 '20 00:09 Jim-215-Fisher

I would say we need to reorganize it - make sure things are consistent within the standard library. That will help people to understand how to use it and to avoid certain types of mistakes.

Op zo 27 sep. 2020 om 02:22 schreef Jing [email protected]:

In terms of Alan Miller's code, should we use his code/module directly, or reorganize it according to current style?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-699564501, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR3AE3OZ7AA3OV5VG3LSH2ATZANCNFSM4RUXIQFQ .

arjenmarkus avatar Sep 27 '20 10:09 arjenmarkus

I agree with @arjenmarkus . We will need to re-organize the code (and probably re-write some parts) to be consistent with stdlib style guide. It is anyway a great start.

It seems that the people involved in this thread would agree to use Alan Miller 's code as a starting point, and that a procedural approach should be used (in agreement with several other discussions). If I may propose, @Jim-215-Fisher would you be interested to start a proposal for API?

jvdp1 avatar Sep 28 '20 06:09 jvdp1

Yes, I am working on it, hopefully send PR soon.

Jim-215-Fisher avatar Sep 28 '20 13:09 Jim-215-Fisher

Hi. A certain number of the routines from Alan Miller's site are from other sources - for example, the Applied Statistics routines have a funny licence (no commercial use), which tripped me up when I used them for an R package. The Rmath library might be a good model for what working statisticians want (esp in terms of numerics).

David-Duffy avatar Oct 08 '20 00:10 David-Duffy

The initial implementation of this proposal is ready for PR. Anyone interested is welcome to review it. I have implemented three distributions at the moment, uniform, normal and binomial distribution random number generators and pdf, cdf functions. Will implement more distributions and functions after collecting comments and suggestions. The code is in my branch https://github.com/Jim-215-Fisher/stdlib/tree/stats_distribution

Jim-215-Fisher avatar Oct 10 '20 20:10 Jim-215-Fisher

The initial implementation of this proposal is ready for PR. Anyone interested is welcome to review it. I have implemented three distributions at the moment, uniform, normal and binomial distribution random number generators and pdf, cdf functions. Will implement more distributions and functions after collecting comments and suggestions. The code is in my branch https://github.com/Jim-215-Fisher/stdlib/tree/stats_distribution

Thank you @Jim-215-Fisher . Can you open a PR? It will be easier for reviewing and discussing the API and code.

jvdp1 avatar Oct 10 '20 20:10 jvdp1

I'd like to call out this RNG library, which I use regularly. There is a Fortran wrapper to the version producing 32bit random values online but I think it's worth implementing the whole thing. I use this library with C++. There the API is that one first instantiates an RNG object (say pcg), and a distribution object (say normal_dist) and calls like normal_dist(pcg) return a normal variate using this particular RNG. I find this API very smooth and flexible. Allows switching out RNGs. But I have no problem with a strictly procedural implementation either.

Also, looking at some of the implementations in the links, I see that a few of them generate 2 uniform random numbers to return only 1 variate. See for example, the Box-Muller implementation in GSL, where they explain their choice. This is may be a bit wasteful as two covariates might be returned instead of 1, when the algorithm produces 2 independent covariates. Saving one of the covariates computed using a SAVE attribute or simply filling entire arrays at once may be a better implementation.

esterjo avatar Dec 17 '20 14:12 esterjo

I'd like to call out this RNG library, which I use regularly. There is a Fortran wrapper to the version producing 32bit random values online but I think it's worth implementing the whole thing. I use this library with C++. There the API is that one first instantiates an RNG object (say pcg), and a distribution object (say normal_dist) and calls like normal_dist(pcg) return a normal variate using this particular RNG. I find this API very smooth and flexible. Allows switching out RNGs. But I have no problem with a strictly procedural implementation either.

I was intrigued by the simplicity of this generator and attempted a quick Fortran version. I haven't verified it's correctness yet, but the timings are promising (roughly the same order as the intrinsic random_number subroutines in gfortran and Intel fortran). I have compared the bit sequences of the random integers from this version, and the Fortran wrapper of the C version, and they are equal. The conversion from signed integers to double is still problematic. Portability might also be an issue.

I would suggest preserving the discussion of a random number generator object under issue https://github.com/fortran-lang/stdlib/issues/135.

ivan-pi avatar Dec 17 '20 19:12 ivan-pi

Just for reference here is libstdc++'s implementation of the gaussian variate generator, which is a very standard implementation of Marsaglia's algorithm with each call saving one of the two variates generated for the next call: https://gcc.gnu.org/onlinedocs/libstdc++/latest-doxygen/a15735_source.html#l01783

esterjo avatar Dec 20 '20 04:12 esterjo

Just looking at Thomas, D. B., Luk. W., Leong, P. H. W., and Villasenor, J. D. 2007. Gaussian random number generators. ACM Comput. Surv. 39, 4, Article 11 (October 2007) http://doi.acm.org/10.1145/1287620.1287622

In my code, I use the ratio of uniforms (2 uniforms per single gaussian) algo, but this review suggests a Ziggurat algorithm as accurate (ie especially producing the correct number of extreme deviates), uncorrelated, and 5 times faster than Box-Muller). I have a vague memory of Fortran code for this somewhere.

David-Duffy avatar Dec 21 '20 00:12 David-Duffy

There is a useful discussion of implementing the ziggurat gaussian RNG for numpy here: https://github.com/numpy/numpy/issues/2047

David-Duffy avatar Dec 21 '20 02:12 David-Duffy

In my code, I use the ratio of uniforms (2 uniforms per single gaussian) algo, but this review suggests a Ziggurat algorithm as accurate (ie especially producing the correct number of extreme deviates), uncorrelated, and 5 times faster than Box-Muller). I have a vague memory of Fortran code for this somewhere.

Maybe this one?

epagone avatar Dec 21 '20 12:12 epagone

@David-Duffy you're absolutely right. It appears Ziggurat is the fastest option for gaussians, and doesn't seem to be worst statistically from what I've been reading.

There's a paper on a generalized version to unimodal distributions with unbounded support. According to the authors performance is better than, for example, what is used by GCC's implementation of random.h or in Boost's random.h. The paper gives thorough pseudo-code too, which I appreciate.

EDIT: However, on GPU the reverse seems to hold, because "...in a GPU, the performance of the slow route will apply to all threads in a warp, even if only one thread uses the route. If the warp size is 32 and the probability of taking the slow route is 2 percent, then the probability of any warp taking the slow route is (1 - 0.02)^32, which is 47 percent! So, because of thread batching, the assumptions designed into the ziggurat are violated, and the performance advantage is destroyed."

esterjo avatar Dec 22 '20 21:12 esterjo

"a generalized version" - yes, I had been looking at their C++ code, but dreading how how long it would take to fully test a Fortran port. One reason Boost etc are using time-tested algorithms rather than bleeding edge,

David-Duffy avatar Dec 22 '20 23:12 David-Duffy

The R documents provide a listing of the available probability distributions. Considering another, independent request for better interop with R, it might save some effort to follow the R conventions as much as possible.

Here is the list: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Probability-distributions

Is there another discussion on what should be in the statistical methods portion of the Fortran Standard Library? The GNU Scientific Library stats section is limited. I also think the R stats package provides a good model of what should be included, but that might be too ambitious. If you have R installed, open up the REPL and type:

library(help="stats") for a list of what is available.

rryley avatar Jan 15 '21 13:01 rryley

Is there another discussion on what should be in the statistical methods portion of the Fortran Standard Library?

@rryley Different PRs were already submitted regarding probability distributions (#271, #272, #273, #26, #278, #286, ...). It seems some of them cover the R probability distributions.

jvdp1 avatar Jan 15 '21 13:01 jvdp1

I want to draw attention in this issue thread to algorithms for geometric and binomial variates published from 2013 through 2015, by Bringmann and colleagues as well as Farach-Colton/Tsai, which were designed especially for small p parameters (geometric) or large numbers of trials (binomial).

See my notes on samplers for both distributions.

REFERENCES:

  • Binomial: K. Bringmann, F. Kuhn, et al., “Internal DLA: Efficient Simulation of a Physical Growth Model.” In: Proc. 41st International Colloquium on Automata, Languages, and Programming (ICALP'14), 2014.
  • Binomial: Farach-Colton, M. and Tsai, M.T., 2015. Exact sublinear binomial sampling. Algorithmica 73(4), pp. 637-651.
  • Geometric: Bringmann, K., and Friedrich, T., 2013, July. Exact and efficient generation of geometric random variates and random graphs, in International Colloquium on Automata, Languages, and Programming (pp. 267-278).

peteroupc avatar Jan 22 '21 03:01 peteroupc