Distributions.jl
Distributions.jl copied to clipboard
Decouple `rand` and `eltype`
One of my take-aways from issues such as #1252, #1902, #894, #1783, #1041, and #1071 is that eltype is not only quite inconsistently implemented and unclearly documented currently but more generally it might be a bad design decision to use it for pre-allocating containers of samples: Setting it to a fixed type (historically Float64 for continuous and Int for discrete distributions) is too limiting, but trying to infer it from parameters is challenging and doomed to fail for more challenging scenarios that are not as clear as e.g. Normal.
This PR tries to decouple rand and eltype, to make it easier to possibly eventually deprecate and remove eltype.
Fixes #1041. Fixes #1071. Fixes #1082. Fixes #1252. Fixes #1783. Fixes #1884. Fixes #1902. Fixes #1907.
Codecov Report
:x: Patch coverage is 83.25243% with 69 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 85.56%. Comparing base (0421b18) to head (b813560).
Additional details and impacted files
@@ Coverage Diff @@
## master #1905 +/- ##
==========================================
- Coverage 86.36% 85.56% -0.81%
==========================================
Files 146 146
Lines 8786 8851 +65
==========================================
- Hits 7588 7573 -15
- Misses 1198 1278 +80
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
Given that, according to the docstring
https://github.com/JuliaStats/Distributions.jl/blob/a1010e408dbbdd4bcfd9c11eef189df64f7bb05a/src/common.jl#L96-L102
the sole purpose of eltype is to return the type of rand, which I agree isn't really feasible, should we also deprecate the methods here as part of this PR?
#1907 is related.
https://github.com/JuliaStats/Distributions.jl/issues/1907 is related.
https://github.com/JuliaStats/Distributions.jl/pull/1905/commits/17154a2c1e38050f3c864b3e93f1b8cd496f45a8 fixes it.
My personal opinion:
rand(d::Distribution) should remain a valid method (I don't think anyone is trying to change this), and its return type should be predictable by knowing what d is. It would also be useful to have a method that tells you what type rand(d::Distribution) returns without running rand(d::Distribution). eltype(d::Distribution) is logical for this.
This is orthogonal to the fact that Distributions should probably support rand(d::Distribution, T::Type).
In an ideal situation, I think we would define rand(d::Distribution, T::Type) for distributions (to allow for potential type-specific optimizations and evade unnecessary typecasts) and then dispatch rand(d::Distribution) = rand(d, eltype(d)).
its return type should be predictable by knowing what d is
I became convinced that in general this is not possible. It basically means trying to manually figure out type inference, a task that even the Julia compiler can fail at. Of course, for some number types and distributions it can be done (as also the compiler will succeed in many instances). But in general it's far from trivial. As in the initial stages of Distributions, one could be restrictive and limit samples to Int and Float but that's IMO not desirable either. Therefore I think it's best to remove eltype (which IMO has also been a weird name since I don't view distributions as containers or collections) from the API. IMO it's just too brittle and currently to inconsistent. Of course, that doesn't prevent users from trying to figure out the (element) type of samples with eg the help of the compiler, it's just not an official feature anymore.
A few distributions still give me Float64 after this PR, while others work fine:
julia> dist = Semicircle(50.0f0)
Semicircle{Float32}(r=50.0f0)
julia> rand(dist, 5)
5-element Array{Float64, 1}:
36.80671953487414
-18.355635129900335
-11.701855436648922
-21.444118928985656
-5.80120463505302
julia> dist = JohnsonSU(0.0f0, 1.0f0, 0.0f0, 1.0f0)
JohnsonSU{Float32}(ξ=0.0f0, λ=1.0f0, γ=0.0f0, δ=1.0f0)
julia> rand(dist, 5)
5-element Array{Float64, 1}:
0.5311696707298299
1.632313034117999
0.04951771555318912
0.4721610259428258
-3.052321854866766
julia> dist = Chisq(5.0f0)
Chisq{Float32}(ν=5.0f0)
julia> rand(dist, 5)
5-element Array{Float32, 1}:
15.465032
1.888659
7.013455
4.258529
3.9611576
Can this be fixed?
The reason is that for quite a few of the older, probably less used and definitely less frequently updated distributions the rand implementation contains hardcoded Float64, either as literals or as calls of rand(rng) or randn(rng). I fixed the SemicircleandJohnsonSU`, but I'm not sure if it's possible (or even desirable) to address all of these in a single PR (it's already a quite massive undertaking). Maybe it would be best to open issues for others and address them in follow-up PRs (most (all?) other such changes in this PR are included to fix open issues).
This is not the first time I have went on a rant about how Distributions in this package are effectively unordered containers with infinite size, but I have stumbled upon new ammunition: https://docs.julialang.org/en/v1/stdlib/Random/#A-simple-sampler-without-pre-computed-data
This section of the Julia stdlib documentation claims:
Given a collection type
S, it's currently assumed that ifrand(::S)is defined, an object of typeeltype(S)will be produced. In the last example, aVector{Any}is produced; the reason is thateltype(Die) == Any. The remedy is to defineBase.eltype(::Type{Die}) = Int.
This led me to https://github.com/JuliaLang/julia/pull/31787 and https://github.com/JuliaLang/julia/pull/27756
I believe that it is actually deeper Random.jl precedent that typeof(rand(s)) == eltype(s). Do we really want to break this precedent in Distributions.jl?
Additional rambling
On a similar note, I do not think that it is controversial that eltype(d) should also be the type returned by minimum(d), maximum(d), mean(d), mode(d), etc., following Base precedent.
I also believe that it is reasonable that quantile(d, x) should return eltype(d); it is analogous to getindex to some extent. When would it make sense for rand and quantile to return different types? In other words, I believe that the following behavior is not ideal:
julia> d = Normal(0f0, 1f0)
Normal{Float32}(μ=0.0f0, σ=1.0f0)
julia> typeof(quantile(d, .5)) == typeof(rand(d))
false
For many distribution types, perhaps the real conversation we should be having is if we should force partype(D == eltype(D). For distributions parameterized by their extrema, e.g. uniform, I think this is very reasonable. Uniform(0f0, 1f0) is basically saying "construct an infinite unordered container with a uniform distribution with minimum == 0f0, maximum == 1f0."
For distributions parameterized by mean/std, similar logic definitely holds. It is not a stretch to extend this for other location/spread parameters. Shape parameters are ambiguous, but if someone types Beta(1f0, 1f0), I would suspect that they have a good reason for it.
For distributions like Binomial, I would argue that the type of n is more useful than the type of p. Unfortunately, the type of n is currently hardcoded as Int while the type of p is user-defined.
Yet further rambling
cdf is a strange cousin of indexin and findfirst, which I believe are guaranteed to return keytype. Actually implementing an explicit keytype(d) == Float64) may be a reach though, as I suppose this also implies getindex support, which quantile is not quite identical to.
eltype issues described here are also quite painful when using parameters of Dual type in a distribution d. This makes rand(d, 10) throw due to pre-allocation based on eltype (which is often Float64), as well as eltype just being plain incorrect (doing rand(d)) since Duals actually get returned.
Sorry, late to the party, but with iterators we have a "HasEltype" attribute to address exactly this issue, how is this design related?
I also believe that it is reasonable that
quantile(d, x)should returneltype(d); it is analogous togetindexto some extent. When would it make sense forrandandquantileto return different types?
Sometimes quantiles of integer-valued distributions are taken to interpolate between the integers flanking the 50% value of the cumulative. E.g. the median of an unbiased Bernoulli distribution would be 0.5 in this convention. Just saying.