Distributions.jl
Distributions.jl copied to clipboard
Undeprecate quantile() for arrays
Fixes #1150 by undeprecating omission of the broadcast operator .
when using quantile
, quantile!
, and _quantile!
.
For reason, see this StatsBase
comment. For full discussion, see JuliaStats/StatsBase.jl#586 and JuliaStats/Distributions.jl#1150.
Codecov Report
Merging #1159 into master will not change coverage. The diff coverage is
n/a
.
@@ Coverage Diff @@
## master #1159 +/- ##
=======================================
Coverage 79.91% 79.91%
=======================================
Files 115 115
Lines 5905 5905
=======================================
Hits 4719 4719
Misses 1186 1186
Impacted Files | Coverage Δ | |
---|---|---|
src/deprecates.jl | 0.00% <ø> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 253c3a0...8904276. Read the comment docs.
Thanks. Can you add a test for the method too?
@nalimilan Sorry, I only just saw this. Where would I add a test? Also, are there any basic methods/tools or best practices for tests in Julia that I might not be aware of?
Sorry, I meant testing quantile
on a distribution, not on a collection (collections are tested in StatsBase). I'm not sure where to put it, maybe in test/utils.jl?
@ron-wolf Bump. Would you be able to add a tests as suggested in https://github.com/JuliaStats/Distributions.jl/pull/1159#issuecomment-702589660?
Bump.
@nalimilan
Sorry, I meant testing
quantile
on a distribution, not on a collection (collections are tested in StatsBase).
Looking back on this I was testing it on a distribution after all, not a collection. (If you’ll notice, I’m calling iqr(::Any)
, which dispatches to a non-weighted quantile
, wherever that’s defined.) So would this just require a rename of the file and the test case?
Actually nevermind, I have completely lost track of what has to be changed and why. Why can't the definition for iqr
just be changed to use quantile.
instead of quantile
? Why are we only undeprecating quantile
and not the other implicitly broadcasted operators? In any case, what are the concrete steps involved? “I don’t know” is an acceptable answer.
I think I may have found a better solution to the point brought up in this comment?:
# we specify AbstractVector{<:Real} as the type for a discrete
# distribution because it is the most generic type that describes
# a sortable sample: a discrete sequence of ordinal values that
# can be sampled at certain positions
quantile(x::AbstractVector{<:Real}, p::Real; sorted=false) = let
# ensure x is sorted, then sample it at the location/input p
sorted || x = sort!(collect(x))
... # do your sampling magic!
end
quantile(x::Any, p::Real) = ...
# dispatch on discrete distributions, both sorted & unsorted
quantile(x::AbstractVector{<:Real}, p::AbstractWeights; sorted=false) = let
# ensure x is sorted, then sample it at each of the locations/inputs in p
sorted || x = sort!(collect(x))
quantile.(Ref(x), p; sorted=true)
end
# a quick re-dispatch to broadcasting; instead of deprecating
# this, we just silently correct to the right behavior so
# end-users don't have to worry about the distinction
@inline quantile(x::Any, p::AbstractWeights) = quantile.(Ref(x), p)
# TODO: is there some way to deprecate
# quantile.(::AbstractVector{<:Real}, p::AbstractWeights),
# which is now radically inefficient?
which would fix our efficiency woes, while also allowing this simple definition to remain in place:
iqr(x) = let
q = quantile(x, [.25, .75])
q[2] - q[1]
end
Are there any other functions which, like you said, have this optimization of sorting and then sampling at each point? If so, perhaps it would be best to enumerate them all in this issue, and then collaborate to write a define-them-all-in-one-go metaprogramming loop.
what's the status here?
It seems in Distributions there is no performance improvement if quantile
is not broadcasted but would support arrays (correct me if I am wrong). Thus it seems the PR would fix an inconsistency with the use of quantile
in StatsBase by introducing an inconsistency with other functions such as pdf
, cdf
, etc. in Distributions without any real benefit for Distributions. Wouldn't be the most natural solution just be that StatsBase supports arrays (since it improves performance) but Distributions doesn't?
Thus it seems the PR would fix an inconsistency with the use of
quantile
in StatsBase by introducing an inconsistency with other functions such ascdf
, etc. in Distributions without any real benefit for Distributions. Wouldn't be the most natural solution just be that StatsBase supports arrays (since it improves performance) but Distributions doesn't?
Actually there would be a benefit for Distribution, as currently a deprecation warning is printed when calling iqr
on a distribution object. See https://github.com/JuliaStats/StatsBase.jl/pull/586. The problem is that to be efficient on both arrays and distributions, iqr
has to call the vectorized method quantile(x, p)
rather than quantile.(x, p)
. One alternative would be to override the definition of broacast
for quantile
in StatsBase to call the optimized method under the hood (i.e. to sort data only once).
One alternative would be to override the definition of broacast for quantile in StatsBase to call the optimized method under the hood (i.e. to sort data only once).
Another possible approach would be to add something like
function StatsBase.iqr(d::Distribution)
q = map(Base.Fix1(quantile, d), (0.25, 0.75))
return q[2] - q[1]
end
or even just
function StatsBase.iqr(d::Distribution)
q1 = quantile(d, 0.25)
q2 = quantile(d, 0.75)
return q2 - q1
end
to Distributions, it seems?
Yeah that would be simpler fix. Not sure whether other functions would need the same change.
Wouldn't be the most natural solution just be that StatsBase supports arrays (since it improves performance) but Distributions doesn't? —@devmotion
In the interest of getting this done, I think Step 1 is to figure out what state we want the code to end up in—in a module-agnostic fashion. Step 2, then, would be figuring out what belongs to Distributions vs. StatsBase; this can be considered a secondary concern.
Regarding the actual code to be written, how do we feel about the merits of the two outlines we’ve developed thus far? To reiterate, this code snippet communicates the main thrust of @devmotion’s sketch:
StatsBase.iqr(d::Distribution) = quantile(d, 0.75) - quantile(d, 0.25)
And this was my sketch:
# we specify AbstractVector{<:Real} as the type for a discrete
# distribution because it is the most generic type that describes
# a sortable sample: a discrete sequence of ordinal values that
# can be sampled at certain positions
quantile(x::AbstractVector{<:Real}, p::Real; sorted=false) = let
# ensure x is sorted, then sample it at the location/input p
sorted || x = sort!(collect(x))
... # do your sampling magic!
end
quantile(x::Any, p::Real) = ...
# dispatch on discrete distributions, both sorted & unsorted
quantile(x::AbstractVector{<:Real}, p::AbstractWeights; sorted=false) = let
# ensure x is sorted, then sample it at each of the locations/inputs in p
sorted || x = sort!(collect(x))
quantile.(Ref(x), p; sorted=true)
end
# a quick re-dispatch to broadcasting; instead of deprecating
# this, we just silently correct to the right behavior so
# end-users don't have to worry about the distinction
@inline quantile(x::Any, p::AbstractWeights) = quantile.(Ref(x), p)
# TODO: is there some way to deprecate
# quantile.(::AbstractVector{<:Real}, p::AbstractWeights),
# which is now radically inefficient?
Any thoughts on whether one of these sketches, or a different sketch, would be the best way forward?
Regarding the actual code to be written, how do we feel about the merits of the two outlines we’ve developed thus far? To reiterate, this code snippet communicates the main thrust of @devmotion’s sketch:
StatsBase.iqr(d::Distribution) = quantile(d, 0.75) - quantile(d, 0.25)
AFAICT that's the best solution.
Okay. Can someone author a PR over here for me to accept, so as to have the full changeset show up in this PR?