Distributions.jl icon indicating copy to clipboard operation
Distributions.jl copied to clipboard

Undeprecate quantile() for arrays

Open ron-wolf opened this issue 4 years ago • 17 comments

Fixes #1150 by undeprecating omission of the broadcast operator . when using quantile, quantile!, and _quantile!.

For reason, see this StatsBase comment. For full discussion, see JuliaStats/StatsBase.jl#586 and JuliaStats/Distributions.jl#1150.

ron-wolf avatar Aug 17 '20 15:08 ron-wolf

Codecov Report

Merging #1159 into master will not change coverage. The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1159   +/-   ##
=======================================
  Coverage   79.91%   79.91%           
=======================================
  Files         115      115           
  Lines        5905     5905           
=======================================
  Hits         4719     4719           
  Misses       1186     1186           
Impacted Files Coverage Δ
src/deprecates.jl 0.00% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 253c3a0...8904276. Read the comment docs.

codecov-commenter avatar Aug 17 '20 16:08 codecov-commenter

Thanks. Can you add a test for the method too?

nalimilan avatar Aug 21 '20 14:08 nalimilan

@nalimilan Sorry, I only just saw this. Where would I add a test? Also, are there any basic methods/tools or best practices for tests in Julia that I might not be aware of?

ron-wolf avatar Oct 02 '20 02:10 ron-wolf

Sorry, I meant testing quantile on a distribution, not on a collection (collections are tested in StatsBase). I'm not sure where to put it, maybe in test/utils.jl?

nalimilan avatar Oct 02 '20 08:10 nalimilan

@ron-wolf Bump. Would you be able to add a tests as suggested in https://github.com/JuliaStats/Distributions.jl/pull/1159#issuecomment-702589660?

andreasnoack avatar Dec 02 '20 08:12 andreasnoack

Bump.

nalimilan avatar Feb 01 '21 08:02 nalimilan

@nalimilan

Sorry, I meant testing quantile on a distribution, not on a collection (collections are tested in StatsBase).

Looking back on this I was testing it on a distribution after all, not a collection. (If you’ll notice, I’m calling iqr(::Any), which dispatches to a non-weighted quantile, wherever that’s defined.) So would this just require a rename of the file and the test case?

ron-wolf avatar Mar 05 '21 02:03 ron-wolf

Actually nevermind, I have completely lost track of what has to be changed and why. Why can't the definition for iqr just be changed to use quantile. instead of quantile? Why are we only undeprecating quantile and not the other implicitly broadcasted operators? In any case, what are the concrete steps involved? “I don’t know” is an acceptable answer.

ron-wolf avatar Mar 05 '21 03:03 ron-wolf

I think I may have found a better solution to the point brought up in this comment?:

# we specify AbstractVector{<:Real} as the type for a discrete 
# distribution because it is the most generic type that describes 
# a sortable  sample: a discrete sequence of ordinal values that 
# can be sampled at certain positions

quantile(x::AbstractVector{<:Real}, p::Real; sorted=false) = let
    # ensure x is sorted, then sample it at the location/input p
    sorted || x = sort!(collect(x))
    ... # do your sampling magic!
end
quantile(x::Any, p::Real) = ...

# dispatch on discrete distributions, both sorted & unsorted
quantile(x::AbstractVector{<:Real}, p::AbstractWeights; sorted=false) = let
    # ensure x is sorted, then sample it at each of the locations/inputs in p
    sorted || x = sort!(collect(x))
    quantile.(Ref(x), p; sorted=true)
end

# a quick re-dispatch to broadcasting; instead of deprecating 
# this, we just silently correct to the right behavior so 
# end-users don't have to worry about the distinction
@inline quantile(x::Any, p::AbstractWeights) = quantile.(Ref(x), p)

# TODO: is there some way to deprecate 
#   quantile.(::AbstractVector{<:Real}, p::AbstractWeights),
# which is now radically inefficient?

which would fix our efficiency woes, while also allowing this simple definition to remain in place:

iqr(x) = let
    q = quantile(x, [.25, .75])
    q[2] - q[1]
end

Are there any other functions which, like you said, have this optimization of sorting and then sampling at each point? If so, perhaps it would be best to enumerate them all in this issue, and then collaborate to write a define-them-all-in-one-go metaprogramming loop.

ron-wolf avatar Mar 05 '21 05:03 ron-wolf

what's the status here?

matbesancon avatar Apr 23 '21 09:04 matbesancon

It seems in Distributions there is no performance improvement if quantile is not broadcasted but would support arrays (correct me if I am wrong). Thus it seems the PR would fix an inconsistency with the use of quantile in StatsBase by introducing an inconsistency with other functions such as pdf, cdf, etc. in Distributions without any real benefit for Distributions. Wouldn't be the most natural solution just be that StatsBase supports arrays (since it improves performance) but Distributions doesn't?

devmotion avatar Apr 23 '21 09:04 devmotion

Thus it seems the PR would fix an inconsistency with the use of quantile in StatsBase by introducing an inconsistency with other functions such as pdf, cdf, etc. in Distributions without any real benefit for Distributions. Wouldn't be the most natural solution just be that StatsBase supports arrays (since it improves performance) but Distributions doesn't?

Actually there would be a benefit for Distribution, as currently a deprecation warning is printed when calling iqr on a distribution object. See https://github.com/JuliaStats/StatsBase.jl/pull/586. The problem is that to be efficient on both arrays and distributions, iqr has to call the vectorized method quantile(x, p) rather than quantile.(x, p). One alternative would be to override the definition of broacast for quantile in StatsBase to call the optimized method under the hood (i.e. to sort data only once).

nalimilan avatar Apr 24 '21 11:04 nalimilan

One alternative would be to override the definition of broacast for quantile in StatsBase to call the optimized method under the hood (i.e. to sort data only once).

Another possible approach would be to add something like

function StatsBase.iqr(d::Distribution)
    q = map(Base.Fix1(quantile, d), (0.25, 0.75))
    return q[2] - q[1]
end

or even just

function StatsBase.iqr(d::Distribution)
    q1 = quantile(d, 0.25)
	q2 = quantile(d, 0.75)
    return q2 - q1
end

to Distributions, it seems?

devmotion avatar Apr 24 '21 11:04 devmotion

Yeah that would be simpler fix. Not sure whether other functions would need the same change.

nalimilan avatar Apr 24 '21 20:04 nalimilan

Wouldn't be the most natural solution just be that StatsBase supports arrays (since it improves performance) but Distributions doesn't? —@devmotion

In the interest of getting this done, I think Step 1 is to figure out what state we want the code to end up in—in a module-agnostic fashion. Step 2, then, would be figuring out what belongs to Distributions vs. StatsBase; this can be considered a secondary concern.

Regarding the actual code to be written, how do we feel about the merits of the two outlines we’ve developed thus far? To reiterate, this code snippet communicates the main thrust of @devmotion’s sketch:

StatsBase.iqr(d::Distribution) = quantile(d, 0.75) - quantile(d, 0.25)

And this was my sketch:

# we specify AbstractVector{<:Real} as the type for a discrete 
# distribution because it is the most generic type that describes 
# a sortable  sample: a discrete sequence of ordinal values that 
# can be sampled at certain positions

quantile(x::AbstractVector{<:Real}, p::Real; sorted=false) = let
    # ensure x is sorted, then sample it at the location/input p
    sorted || x = sort!(collect(x))
    ... # do your sampling magic!
end
quantile(x::Any, p::Real) = ...

# dispatch on discrete distributions, both sorted & unsorted
quantile(x::AbstractVector{<:Real}, p::AbstractWeights; sorted=false) = let
    # ensure x is sorted, then sample it at each of the locations/inputs in p
    sorted || x = sort!(collect(x))
    quantile.(Ref(x), p; sorted=true)
end

# a quick re-dispatch to broadcasting; instead of deprecating 
# this, we just silently correct to the right behavior so 
# end-users don't have to worry about the distinction
@inline quantile(x::Any, p::AbstractWeights) = quantile.(Ref(x), p)

# TODO: is there some way to deprecate 
#   quantile.(::AbstractVector{<:Real}, p::AbstractWeights),
# which is now radically inefficient?

Any thoughts on whether one of these sketches, or a different sketch, would be the best way forward?

ron-wolf avatar Aug 31 '21 19:08 ron-wolf

Regarding the actual code to be written, how do we feel about the merits of the two outlines we’ve developed thus far? To reiterate, this code snippet communicates the main thrust of @devmotion’s sketch:

StatsBase.iqr(d::Distribution) = quantile(d, 0.75) - quantile(d, 0.25)

AFAICT that's the best solution.

nalimilan avatar Aug 31 '21 20:08 nalimilan

Okay. Can someone author a PR over here for me to accept, so as to have the full changeset show up in this PR?

ron-wolf avatar Sep 03 '21 13:09 ron-wolf