StatsBase.jl icon indicating copy to clipboard operation
StatsBase.jl copied to clipboard

alternative characterization for weighted sample quantile

Open salbert83 opened this issue 3 years ago • 26 comments

I am not familiar with the weighted quantile calculation provided by StatsBase.jl. It seems inconsistent with what one would want, for example, in a minimum weighted absolute deviation calculation. Based on the implementation, it seems this is intentional, but I am not sure why? Please review the document and sample code. Thank you. QuantileCalc.pdf quantile_chk.txt

salbert83 avatar Apr 22 '21 23:04 salbert83

See https://github.com/JuliaStats/StatsBase.jl/issues/435 and linked PR. @matthieugomez may be able to answer.

nalimilan avatar Apr 26 '21 20:04 nalimilan

@salbert83 Does your quantile function gives the same thing as Base Julia for weights equal to one?

matthieugomez avatar Apr 26 '21 20:04 matthieugomez

The short answer is no, because the solution to the associated optimization problem might not be unique. I extended the writeup. QuantileCalc.pdf I began looking at this in relation to a rejected PR I'd made to Distributions.jl regarding a MLE fit for a weighted Laplace distribution. The reviewer suggested using the StatsBase weighted quantile calculation, which doesn't return the result one needs in this context. So, I figured to see whether others agreed with this suggested change to the weighted quantile calculation, or should I reopen the other PR? Thanks for your time.

salbert83 avatar Apr 27 '21 01:04 salbert83

There are several ways to extend unweighted quantile for non-frequency weights. The current implementation satisfies these three properties:

  1. equals unweighted quantile for a weight vector of ones
  2. gives the same result after normalizing weights to sum to one
  3. gives the same result after removing elements with weights equal to zero

If you can show me that your definition satisfies these three properties, and that it can also be seen as the solution of a nice minimization problem, I'd be down to use it instead.

matthieugomez avatar Apr 27 '21 15:04 matthieugomez

  1. Please review the PDF, where I examined the 30% unweighted quantile vs my implementation on the integers 1..100.
    2 & 3) I believe my implementation satisfies these requirements, unless my tests haven't uncovered some bug.

The notes elaborate on the associated optimization problem, which may not have a unique solution.

Thanks again for your time.

salbert83 avatar Apr 27 '21 15:04 salbert83

In the pdf, you don't show that 1, 2, 3 are true, you just look at one example.

matthieugomez avatar Apr 27 '21 15:04 matthieugomez

The implementation largely follows the current implementation. A view is taken restricted to observations with nonzero weights, so (3) should be satisfied in all cases. The code should also be invariant to rescaling all the weights by the same positive constant, so (2) should be satisfied. (1) is not satisfied, as demonstrated by the example. In my humble opinion, for this case I find 30 easier to explain than 30.7 for the 30% quantile. Thanks again.

salbert83 avatar Apr 27 '21 16:04 salbert83

Right, 2&3 are satisfied. And I agree with you — I don't particularly like the definition of unweighted quantile used in Julia. But I still think it's important to match it in case weights equal to one.

matthieugomez avatar Apr 27 '21 16:04 matthieugomez

The unweighted quantile is the one from Statistics.jl, right? I see Julia defaults to definition 7 of Hyndman and Fan (1996). It seems definition 4 is more consistent with the results of the optimization (achieved setting alpha = 0, beta =1 in optional params), whereas the Julia defaults for these optional parameters are alpha = beta = 1. I'm guessing since my main interest is in MLE calculations, would you suggest I reopen that PR with the specific point that the Julia weighted quantile calculation is not correct for this case? Thank you.

salbert83 avatar Apr 27 '21 17:04 salbert83

I guess you could open a PR to add a type kwarg to quantile in Statistics, adding the type you want + open a PR in StatsBase to add a type kwarg to quantile with weights too.

matthieugomez avatar Apr 27 '21 21:04 matthieugomez

Statistics.quantile already supports the optional arguments, julia> z = [1:100...]; julia> quantile(z, 0.3, alpha = 1, beta = 1) # These are the defaults if left unspecified 30.7 julia> quantile(z, 0.3, alpha = 0, beta = 1) # This seems consistent with the characterization through optimization. 30.0

Your suggestion then is to add a kwarg to the StatsBase calculation?

salbert83 avatar Apr 27 '21 21:04 salbert83

Oh, I see! So yeah, it'd be terrific to add the kwargs alpha and beta to the weighted quantile function in StatsBase (at least if you find an extension that makes sense, in particular, satisfying the requirements 1 2 3 given above).

matthieugomez avatar Apr 27 '21 22:04 matthieugomez

I implemented the MLE consistent calculation using an optional kwarg. It doesn't cover the full range regarding alpha and beta settings available in Statistics.quantile, but does provide an option for those requiring this for MLE. I refactored the implementation so both methods share as much code as possible. Please review. Thanks.

salbert83 avatar Apr 28 '21 11:04 salbert83

Even if you don't support arbitrary alpha and beta values, can you use the same keyword argument names as in the unweighted function (throwing errors for values that are not supported)?

nalimilan avatar Apr 28 '21 12:04 nalimilan

While I think alpha = 0 and beta = 1 are consistent with the optimization approach, I think it would be misleading. Statistics.quantile supports arbitrary alpha and beta in [0,1] as parameters controlling interpolation. The point of the kwarg here is to indicate its consistency with MLE.

salbert83 avatar Apr 28 '21 14:04 salbert83

Sorry, closed accidentally.

salbert83 avatar Apr 28 '21 14:04 salbert83

Why would it be misleading? That would make the different quantile definitions more consistent. And if in the future somebody requests supporting another definition, we don't want to add multiple keyword arguments.

nalimilan avatar Apr 28 '21 15:04 nalimilan

I did some further investigation to see whether this could work. Unfortunately, my previous speculation was incorrect, as demonstrated by the following examples. Consider the following 2 cases, where the function "f" is the cost function from my earlier notes.

Example 1 q = 0.5 z = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] w = [1.0, 1.0, 1.0, 1.0, 1.9, 0.1, 1.0, 1.0, 1.0, 1.0] StatsBase.quantile(z, Weights(w), [q]) |> only = 4.7894736842105265 f(StatsBase.quantile(z, Weights(w), [q]) |> only) = 12.260526315789473 StatsBase.quantile(z, Weights(w), [q], mle = true) |> only = 5.0 f(StatsBase.quantile(z, Weights(w), [q], mle = true) |> only) = 12.05 Statistics.quantile(z, q, alpha = 0, beta = 1) = 5.0
Statistics.quantile(z, q, alpha = 1, beta = 0) = 6.0

Example 2: q = 0.5 z = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] w = [1.0, 1.0, 1.0, 1.0, 0.1, 1.9, 1.0, 1.0, 1.0, 1.0] StatsBase.quantile(z, Weights(w), [q]) |> only = 5.7368421052631575 f(StatsBase.quantile(z, Weights(w), [q]) |> only) = 12.286842105263158 StatsBase.quantile(z, Weights(w), [q], mle = true) |> only = 6.0 f(StatsBase.quantile(z, Weights(w), [q], mle = true) |> only) = 12.05 Statistics.quantile(z, q, alpha = 0, beta = 1) = 5.0 Statistics.quantile(z, q, alpha = 1, beta = 0) = 6.0

In Example 1, alpha = 0 and beta = 1 is consistent with the optimization approach. In Example 2, alpha = 1 and beta = 0 is consistent. In both these cases, the minima is unique and there is no ambiguity in the value that is selected by optimization. I believe selecting according to this optimization criteria doesn't quite fit the current framework of selecting interpolation parameters.

salbert83 avatar Apr 28 '21 18:04 salbert83

Here are some examples with equal weights showing selection by optimization is different from choosing alpha and beta parameters for interpolation:

z = [0.1576177442664426, 0.22588785878291495, 0.3472306214052947, 0.484404897858993, 0.5295720233336718, 0.5706659909590641, 0.6204132900748311, 0.6264793742353929, 0.9104324596368851, 0.9773671697505484] w = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] StatsBase.quantile(z, Weights(w), [q]) |> only = 0.4968710244900043 f(StatsBase.quantile(z, Weights(w), [q]) |> only) = 0.9475584471334264 StatsBase.quantile(z, Weights(w), [q], mle = true) |> only = 0.484404897858993 f(StatsBase.quantile(z, Weights(w), [q], mle = true) |> only) = 0.9430706415462624 Statistics.quantile(z, q, alpha = 0, beta = 1) = 0.43502215833566155 f(Statistics.quantile(z, q, alpha = 0, beta = 1)) = 0.9746755948411945 Statistics.quantile(z, q, alpha = 1, beta = 0) = 0.5133118581627875 f(Statistics.quantile(z, q, alpha = 1, beta = 0)) = 0.9534771472556283 Statistics.quantile(z, q, alpha = 1, beta = 1) = 0.4968710244900043 f(Statistics.quantile(z, q, alpha = 1, beta = 1)) = 0.9475584471334264

salbert83 avatar Apr 29 '21 13:04 salbert83

Thanks for checking. So you mean that this is yet another variant that isn't supported by the unweighted method in Statistics? Is it implemented in other statistical packages? If we add support for the weighted variant, we should probably support the unweighted one too (in Statistics).

@matthieugomez What do you think?

nalimilan avatar May 02 '21 21:05 nalimilan

I'm not sure whether this is implemented in other statistical packages. I'm not a statistician by trade, so I can't claim expertise here. My motivation for this pull request started here: https://github.com/JuliaStats/Distributions.jl/pull/1310#issuecomment-824879815 The weighted absolute deviation calculation arises if, for example, one uses Laplace distributions as part of a GMM or HMM. I'd made a submission regarding this to Distributions.jl, but was informed what was available (the weighted median) was good enough. This relied on the current weighted quantile method, which doesn't actually match the MLE for the distribution. I disagreed with the reviewer's assessment, but it did seem reasonable to see whether there was interest in a general weighted quantile calculation that could then be leveraged for my use case of interest. Thanks again.

salbert83 avatar May 04 '21 00:05 salbert83

OK. Please do have a look at what other languages do, as it's generally a good idea to try using the same vocabulary for argument names if possible.

@devmotion Do you have an opinion about this?

nalimilan avatar May 04 '21 07:05 nalimilan

I saw the following Python package, but I'm not sure how widely it is used, https://github.com/nudomarinero/wquantiles. My personal opinion is to design around use cases. My current use case is a correct implementation for parameter estimation for a weighted Laplace distribution. By correct, I mean returning a value which maximizes the likelihood function, noting that this need not uniquely determine the value. To resolve non-uniqueness, I imagine users would want to have a prior, in which case they would probably want to handle the details themselves.

salbert83 avatar May 04 '21 10:05 salbert83

I am not completely sure, are you mainly interested in my opinion about keyword argument names?

I just skimmed through the discussion quickly but I think it mixes up different ways to define (weighted) quantiles, based on the (smoothed) cumulative distribution function and based on quantile regression.

As mentioned before by others, I think any implementation of weighted quantiles in StatsBase has to be consistent with the unweighted implementation in Statistics. I would go even further than property 1 above and say that it would be desirable that it

1+: equals unweighted quantile with replicated samples for integer weights

It would be nice to add supports for parameter alpha and beta to the implementation in StatsBase - in my opinion, ideally the implementation in StatsBase would just be a generalization of the one in Statistics and satisfy all consistency properties mentioned above.

To me it seems, it would be better to implement estimates based on (weighted) quantile regression in a separate function.

devmotion avatar May 04 '21 11:05 devmotion

I'm not sure how to proceed here. My end goal is to have a weighted Laplace fit (similar to weighted fits for other distributions, that returns a correct answer. If the internal implementation needs to leverage the StatsBase weighted median function, then this would need to change. But then that causes inconsistency with the unweighted quantile calculations in Statistics. It seems to me the way to proceed is to introduce a separate weighted quantile regression (per the previous suggestion), and then submit a pull request for Laplace that leverages this function. If there's general agreement, I'll reimplement. Thanks.

salbert83 avatar May 04 '21 13:05 salbert83

It seems to me the way to proceed is to introduce a separate weighted quantile regression (per the previous suggestion), and then submit a pull request for Laplace that leverages this function.

Yes, this would be my suggestion. What do you think @nalimilan?

devmotion avatar May 08 '21 11:05 devmotion