array-api RFC: add `quantile`?

I'm working on adding array API support in scipy.stats (scipy/scipy#20544) and one of the the things I'll need is a quantile function. If there is some support for this idea, I'll convert this issue into a proper proposal.

Looks like there is already wide support:

Previous discussions (not much):

https://github.com/data-apis/array-api/issues/10#issuecomment-656228898
https://github.com/data-apis/array-api/issues/232#issuecomment-884454859

There are many conventions for calculating quantiles. Only a few methods would be required by the standard, and if choice of a default is too contentious, perhaps the array-API can consider method to be a required keyword argument, and libraries would be welcome to keep their own default.

Apr 21 '24 21:04 mdhaber

Thanks for the proposal @mdhaber. I don't quite have an opinion yet - I think it in part depends on the situation with methods (see below).

Also - how much do you actually need this? I only count two instances of it being used in SciPy, one of which is a test case. I just did a grep, so I may be missing some dynamic usage perhaps. The one instance is pretty simple, no keyword usage:

https://github.com/scipy/scipy/blob/d55cb95282b5cfab381e373d3c127630c7e915a0/scipy/stats/tests/test_fit.py#L1002

Only a few methods would be required by the standard,

Which ones? Could we get away with only a default method, and hence no method keyword?

and if choice of a default is too contentious, perhaps the array-API can consider method to be a required keyword argument, and libraries would be welcome to keep their own default.

Apr 23 '24 09:04 rgommers

Related is our previous discussion on median, which we left out due to implementation difficulty in distributed contexts. That stated, Dask does implement median, so may be time to revisit this.

Apr 24 '24 00:04 kgryte

Also - how much do you actually need this?

Personally, I don't need it very badly. You're right that quantile/percentile only appear in a few stats functions (stats.bootstrap, stats.iqr, and some distribution fitting code), and it is easy to compute quantiles using existing array-API functions namely (sort). However, when only a few percentiles are needed, it is more efficient to use a partition approach. One could propose adding partition, since it is more general, but I would only use it for quantile.

Which ones? Could we get away with only a default method, and hence no method keyword? The standard reference is Sample Quantiles in Statistical Packages.

The current variation in sample quantile definitions causes confusion, and so there is a need to standardize the definition of sample quantile across packages and within packages... and we propose that $Q_8$ is the best choice

$Q_8$ corresponds with NumPy's median_unbiased. However, $Q_7$ (linear) is the default in most libraries. It would be unfortunate to have to choose between the two. $Q_5$ (hazen) also satisfies many desirable properties. So I suppose I would suggest having these three in that order of priority.

One wrench I'd like to throw into my own proposal: The current behavior of np.quantile(a, q, axis) (and other implementations that follow NumPy) is for the quantiles specified by q, a 1D array-like, to be computed for all axis-slices of the data a. This makes it impossible to efficiently compute a different quantile for each axis-slice of a (which is needed, for example, in a vectorized implementation of the BCA-bootstrap). I would propose following normal broadcasting rules between a and q (possibly with the constraint that q must be 0D or have length 1 along axis), which would meet both needs. A hybrid possibility, which would just be an extension of the existing behavior, would be to allow q to be an N-D array that is broadcastable with a except along axis. In this case, for each pair of corresponding axis-slices a_i and q_i, it would compute all quantiles of a_i specified by q_i.

Apr 24 '24 17:04 mdhaber

I did a bit of digging, with the disclaimer that my digging is incomplete, but I'll try to summarize my initial findings below.

Overview

I took a peak at which APIs were implemented across array libraries, spot-checked whether/how they were implemented, and did a search for which APIs were used in SciPy and sklearn.

median

Usage

SciPy: low usage; more in tests, than in implementations
scikit-learn: a reasonable amount of usage

Implementations

NumPy implements in terms of partition
CuPy implements as an element-wise kernel
PyTorch returns the lower of two medians for an even number of elements, rather than the average (NumPy returns the avg to avoid breaking astropy); to return an average, must call quantile.
Dask supports median
TensorFlow does not appear to have a dedicated median API, instead deferring to percentile

quantile

Usage

SciPy: low usage
scikit-learn: limited usage; mainly in tests

Implementations

NumPy implements in terms of partition
CuPy relies on sort
- only supports a subset of methods: lower, higher, midpoint, nearest, and linear
PyTorch supports the same methods as CuPy
Dask supports in DataFrames, but not arrays
TensorFlow implements in terms of percentile
- supports the same methods as CuPy
- API differs in that you specify the number of quantiles you want

percentile

Usage

SciPy: low usage
scikit-learn: a reasonable amount of usage

Implementations

NumPy implements in terms of quantile
PyTorch does not support
CuPy implements in terms of quantile
- supports only linear, lower, higher, nearest, and midpoint methods
Dask supports "approximate" percentiles
- supports same methods as CuPy
TensorFlow implements in terms of sort
- supports same methods as CuPy

partition

Implementations

NumPy: https://numpy.org/doc/stable/reference/generated/numpy.partition.html
CuPy supports: https://docs.cupy.dev/en/stable/reference/generated/cupy.partition.html
PyTorch does not support
Dask does not support
TensorFlow does not support

May 02 '24 15:05 kgryte

I reviewed scipy.stats more thoroughly for uses of median, quantile, and/or percentile. The first three functions would benefit from having quantile in the array API; for the rest, median would suffice.

scipy.stats.bootstrap uses percentile here, but I would be happy to change it to use quantile. It would benefit from a quantile function with improved vectorization as described above.
scipy.stats.iqr uses percentile here, but I'd be happy to change it to use quantile.
scipy.stats.epps_singleton_2samp uses iqr here.
scipy.stats.siegelslopes is currently implemented using Pythran, but I think it can be reasonably fast without Pythran if we wanted to use other backends. (Could Pythran dispatch to other backends in the future?) It uses median here and again outside the loop.
scipy.stats.theilslopes uses median here.
scipy.stats.median_test, as one might imagine, uses median here.
scipy.stats.levene uses median here.
scipy.stats.fligner uses median here.
scipy.stats.mstats.sen_seasonal_slopes uses np.ma.median here, but one could argue that this should be implemented in stats with regular np.median.

There are also uses of partition in stats.trim functions (trim_mean, trimboth, trim1).

Most of these functions have something else that would make array API conversion challenging at the moment, but I don't think any have non-starters for array API support. I've omitted uses I don't think will get array API support any time soon (e.g. levy_stable). There are also uses in other sub-packages (e.g. scipy.ndimage.median_filter), but I'm less familiar with their prospects for array API support.

May 09 '24 02:05 mdhaber

Ideally it would be nice to have support for weights and filtering missing values automatically.

NumPy 2 recently added support for weighted percentiles/quantiles via the weights kwarg:

https://numpy.org/devdocs/reference/generated/numpy.quantile.html#numpy-quantile

Adding support for nan value filtering is being discussed in #621.

However, this is not necessarily easy to handle both weights and missing values (EDIT: actually it mostly means forcing the weight of nan entries to zero before computing the cumsum of weights of the sorted observations). There is a related pull request in NumPy here:

https://github.com/numpy/numpy/pull/26582

In scikit-learn we maintain our own _weighted_percentile helper to deal with this and it's used in several places of our code base.

Jun 06 '24 14:06 ogrisel

In my experience, quantiles are very important in practice and occur all over the places.

Which ones? Could we get away with only a default method, and hence no method keyword? The standard reference is Sample Quantiles in Statistical Packages.

The current variation in sample quantile definitions causes confusion, and so there is a need to standardize the definition of sample quantile across packages and within packages... and we propose that Q8 is the best choice

Q8 corresponds with NumPy's median_unbiased. However, Q7 (linear) is the default in most libraries. It would be unfortunate to have to choose between the two. Q5 (hazen) also satisfies many desirable properties. So I suppose I would suggest having these three in that order of priority.

I could imagine to let the default method vary (for easier consensus), but each one must have a linear option (most have it already). For weighted quantiles in particular, method inverted_cdf would be mandatory.

Oct 11 '24 10:10 lorentzenchr

For weighted quantiles in particular, method inverted_cdf would be mandatory.

I would also advocate for averaged_inverted_cdf in addition to inverted_cdf, which makes it possible to get a strict equivalence between unweighted quantiles (typically computed with the linear strategy by default) and unit-weighted quantiles on even sample sizes.

Jan 17 '25 08:01 ogrisel

@rgommers you were concerned about methods. I wanted to check whether it was clear that these methods differ only in the choice of two numbers, as described in NumPy's quantile documentation, so they are not really distinct implementations (as e.g. Harwell-Davis quantiles would be).

If it helps, the methods I suggested initially can be parametrized by two floats in $[0, 1]$, so one option would be to expose those floats as as parameters (like mstats.mquantiles) does. For the rest, the two numbers are functions of the quantile q and number of observations n, so it's easier to attach names to them.

OTOH, at this point I'm leaning toward adding an implementation to SciPy in terms of array API functions, the key operations being sort and take_along_axis IIRC. It would probably be slower than a backend-specific implementation in some cases because it would always do a full sort, but I suppose we always have the option of delegating. It still might be worth having a standard for median, though. I understand the concern with calculating that in a distributed context, but sort is already in the standard.

Jan 17 '25 14:01 mdhaber

adding an implementation to SciPy in terms of array API functions

@mdhaber You might want to contact/coordinate with scikit-learn for that. (Apart from numpy 2.0 not yet being the minimal dependency) The array API compatibility is meanwhile one of the main motivations to have our own implementation of it, see the recent PR https://github.com/scikit-learn/scikit-learn/pull/29034.

All technical details aside, what is the right approach to move this RFC forward? How/Where is the decision making body, etc.?

Jan 17 '25 14:01 lorentzenchr

That's fine if scikit-learn needs its own private function. We'll need something (public or private) in SciPy anyway. I opened scipy/scipy#22352 about that.

Jan 17 '25 22:01 mdhaber

Linking topk (or top_k) discussion as this is a somewhat related question: https://github.com/data-apis/array-api/issues/629

Note with Dask users are often recommended to use something like topk as median, quantile, etc. requiring sorting large data, which can be complicated

Jan 18 '25 02:01 jakirkham

scipy/scipy#22352 added scipy.stats.quantile, which is tested with NumPy, CuPy, and PyTorch right now, and we can probably add support for the other backends in time. I think this will meet the ecosystem's needs for median and quantiles. Performance might be improved with argpartition (gh-629), so I'll close this in favor of that.

Mar 03 '25 22:03 mdhaber

@lucascolley Out of interest: Who decided when that this is not planned?

Mar 30 '25 18:03 lorentzenchr

we can reopen if you think discussion should continue! I closed it more so as 'stale' than 'not planned', both of which are less misleading than 'completed' IMO

Mar 30 '25 19:03 lucascolley

I opened the issue, and my concern is resolved (even if the resolution is in another library), so I'll close again. Anyone is welcome to open a new issue and link to this one.

Mar 30 '25 19:03 mdhaber

Okay, I edited the title to avoid misleading that quantile was added to the standard

Mar 30 '25 19:03 lucascolley

If it's not added, and there is no plan to add it, what's wrong with "not planned"?

Mar 30 '25 19:03 leofang

I don't have a problem with that, either. It is also accurate to say that it is "not planned".

Perhaps there is confusion about the difference between "not planned" and "decided against adding"? Looking at the discussion above, although the possibility wasn't entirely ruled out, there is no plan to add it.

Mar 30 '25 19:03 mdhaber

Perhaps there is confusion about the difference between "not planned" and "decided against adding"?

It looks like we can use the label https://github.com/data-apis/array-api/labels/status%3A%20Rejected to make this distinction

Mar 31 '25 19:03 lucascolley

array-api array-api copied to clipboard

RFC: add `quantile`?

Overview

median

Usage

Implementations

quantile

Usage

Implementations

percentile

Usage

Implementations

partition

Implementations

array-api
array-api copied to clipboard