Statistics.jl What to port from StatsBase

This issue is to discuss what functions should be ported from StatsBase to Statistics (#2). Some functions would better move to a separate package:

statmodels.jl: should go to StatsAPI.jl

Most APIs have passed the test of time so they are probably good enough, but I find some of them are not completely satisfying:

hist.jl: I don't know this part of the code enough to judge whether the API is OK. There have been proposals to move these to a separate package (https://github.com/JuliaStats/StatsBase.jl/issues/650).
weights.jl: Weighted sum cannot be implemented via a weights keyword arguments like other functions since the function lives in Base (https://github.com/JuliaLang/julia/pull/33310). We could either export wsum or keep it internal and do not support it for now.
counts.jl: counts sounds a bit too generic of a term for a function that only allows counting integer values. countmap is more general and its name is explicit. That said, counts could easily be extended to allow any type of levels -- its limitation is just that it returns a vector without names so the mapping to the levels has to be done by hand, which isn't user-friendly. APIs provided by FreqTables.jl are nicer to use, but they need NamedArrays.jl (or a similar package). Then there's the issue that countmap uses radix sort for performance with some types, but this needs SortingAlgorithms.jl, which isn't a stdlib (yet?).
deviation.jl: Do we really need all of these small convenience functions? counteq and countne don't really sound like statistical functions and I'm not sure how commonly they are used. sqL2dist, L2dist, L1dist, Linfdist have an uppercase in their name; these and remaining functions are redundant with functions provided in Distances.jl. That only leaves psnr.
misc.jl: indexmap is just indexin so remove it. levelsmap and indicatormat sound a bit limited compared with what StatsModels provides. rle and inverse_rle are not really related to statistics.
scalarstats.jl: mean_and_var and mean_and_std have weird names so I'm not sure we should keep them or not. zscore and zscore! are convenient but redundant with (more general and more verbose) functions in transformations.jl.
transformations.jl: transform and transform! are too generic names, I propose overloading LinearAlgebra's normalize and normalize!, since that name is actually the commonly used term for such transformations. I wonder whether we really need reconstruct and reconstruct! (which could be called unnormalize if we keep them). I'm also not sure what's the use of allowing a separate fit operation before actually applying the transformation (I'd imagine one would always normalize the data immediately).
moments.jl: moment is redundant with specific functions so I'd drop it.
robust.jl: trimvar(x) could be var(trim(x)) if trim(x) returned a special iterator type to dispatch on

See also my previous notes at https://github.com/JuliaLang/julia/pull/27152#issuecomment-390687566.

Sep 27 '21 07:09 nalimilan

Let me ask top-down, as I was not involved much in earlier discussions. Is the plan the following:

move whatever makes sense in Stabistics.jl there
move API-kind definitions to StatsAPI.jl
leave in StatsBase.jl all the things that do not fit points 1 and 2 above so that old codes keep working ?

Also then the question is when we move things from StatsBase.jl to Statistics.jl should we allow (and thus discuss) API changes or we only port things? If we allow API changes then probably we need to keep in StatsBase.jl some deprecations for the old API.

Finally (again - I do not want to mess things up too much), it would be great to start moving things to Statistics.jl with a clear statement how we want to handle:

missing values
weighting

in a consistent manner across all functions (it might have been discussed but I think it would be beneficial for many - for sure for me - to have it clearly stated, so then when I review the PR I can check every single function if it conforms to the assumed API). The point is that once we move things to Statistics.jl there will be no way to fix things before Julia 2.0 release.

Sep 27 '21 09:09 bkamins

Let me ask top-down, as I was not involved much in earlier discussions. Is the plan the following:
1. move whatever makes sense in Statistics.jl there

2. move API-kind definitions to StatsAPI.jl

3. leave in StatsBase.jl all the things that do not fit points 1 and 2 above so that old codes keep working?

Yes. Though for 3 we will have to carefully assess how we can reexport functions from Statistics to avoid conflicts on recent Julia versions.

Also then the question is when we move things from StatsBase.jl to Statistics.jl should we allow (and thus discuss) API changes or we only port things? If we allow API changes then probably we need to keep in StatsBase.jl some deprecations for the old API.

I'm afraid we'll have to apply some API changes (as noted above for some cases). So yeah adding deprecations to StatsBase would be nice for users, though not absolutely required in the short term as nobody will be forced to switch to Statistics (as long as we avoid/fix conflicts with StatsBase).

Finally (again - I do not want to mess things up too much), it would be great to start moving things to Statistics.jl with a clear statement how we want to handle:
* missing values

* weighting
in a consistent manner across all functions (it might have been discussed but I think it would be beneficial for many - for sure for me - to have it clearly stated, so then when I review the PR I can check every single function if it conforms to the assumed API). The point is that once we move things to Statistics.jl there will be no way to fix things before Julia 2.0 release.

Good point.

For weights, I think the majority want to use the weights keyword argument, which makes it possible to accept any AbstractVector in most functions (the exception being var where the type of weights is needed to choose the correction, and quantile). There have been counter-proposals at https://github.com/JuliaLang/julia/pull/33310 to use a more general weighted(x, w) approach. But I don't think that's super useful in the end as weighting is generally not just multiplying values by weights (that's only the case for sums and means, but not e.g. for quantile or var), so it would only be used for dispatch.

Missing values are a more tricky issue. For unweighted single-argument functions, f(skipmissing(x)) works well. For multiple-argument functions like cor we don't have a great solution yet, but this is orthogonal to importing StatsBase. What isn't orthogonal to StatsBase is how to skip missing values with weighted functions. Let's discuss this at https://github.com/JuliaLang/Statistics.jl/issues/88 as it's quite complex.

Sep 27 '21 10:09 nalimilan

countmap, the name is not so lucky, the counts(x, k::Integer, [wv::AbstractWeights]) ("If an integer k is provided, only values in the range 1:k will be considered.") is a bit non-sensical, but I think a good counts should be part of the statistics package, giving the user a vector of pairs or a table for every input where unique is meaningful and a version of it applying a binning map first.

Sep 28 '21 15:09 mschauer

What would you suggest then? Have counts(x) equivalent to countmap, and keep a counts(x, k) method? Whether the result of counts(x) is a vector of pairs or a dict could be specified via counts(Dict/Vector, x). (Unfortunately we can't return a table as we would need to depend on something like NamedArrays.)

Applying a binning map is what histograms do, right?

Sep 28 '21 15:09 nalimilan

counts(Dict/Vector, x).

That reads right.

Applying a binning map is what histograms do, right?

The actual work in histograms is to find the binning map, right? Otherwise it could really just be count(to_bin(range), Dict/Vector, x) with to_bin(range) giving a function which assigns each x the right bin, say as Pair(a,b) of a sorted vector or StepRange argument.

Sep 28 '21 19:09 mschauer

OK so basically we could also add a counts(f, x) and/or counts(f, Dict/Vector, x) method that would call f on each element before "counting" it. Then the API for a to_bin function can be developed separately.

Sep 28 '21 20:09 nalimilan

One more thing

rle and inverse_rle are not really related to statistics.

I think the keyword is "average run length", which is important for statistics and the reason why rle found a place in StatsBase. It's a bit of a riddle how to get arl from rle to me at the moment though.

Sep 29 '21 10:09 mschauer

I tend to think in general it would be better to move things from Statistics to StatsBase (or some other package). For posterity a copy of the discussion on Slack:

David Widmann: I guess this has been discussed at length but exactly this issue makes me wonder if it is a good idea at all to move StatsBase to Statistics. I assume development would slow down, it would take more time until bugfixes are available to users, and breaking changes would not be possible until Julia 2.0. What is the main advantage of moving StatsBase to Statistics? That it seems more official to users and they don't have to load both Statistics and StatsBase? ... Alejandro Merchan: I agree with @devmotion, I thought the idea always has been to get things out of Base and into packages. Why that change of mind? Michael K. Borregaard: Really stable APIs live well in stdlibs, it means people can really rely on them in their development David Widmann: semver should already protect you from breaking changes :man_shrugging: In my experience, even if the API seems stable, it is really useful to have the possibility to update and improve it if one realizes at a later point that some choice was suboptimal. And if something lives in a stdlib one can't just make a bugfix release but has to wait for the next Julia release and it might never be available in older Julia versions if it is not backported (correct me if I'm wrong). So it seems not only the API but ideally also the implementation itself should be perfect if something becomes part of a stdlib. Michael K. Borregaard: That sounds like an argument against stdlibs in general, rather than an argument for not putting the most used statistical methods in an stdlib now we have them? StatsBase is one of the oldest and most solid julia packages out there, with 100s of deps and a very stable api Milan Bouchet-Valat: One reason to move things to Statistics is that we can't add a weights keyword argument to mean, quantile, etc. from a package. Also in terms of user friendlyness having all basic functions in a single place seems better than having just a very restricted set in the stdlib. Dilum Aluthge: You can't add weights keywords to Statistics.mean, etc. But you could make separate functions StatsBase.mean, etc. such that Statistics.mean and StatsBase.mean are different generic functions. Then add weights keywords to the StatsBase.mean function. Dilum Aluthge: As a general rule, I think we should be moving things out of stdlibs, not moving more things into stdlibs.One exception to this general rule would be things that are required for Pkg to work. Those things need to be in stdlibs, because Pkg cannot have non-stdlib dependencies. (edited) Dilum Aluthge:

Once we move StatsBase functions to Statistics we won't be able to change them for a long time. In short: speak now or hold your peace until 2.0.

I think this is a good reason to avoid moving functionality to a stdlib. Julia 2.0 is not going to happen for a very long time. (edited) David Widmann:

That sounds like an argument against stdlibs in general, rather than an argument for not putting the most used statistical methods in an stdlib now we have them?

Sure, the argument applies more generally, whenever you consider moving something to a stdlib. But this does not invalidate it in this case, it applies to the StatsBase/Statistics discussion as well.

StatsBase is one of the oldest and most solid julia packages out there, with 100s of deps and a very stable api

StatsBase seems quite active, there are many open issues and PRs, and there were already 8 releases this year. So even if the API is stable (which I don't know for sure), the implementation seems to be updated and improved quite regularly which would be more difficult in a stdlib.

Oct 01 '21 19:10 devmotion

I think starting with the weights ones makes a lot of sense.

My only additional suggestion is to figure out a standard interface for multivariate stats (e.g. I would still love cov(eachrow(X))/cov(eachcol(X)), but this depends on https://github.com/JuliaLang/julia/pull/32310)

Oct 01 '21 22:10 simonbyrne

I generally agree with @devmotion's point on keeping this out of stdlib to make it easier to maintain it.

One concrete point regarding mean_and_var, this is very commonly used for computational efficiency reasons in GPs, see for example https://github.com/JuliaGaussianProcesses/AbstractGPs.jl/blob/master/src/exact_gpr_posterior.jl, so would be good to keep the compound forms!

Oct 06 '21 12:10 st--

@nalimilan - I have checked what would be a problem when merging StatsBase.jl into Statistics.jl because of external dependencies:

DataAPI:

describe (extended)

DataStructures:

heapify!, heappop!, percolate_down! (used for sampling)

LogExpFunctions:

xlogx, xlogy (needed in entropy and cross entropy)

Missings:

disallowmissing (used in pariwise)
missings (used in ranking)

SortingAlgorithms:

RadixSort (used in countmap)

StatsAPI: a massive dependency (I did not perform check - probably we would need to merge StatsAPI into StatsBase?)

Sep 01 '23 17:09 bkamins

Thanks. DataAPI and StatsAPI are the most problematic, as even if Statistics is part of the stdlib, some packages may not want to depend on it as loading it may take some time (in particular if we move lots of StatsBase code there). So making these depend on Statistics could be a problem.

StatsAPI is actually not used a lot: the modeling API is merely reexported for backward compatiblity reasons and doesn't need to be ported to Statistics; only pairwise and pairwise! are defined in StatsBase. It's used notably by Distances, which already depends on Statistics. Unfortunately, removing pairwise from StatsAPI would require tagging a breaking release, but maybe that would be OK if we also adjust compat bounds in the registry for packages that don't use pairwise.
Regarding describe in DataAPI, maybe we could avoid porting it to Statistics and only keep summarystats (whose definition is cleaner anyway since it returns a usable object instead of just printing it)?
Heap functions from DataStructures seem relatively short so we could copy them even if that's not ideal.
Same for xlogx and xlogy from LogExpFunctions, as these are trivial one-liners.
Same for disallowmissing and missings from Missings, which are just convenience wrappers around Base functions.
RadixSort from SortingAlgorithms (which is unmaintained) can be replaced with the radix sort implementation used by sort in Base by default for some types (https://github.com/JuliaStats/StatsBase.jl/pull/720.)

Sep 02 '23 16:09 nalimilan

An idea what to do:

if we copy the functions you commented on (like xlogx, missings etc.) into Statistics.jl its load time (from what was in StatsBase.jl) would go down, as the extra dependencies would not be loaded.
then maybe we could make DataAPI.jl and StatsAPI.jl depend on Statistics.jl? Then we would define in Statistics.jl only what is used there and make Statistics.jl an owner of these functions (and DataAPI.jl and StatsAPI.jl would just expose them). Another idea could be that DataAPI.jl and StatsAPI.jl could have Statistics.jl as a conditional dependency then?

Sep 02 '23 21:09 bkamins

Yeah we would need to check the load time of Statistics and see whether it affects that of e.g. Distances or not. I'm not sure how a conditional dependency would work, given that packages that currently overload functions defined in DataAPI or StatsAPI do not necessarily load Statistics, so that would be breaking (and if we break we may as well require such packages to overload the method defined in Statistics).

Sep 03 '23 14:09 nalimilan

so that would be breaking

Why would that be breaking? We do not need to load Statistics into scope (e.g. StatsAPI.jl could just load it internally). Also - it was a part of Julia distribution, so I guess packages were expected to implement non-conflicting functionality.

Sep 04 '23 06:09 bkamins

I mean that if a package wants to overload e.g. StatsAPI.pairwise, currently it doesn't need to depend on nor load Statistics. If we make Statistics a conditional dependency of StatsAPI, StatsAPI.pairwise will only be defined after loading Statistics, right?

Or do you have a trick in mind which would allow merging StatsAPI.pairwise and Statistics.pairwise when Statistics is loaded?

Sep 04 '23 07:09 nalimilan

will only be defined after loading Statistics, right?

You are right. I thought that this is highly unlikely, but maybe you are right that we should also safeguard ourselves against this.

Or do you have a trick in mind

I do not think this is possible.

But then the question is - maybe Statistics.jl could depend on StatsAPI.jl in the end? Do you think it would be a problem?

Sep 04 '23 10:09 bkamins

I guess that would imply adding it to the stdlib. It wouldn't be add significant overhead but it would be the only stdlib which doesn't actually provide any features usable on their own. It would also make it harder to tag StatsAPI 2.0 if needed at some point.

Hopefully we can simply make StatsAPI depend on Statistics.

Sep 04 '23 13:09 nalimilan

Statistics.jl Statistics.jl copied to clipboard

What to port from StatsBase

Statistics.jl
Statistics.jl copied to clipboard