Statistics.jl
Statistics.jl copied to clipboard
What to port from StatsBase
This issue is to discuss what functions should be ported from StatsBase to Statistics (#2). Some functions would better move to a separate package:
- statmodels.jl: should go to StatsAPI.jl
Most APIs have passed the test of time so they are probably good enough, but I find some of them are not completely satisfying:
- hist.jl: I don't know this part of the code enough to judge whether the API is OK. There have been proposals to move these to a separate package (https://github.com/JuliaStats/StatsBase.jl/issues/650).
- weights.jl: Weighted
sum
cannot be implemented via aweights
keyword arguments like other functions since the function lives in Base (https://github.com/JuliaLang/julia/pull/33310). We could either exportwsum
or keep it internal and do not support it for now. - counts.jl:
counts
sounds a bit too generic of a term for a function that only allows counting integer values.countmap
is more general and its name is explicit. That said,counts
could easily be extended to allow any type of levels -- its limitation is just that it returns a vector without names so the mapping to the levels has to be done by hand, which isn't user-friendly. APIs provided by FreqTables.jl are nicer to use, but they need NamedArrays.jl (or a similar package). Then there's the issue thatcountmap
uses radix sort for performance with some types, but this needs SortingAlgorithms.jl, which isn't a stdlib (yet?). - deviation.jl: Do we really need all of these small convenience functions?
counteq
andcountne
don't really sound like statistical functions and I'm not sure how commonly they are used.sqL2dist
,L2dist
,L1dist
,Linfdist
have an uppercase in their name; these and remaining functions are redundant with functions provided in Distances.jl. That only leavespsnr
. - misc.jl:
indexmap
is justindexin
so remove it.levelsmap
andindicatormat
sound a bit limited compared with what StatsModels provides.rle
andinverse_rle
are not really related to statistics. - scalarstats.jl:
mean_and_var
andmean_and_std
have weird names so I'm not sure we should keep them or not.zscore
andzscore!
are convenient but redundant with (more general and more verbose) functions in transformations.jl. - transformations.jl:
transform
andtransform!
are too generic names, I propose overloading LinearAlgebra'snormalize
andnormalize!
, since that name is actually the commonly used term for such transformations. I wonder whether we really needreconstruct
andreconstruct!
(which could be calledunnormalize
if we keep them). I'm also not sure what's the use of allowing a separatefit
operation before actually applying the transformation (I'd imagine one would always normalize the data immediately). - moments.jl:
moment
is redundant with specific functions so I'd drop it. - robust.jl:
trimvar(x)
could bevar(trim(x))
iftrim(x)
returned a special iterator type to dispatch on
See also my previous notes at https://github.com/JuliaLang/julia/pull/27152#issuecomment-390687566.
Let me ask top-down, as I was not involved much in earlier discussions. Is the plan the following:
- move whatever makes sense in Stabistics.jl there
- move API-kind definitions to StatsAPI.jl
- leave in StatsBase.jl all the things that do not fit points 1 and 2 above so that old codes keep working ?
Also then the question is when we move things from StatsBase.jl to Statistics.jl should we allow (and thus discuss) API changes or we only port things? If we allow API changes then probably we need to keep in StatsBase.jl some deprecations for the old API.
Finally (again - I do not want to mess things up too much), it would be great to start moving things to Statistics.jl with a clear statement how we want to handle:
- missing values
- weighting
in a consistent manner across all functions (it might have been discussed but I think it would be beneficial for many - for sure for me - to have it clearly stated, so then when I review the PR I can check every single function if it conforms to the assumed API). The point is that once we move things to Statistics.jl there will be no way to fix things before Julia 2.0 release.
Let me ask top-down, as I was not involved much in earlier discussions. Is the plan the following:
1. move whatever makes sense in Statistics.jl there 2. move API-kind definitions to StatsAPI.jl 3. leave in StatsBase.jl all the things that do not fit points 1 and 2 above so that old codes keep working?
Yes. Though for 3 we will have to carefully assess how we can reexport functions from Statistics to avoid conflicts on recent Julia versions.
Also then the question is when we move things from StatsBase.jl to Statistics.jl should we allow (and thus discuss) API changes or we only port things? If we allow API changes then probably we need to keep in StatsBase.jl some deprecations for the old API.
I'm afraid we'll have to apply some API changes (as noted above for some cases). So yeah adding deprecations to StatsBase would be nice for users, though not absolutely required in the short term as nobody will be forced to switch to Statistics (as long as we avoid/fix conflicts with StatsBase).
Finally (again - I do not want to mess things up too much), it would be great to start moving things to Statistics.jl with a clear statement how we want to handle:
* missing values * weighting
in a consistent manner across all functions (it might have been discussed but I think it would be beneficial for many - for sure for me - to have it clearly stated, so then when I review the PR I can check every single function if it conforms to the assumed API). The point is that once we move things to Statistics.jl there will be no way to fix things before Julia 2.0 release.
Good point.
For weights, I think the majority want to use the weights
keyword argument, which makes it possible to accept any AbstractVector
in most functions (the exception being var
where the type of weights is needed to choose the correction, and quantile
). There have been counter-proposals at https://github.com/JuliaLang/julia/pull/33310 to use a more general weighted(x, w)
approach. But I don't think that's super useful in the end as weighting is generally not just multiplying values by weights (that's only the case for sums and means, but not e.g. for quantile
or var
), so it would only be used for dispatch.
Missing values are a more tricky issue. For unweighted single-argument functions, f(skipmissing(x))
works well. For multiple-argument functions like cor
we don't have a great solution yet, but this is orthogonal to importing StatsBase. What isn't orthogonal to StatsBase is how to skip missing values with weighted functions. Let's discuss this at https://github.com/JuliaLang/Statistics.jl/issues/88 as it's quite complex.
countmap
, the name is not so lucky, the counts(x, k::Integer, [wv::AbstractWeights])
("If an integer k is provided, only values in the range 1:k will be considered.") is a bit non-sensical, but I think a good counts should be part of the statistics package, giving the user a vector of pairs or a table for every input where unique
is meaningful and a version of it applying a binning map first.
What would you suggest then? Have counts(x)
equivalent to countmap
, and keep a counts(x, k)
method? Whether the result of counts(x)
is a vector of pairs or a dict could be specified via counts(Dict/Vector, x)
. (Unfortunately we can't return a table as we would need to depend on something like NamedArrays.)
Applying a binning map is what histograms do, right?
counts(Dict/Vector, x).
That reads right.
Applying a binning map is what histograms do, right?
The actual work in histograms is to find the binning map, right? Otherwise it could really just be count(to_bin(range), Dict/Vector, x)
with to_bin(range)
giving a function which assigns each x
the right bin, say as Pair(a,b)
of a sorted vector or StepRange
argument.
OK so basically we could also add a counts(f, x)
and/or counts(f, Dict/Vector, x)
method that would call f
on each element before "counting" it. Then the API for a to_bin
function can be developed separately.
One more thing
rle
andinverse_rle
are not really related to statistics.
I think the keyword is "average run length", which is important for statistics and the reason why rle
found a place in StatsBase
. It's a bit of a riddle how to get arl
from rle
to me at the moment though.
I tend to think in general it would be better to move things from Statistics to StatsBase (or some other package). For posterity a copy of the discussion on Slack:
David Widmann: I guess this has been discussed at length but exactly this issue makes me wonder if it is a good idea at all to move StatsBase to Statistics. I assume development would slow down, it would take more time until bugfixes are available to users, and breaking changes would not be possible until Julia 2.0. What is the main advantage of moving StatsBase to Statistics? That it seems more official to users and they don't have to load both Statistics and StatsBase? ... Alejandro Merchan: I agree with @devmotion, I thought the idea always has been to get things out of Base and into packages. Why that change of mind? Michael K. Borregaard: Really stable APIs live well in stdlibs, it means people can really rely on them in their development David Widmann: semver should already protect you from breaking changes :man_shrugging: In my experience, even if the API seems stable, it is really useful to have the possibility to update and improve it if one realizes at a later point that some choice was suboptimal. And if something lives in a stdlib one can't just make a bugfix release but has to wait for the next Julia release and it might never be available in older Julia versions if it is not backported (correct me if I'm wrong). So it seems not only the API but ideally also the implementation itself should be perfect if something becomes part of a stdlib. Michael K. Borregaard: That sounds like an argument against stdlibs in general, rather than an argument for not putting the most used statistical methods in an stdlib now we have them? StatsBase is one of the oldest and most solid julia packages out there, with 100s of deps and a very stable api Milan Bouchet-Valat: One reason to move things to Statistics is that we can't add a weights keyword argument to mean, quantile, etc. from a package. Also in terms of user friendlyness having all basic functions in a single place seems better than having just a very restricted set in the stdlib. Dilum Aluthge: You can't add weights keywords to Statistics.mean, etc. But you could make separate functions StatsBase.mean, etc. such that Statistics.mean and StatsBase.mean are different generic functions. Then add weights keywords to the StatsBase.mean function. Dilum Aluthge: As a general rule, I think we should be moving things out of stdlibs, not moving more things into stdlibs.One exception to this general rule would be things that are required for Pkg to work. Those things need to be in stdlibs, because Pkg cannot have non-stdlib dependencies. (edited) Dilum Aluthge:
Once we move StatsBase functions to Statistics we won't be able to change them for a long time. In short: speak now or hold your peace until 2.0.
I think this is a good reason to avoid moving functionality to a stdlib. Julia 2.0 is not going to happen for a very long time. (edited) David Widmann:
That sounds like an argument against stdlibs in general, rather than an argument for not putting the most used statistical methods in an stdlib now we have them?
Sure, the argument applies more generally, whenever you consider moving something to a stdlib. But this does not invalidate it in this case, it applies to the StatsBase/Statistics discussion as well.
StatsBase is one of the oldest and most solid julia packages out there, with 100s of deps and a very stable api
StatsBase seems quite active, there are many open issues and PRs, and there were already 8 releases this year. So even if the API is stable (which I don't know for sure), the implementation seems to be updated and improved quite regularly which would be more difficult in a stdlib.
I think starting with the weights
ones makes a lot of sense.
My only additional suggestion is to figure out a standard interface for multivariate stats (e.g. I would still love cov(eachrow(X))
/cov(eachcol(X))
, but this depends on https://github.com/JuliaLang/julia/pull/32310)
I generally agree with @devmotion's point on keeping this out of stdlib to make it easier to maintain it.
One concrete point regarding mean_and_var, this is very commonly used for computational efficiency reasons in GPs, see for example https://github.com/JuliaGaussianProcesses/AbstractGPs.jl/blob/master/src/exact_gpr_posterior.jl, so would be good to keep the compound forms!
@nalimilan - I have checked what would be a problem when merging StatsBase.jl into Statistics.jl because of external dependencies:
DataAPI:
- describe (extended)
DataStructures:
- heapify!, heappop!, percolate_down! (used for sampling)
LogExpFunctions:
- xlogx, xlogy (needed in entropy and cross entropy)
Missings:
- disallowmissing (used in pariwise)
- missings (used in ranking)
SortingAlgorithms:
- RadixSort (used in countmap)
StatsAPI: a massive dependency (I did not perform check - probably we would need to merge StatsAPI into StatsBase?)
Thanks. DataAPI and StatsAPI are the most problematic, as even if Statistics is part of the stdlib, some packages may not want to depend on it as loading it may take some time (in particular if we move lots of StatsBase code there). So making these depend on Statistics could be a problem.
-
StatsAPI is actually not used a lot: the modeling API is merely reexported for backward compatiblity reasons and doesn't need to be ported to Statistics; only
pairwise
andpairwise!
are defined in StatsBase. It's used notably by Distances, which already depends on Statistics. Unfortunately, removingpairwise
from StatsAPI would require tagging a breaking release, but maybe that would be OK if we also adjust compat bounds in the registry for packages that don't usepairwise
. -
Regarding
describe
in DataAPI, maybe we could avoid porting it to Statistics and only keepsummarystats
(whose definition is cleaner anyway since it returns a usable object instead of just printing it)? -
Heap functions from DataStructures seem relatively short so we could copy them even if that's not ideal.
-
Same for
xlogx
andxlogy
from LogExpFunctions, as these are trivial one-liners. -
Same for
disallowmissing
andmissings
from Missings, which are just convenience wrappers around Base functions. -
RadixSort
from SortingAlgorithms (which is unmaintained) can be replaced with the radix sort implementation used bysort
in Base by default for some types (https://github.com/JuliaStats/StatsBase.jl/pull/720.)
An idea what to do:
- if we copy the functions you commented on (like
xlogx
,missings
etc.) into Statistics.jl its load time (from what was in StatsBase.jl) would go down, as the extra dependencies would not be loaded. - then maybe we could make DataAPI.jl and StatsAPI.jl depend on Statistics.jl? Then we would define in Statistics.jl only what is used there and make Statistics.jl an owner of these functions (and DataAPI.jl and StatsAPI.jl would just expose them). Another idea could be that DataAPI.jl and StatsAPI.jl could have Statistics.jl as a conditional dependency then?
Yeah we would need to check the load time of Statistics and see whether it affects that of e.g. Distances or not. I'm not sure how a conditional dependency would work, given that packages that currently overload functions defined in DataAPI or StatsAPI do not necessarily load Statistics, so that would be breaking (and if we break we may as well require such packages to overload the method defined in Statistics).
so that would be breaking
Why would that be breaking? We do not need to load Statistics
into scope (e.g. StatsAPI.jl could just load it internally). Also - it was a part of Julia distribution, so I guess packages were expected to implement non-conflicting functionality.
I mean that if a package wants to overload e.g. StatsAPI.pairwise
, currently it doesn't need to depend on nor load Statistics. If we make Statistics a conditional dependency of StatsAPI, StatsAPI.pairwise
will only be defined after loading Statistics, right?
Or do you have a trick in mind which would allow merging StatsAPI.pairwise
and Statistics.pairwise
when Statistics is loaded?
will only be defined after loading Statistics, right?
You are right. I thought that this is highly unlikely, but maybe you are right that we should also safeguard ourselves against this.
Or do you have a trick in mind
I do not think this is possible.
But then the question is - maybe Statistics.jl could depend on StatsAPI.jl in the end? Do you think it would be a problem?
I guess that would imply adding it to the stdlib. It wouldn't be add significant overhead but it would be the only stdlib which doesn't actually provide any features usable on their own. It would also make it harder to tag StatsAPI 2.0 if needed at some point.
Hopefully we can simply make StatsAPI depend on Statistics.