Remove Statistics stdlib and copy `mean`, `std` and `var` from it

Open nalimilan opened this issue 1 year ago • 51 comments

This finishes the removal of the Statistics stdlib started at https://github.com/JuliaLang/julia/pull/45594, to make it a separate package. We will then be able to merge Statistics and StatsBase to avoid splitting basic stats functions across multiple modules.

Since there are concerns that users will complain about the need to install a package just to compute the mean and standard deviation, the PR also copies the code from mean, std and var and exports these functions from Base. This is yet one more step in a saga started by https://github.com/JuliaLang/julia/pull/27152. I'm not too happy about this since the choice of functions that live in Base is kind of arbitrary. In particular, it's not great that weighted methods defined in StatsBase (to be merged with Statistics) are documented in a different place from the unweighted ones, and we had plans to make weights a keyword argument, which makes dispatching on weights types defined in StatsBase tricky if functions are defined in Base. But well... at least the situation after the PR is less confusing than having both Statistics and StatsBase.

Code is just copied from Statistics, the only changes are:

  • add compatibility note to docstrings
  • remove references to Statistics
  • change tests to use the mean keyword argument instead of stdm/varm
  • do not test sparse matrices as they are not available in Base

nalimilan avatar Aug 26 '22 21:08 nalimilan

In my opinion, adding 438 lines of statistics code to Base kind of defeats the point of excising the Statistics stdlib. Because we'll still have 438 lines of code that we can't make breaking changes to, etc.

For the most common use cases, e.g. mean(f, arr), is it really unreasonable to expect users to just write sum(f.(arr))/length(arr) themselves? And for anything more complicated, shouldn't they just install the Statistics package anyway?
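As a minimal sketch of the hand-written pattern mentioned above (the array and the choice of `abs2` for `f` are illustrative assumptions), the broadcast form can be compared with the two-argument `sum(f, arr)`, which gives the same result without materializing `f.(arr)`:

```julia
arr = [1.0, 2.0, 3.0, 4.0]    # illustrative data
f = abs2                      # illustrative choice; any one-argument function works

# Broadcasting builds the temporary array f.(arr) before summing it.
mean_broadcast = sum(f.(arr)) / length(arr)

# sum(f, arr) applies f while reducing, so no temporary array is needed.
mean_mapped = sum(f, arr) / length(arr)
```

Both forms rely on `length(arr)`, so they only apply to collections with a known length.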

There's even less friction for adding packages now, since doing import Foo or using Foo in the REPL would automatically prompt the user, and they can just press Enter to install the package.

DilumAluthge avatar Aug 27 '22 06:08 DilumAluthge

Also, what makes mean, std, and var special enough that they need to be included in Base, but other functions don't need to be?

DilumAluthge avatar Aug 27 '22 06:08 DilumAluthge

I mostly agree. We could also just remove Statistics for now and see how strongly people complain. As you say, with Pkg improvements, installing an external package now only requires typing y after using Statistics so it's not a big effort to ask even from newcomers.

That said the previous attempt ended up being reverted (see https://github.com/JuliaLang/julia/issues/27374) so we kind of know that people do complain. The situation may be a bit different now given that even sparse matrix support has been moved out of the stdlib.

nalimilan avatar Aug 27 '22 11:08 nalimilan

@nalimilan - thank you for your effort.

Could you please summarize:

  1. the design of the ecosystem of Base/Statistics.jl/StatsBase.jl in Julia 1.9 release.
  2. the target design of the Base/Statistics.jl/StatsBase.jl ecosystem (if it is different than the one for 1.9 release)
  3. How do we expect users to write code that would work under: Julia 1.6 LTS, Julia 1.8, Julia 1.9 and target (if target is different). In particular how would existing code using Statistics.jl and StatsBase.jl be affected?

Thank you!

bkamins avatar Aug 27 '22 12:08 bkamins

People starting to use sum(x)/length(x) could be an indicator of friction caused by moving mean to the periphery

mschauer avatar Aug 27 '22 12:08 mschauer

People starting to use sum(x)/length(x)

This pattern is plain incorrect as a replacement for mean: it does not work for an x that does not support computing its length in O(1), because length is undefined in such cases.
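A small sketch of the failure mode, assuming a filtered generator as the input (the variable names are illustrative only):

```julia
gen = (i for i in 1:3 if i > 1.5)

# The filter means the element count is not known without iterating.
size_trait = Base.IteratorSize(gen)

# `length` has no method for this iterator, so sum(gen) / length(gen) cannot work.
length_fails = try
    length(gen)
    false
catch err
    err isa MethodError
end

# One possible fallback: accumulate the total and the count in a single pass.
total, n_items = foldl(((s, n), v) -> (s + v, n + 1), gen; init = (0.0, 0))
fallback_mean = total / n_items
```

The fallback is only one way to handle such iterators; it is not claimed to match what Statistics does internally.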

bkamins avatar Aug 27 '22 12:08 bkamins

mean, var and std are considered common functions, and not special for statistics. People have a very strong objection to needing to install a package for mean, especially. I personally tend to agree with that view myself.

Sparse matrices are still far less common than mean. Also the sparse package is not out of stdlib yet, and has been brought back into the system image until we have conditional dependencies.

ViralBShah avatar Aug 27 '22 12:08 ViralBShah

I'll concede that mean(a::AbstractArray) and var(a::AbstractArray) (where the latter would be returning the unbiased sample variance) would be considered common methods to most people. If this PR only added those two methods, I'll withdraw my objection.

But this PR adds a lot of other methods as well. It's harder for me to buy the argument that all of those methods would be considered common/non-statistical to most people.

I'm a little concerned about the slippery slope here. If we add those methods, then why not add weights? It's not clear to me why some functionality gets to be included, and other functionality is excluded.

DilumAluthge avatar Aug 27 '22 12:08 DilumAluthge

On a separate note, I'm not sure why we need to add std(a) to Base - users can't do sqrt(var(a))?
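For reference, a minimal sketch of that relationship written without any package, assuming the unbiased (n - 1) sample variance mentioned earlier in the thread; the data vector is illustrative:

```julia
a = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative data

m = sum(a) / length(a)                    # arithmetic mean
v = sum(abs2, a .- m) / (length(a) - 1)   # unbiased sample variance
s = sqrt(v)                               # standard deviation as sqrt of the variance
```

This two-pass form needs `length(a)`, so it applies to arrays but not to iterators of unknown length.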

DilumAluthge avatar Aug 27 '22 12:08 DilumAluthge

While waiting for a comment by @nalimilan about the long-term plan, I think that if we agree mean([1, 2, 3]) should be included then also e.g. something like mean(i for i in 1:3 if i > 1.5) should be included. The cases when length is not defined are pretty common.

bkamins avatar Aug 27 '22 12:08 bkamins

I think I can probably be talked into adding just mean(iterable) and var(iterable), with support for arbitrary iterables, including unknown length.

But with a long-term plan that makes it explicit that we won't be adding any additional methods (including positional args or keyword args) to mean and var.
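To make the "arbitrary iterables, including unknown length" requirement concrete, here is a rough sketch of a single-pass mean and variance using Welford's update. The function name `mean_and_var` is hypothetical and this is not the implementation proposed in the PR:

```julia
# Hypothetical helper: one pass over any iterable, no call to `length`.
function mean_and_var(itr)
    n = 0
    m = 0.0      # running mean
    m2 = 0.0     # running sum of squared deviations from the mean
    for x in itr
        n += 1
        d = x - m
        m += d / n
        m2 += d * (x - m)
    end
    n == 0 && throw(ArgumentError("mean of an empty collection is undefined"))
    # Unbiased sample variance; NaN when only one element was seen.
    return m, (n > 1 ? m2 / (n - 1) : NaN)
end

m, v = mean_and_var(x for x in 1:5 if isodd(x))   # iterates over 1, 3, 5
```

The sketch always accumulates in Float64, which is a simplification; a real method would need to handle other element types.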

DilumAluthge avatar Aug 27 '22 12:08 DilumAluthge

Also, I'd be happier with variance instead of var, but that can be infinitely bikeshed, and that's not the hill I'm going to die on.

DilumAluthge avatar Aug 27 '22 12:08 DilumAluthge

We can bring the weights in as well. I would not be opposed to that, but I suspect that will need to bring in a whole lot of other machinery around weights that is better managed outside.

Numpy actually goes much further in what it considers common functions and has a whole statistics module: https://numpy.org/doc/stable/reference/routines.statistics.html. Now numpy is itself a python package, so it is a little bit apples and oranges.

ViralBShah avatar Aug 27 '22 13:08 ViralBShah

BTW, the one thing I am not sure of is whether, with the package manager improvements of the last 4 years, people will have the same reaction to mean not being in Base as they did in 2018. I suspect there will be less opposition, but there will still be quite a lot of discontent.

What if Base had, say, average, and we then have everything else (mean, std, var) in Statistics?

ViralBShah avatar Aug 27 '22 13:08 ViralBShah

What is average here? Just the regular unweighted mean of an array or iterable? E.g. average(arr::AbstractArray) or average(iterable)?

DilumAluthge avatar Aug 27 '22 13:08 DilumAluthge

Could you please summarize:

1. the design of the ecosystem of Base/Statistics.jl/StatsBase.jl in Julia 1.9 release.

2. the target design of the Base/Statistics.jl/StatsBase.jl ecosystem (if it is different than the one for 1.9 release)

3. How do we expect users to write code that would work under: Julia 1.6 LTS, Julia 1.8, Julia 1.9 and target (if target is different). In particular how would existing code using Statistics.jl and StatsBase.jl be affected?

@bkamins As I see it, the goal is to have Statistics as a standalone package to which we will be able to move most or all of the StatsBase features (details to be discussed, we probably don't want to import deprecated API), and to deprecate StatsBase in the long term. For Julia 1.9 probably the only change will be that Statistics is no longer an stdlib module but a separate package, as importing things from StatsBase will take some work.

Nothing should break for users of StatsBase and Statistics, except for having to install Statistics and add a Statistics = "1" compat version bound to Project.toml (which will be done automatically in the registry like https://github.com/JuliaRegistries/General/pull/66039/commits/c56dc97da98c1cb5c05671837ac500ecded805b9). But since Statistics cannot be upgraded on older Julia versions, packages will likely continue to use StatsBase for a long time (at least until the next LTS is out). Actually it could make sense to have StatsBase merely reexport functions from Statistics on recent Julia versions, a bit like Compat does for Julia Base. The exact implementation still needs some thought, but I don't think we need to sort it out to make a decision regarding this PR.

nalimilan avatar Aug 27 '22 13:08 nalimilan

I believe JuliaCon 2023 could be the right timeframe for the next LTS. Hopefully we have conditional dependencies in place by then, and can even move out SparseArrays and perhaps a few other things as well for the new LTS.

ViralBShah avatar Aug 27 '22 13:08 ViralBShah

@DilumAluthge @ViralBShah The reason why I think weighted stats should not be included is that it would pull in a whole machinery of weight vectors types which really cannot live in Base as they are too specialized. That's not a problem for mean as we could just have a keyword argument taking any AbstractVector (including weight vectors), but for std and var the correction factor depends on the kind of weight so the types would have to live in Base too.
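A rough sketch of the keyword-argument idea for mean, where the weights can be any AbstractVector. The function name `mean_kw` and the keyword name `weights` are assumptions for illustration, not an existing Base or Statistics API:

```julia
# Hypothetical function: unweighted by default, weighted when `weights` is given.
function mean_kw(x::AbstractVector; weights::Union{Nothing,AbstractVector} = nothing)
    weights === nothing && return sum(x) / length(x)
    length(weights) == length(x) ||
        throw(DimensionMismatch("x and weights must have the same length"))
    return sum(x .* weights) / sum(weights)
end

unweighted = mean_kw([1.0, 2.0, 3.0])
weighted = mean_kw([1.0, 2.0, 3.0]; weights = [1, 1, 2])
```

Nothing here depends on the kind of weights, which is why mean is the easy case; the sketch deliberately leaves out std and var, where the correction factor would depend on the weight type.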

A better comparison than Numpy is probably Python's statistics standard library:

The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.

It is very similar to our current Statistics stdlib, but it also provides mode, harmonic and geometric mean, and basic linear regression. So if we took Python as a reference we would favor the status quo. :-) But the situation isn't ideal in Python as NumPy provides separate definitions, so you have multiple functions to do the same thing.

Conversely, Rust, Go and Swift don't provide any statistical functions in their standard libraries, not even mean.

nalimilan avatar Aug 27 '22 13:08 nalimilan

@ViralBShah - is changing a module in which some name is defined (mean, mean!, var, and std in the case of this PR) considered non-breaking?

bkamins avatar Aug 27 '22 15:08 bkamins

I use basic statistical functions in my work in signal processing all the time, but I am not a professional statistician. I am not concerned about having to import a package to use mean, var, median and other similar functions. What I am concerned about, though, is the future direction of the Statistics package and the implications for people like me who just need basic functionality and whose data is mostly represented as regular, plain vectors and matrices.

Recent discussions over in Discourse (see for example here) make it clear that the desire of the professional statisticians in the community is to adapt Statistics.jl to meet their needs. Among other things, there is a clear reluctance to continue to support functions that take anything other than tables. Such changes would make it difficult and inconvenient for me (and I suspect, many others) to use that package in my own projects.

My concrete proposal is to excise statistics from stdlib if you must, but give a different name to the new package (maybe BasicStats or ClassicStats [or NonSeriousStats :wink: ]). Then the serious statisticians would be free to drive Statistics.jl forward, while the rest of us would continue to use the more basic functions.

mbaz avatar Aug 27 '22 17:08 mbaz

@ViralBShah - is changing a module in which some name is defined (mean, mean!, var, and std in the case of this PR) considered non-breaking?

Perhaps this is more of a @KristofferC or @StefanKarpinski question. I believe that since these are exported names, the module name is less important. And of course, the Statistics.jl package can still provide compatibility.

ViralBShah avatar Aug 27 '22 17:08 ViralBShah

Speaking for the serious statisticians: we have no interest in developing the package in a direction where it stops being practical.

mschauer avatar Aug 27 '22 18:08 mschauer

Recent discussions over in Discourse (see for example here) make it clear that the desire of the professional statisticians in the community is to adapt Statistics.jl to meet their needs. Among other things, there is a clear reluctance to continue to support functions that take anything other than tables. Such changes would make it difficult and inconvenient for me (and I suspect, many others) to use that package in my own projects.

AFAICT this is only what @juliohm proposed. None of the main developers of Statistics and StatsBase have suggested dropping the existing API in favor of another API requiring to use Tables.jl. My goal with this PR is merely to merge Statistics and StatsBase, without changing most of their features.

nalimilan avatar Aug 27 '22 18:08 nalimilan

For the most common use cases, e.g. mean(f, arr), is it really unreasonable to expect users to just write sum(f.(arr))/length(arr) themselves? And for anything more complicated, shouldn't they just install the Statistics package anyway?

I cannot emphasize enough how well @DilumAluthge's statement captures my feelings about the status quo. It would be much much much simpler to erase Statistics from Base and let the stats/tables community drive the evolution of these features forward. We have so many talented people driving DataFrames.jl, Tables.jl, StatsAPI.jl, ..., there is enough movement to advance things further in major ways without waiting for Julia release cycles.

Also, what makes mean, std, and var special enough that they need to be included in Base, but other functions don't need to be?

Also agree with @DilumAluthge here. I don't think these three convenience functions justify all the contortion that stats developers have to make to maintain multiple dependencies. When people argue that mean should be part of Base, they mean they want a one-liner convenience function that is equivalent to sum(x) / length(x) or something similar for iterables. There is so much more one can do with mean/std/var in a statistical context! Did you know for example that GeoStats.jl's mean, std and quantile are aware of geospatial coordinates? There are so many other features that are missing in Base, which will take forever to implement with Julia release cycles.

We could also just remove Statistics for now and see how strongly people complain.

That is my number one wish @nalimilan 🙏🏽

What if Base had, say, average, and we then have everything else (mean, std, var) in Statistics?

Love this idea as well @ViralBShah ! Just give users alternative convenience functions like avg with sensible and basic implementations and move everything else to outside of Base where weights, stats corrections are considered.

What I am concerned about, though, is the future direction of the Statistics package and the implications for people like me who just need basic functionality and whose data is mostly represented as regular, plain vectors and matrices.

I don't see how these future directions can affect operations with regular, plain vectors and matrices @mbaz. First you need to make sure that you are talking about the statistical definitions of mean, std, var with weights, corrections etc. Then, you need to ask yourself if any effort operating over rows/columns of tables would necessarily exclude low-level functions for arrays. My understanding is that even if we move forward with tables as first-class objects, there will be low-level functions still that could be exported for users interested in plain arrays. That is super simple and shouldn't affect our decision process here in my opinion.

juliohm avatar Aug 27 '22 19:08 juliohm

I like the idea of a simple average in Base. There was a big discussion about why using Statistics is necessary for mean, but that is how Julia is now and it is OK. Putting mean, std and var in Base would be a step in the wrong direction.

strickek avatar Aug 28 '22 07:08 strickek

the one thing that I am not sure of is, that with the package manager improvements in the last 4 years, will people have the same reaction to mean not being in Base as they did in 2018.

The package manager has improved greatly but it doesn't actually solve the simplest case here, which is that a new user will fire up the REPL and try to compute a basic quantity like the mean, only to be faced with:

julia> mean(x)
ERROR: UndefVarError: mean not defined
Stacktrace:
 [1] top-level scope
   @ REPL[1]:1

This doesn't tell you that you could compute a mean if you load a package first, nor which package would provide it. So the improvements to the package manager don't really affect the initial experience, they've just improved the experience once you figure out what you need to do to compute a mean.

What if Base had, say, average, and we then have everything else (mean, std, var) in Statistics?

This actually seems worse to me, since it would be unclear whether developers should define methods for mean or average. (The term "average" is also technically ambiguous, as it can also refer to the median or mode depending on the context, though arithmetic mean is obviously most common.) The situation would be somewhat like what Milan noted with separate functions in Python and NumPy that do the same thing.

There is so much more one can do with mean/std/var in a statistical context! [...] There are so many other features that are missing in Base, which will take forever to implement with Julia release cycles.

Packages can define methods for mean that provide all kinds of additional functionality, that's the ✨ m a g i c ✨ of dispatch. Just because mean lives somewhere doesn't mean that all possible methods it could have need to live in the same place. Providing the basic definitions that act on Base types—what Milan has done here—gives the majority of users as much as they need. Why would people be waiting on release cycles? If Base defines a new type and a mean method for it in some new release, the mean method can be added to Compat alongside the type, as is done for any other Base function.
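The dispatch argument can be sketched with two toy modules. `MiniBase` stands in for wherever the generic function lives and `GeoPkg` for a downstream package; both names and the `WeightedSample` type are hypothetical:

```julia
# Stand-in for the module that owns the generic function (hypothetical).
module MiniBase
export mean
mean(itr) = sum(itr) / length(itr)
end

# Stand-in for a downstream package that extends it for its own type (hypothetical).
module GeoPkg
import ..MiniBase: mean

struct WeightedSample
    values::Vector{Float64}
    weights::Vector{Float64}
end

# A new method of the same function; MiniBase itself is not modified.
mean(s::WeightedSample) = sum(s.values .* s.weights) / sum(s.weights)
end

using .MiniBase
plain = mean([1, 2, 3])
custom = mean(GeoPkg.WeightedSample([1.0, 3.0], [3.0, 1.0]))
```

The generic definition and the specialized method live in different modules but are reached through the same `mean` name, which is the point being made about not needing every method in one place.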

is changing a module in which some name is defined (mean, mean!, var, and std in the case of this PR) considered non-breaking?

It is not breaking, see e.g. https://github.com/JuliaLang/julia/pull/35628 for prior art.

what makes mean, std, and var special enough that they need to be included in Base, but other functions don't need to be?

They're ubiquitous in any discipline that deals with numbers. I think just about everyone learns them in school at a fairly young age. I remember computing standard deviations in my 7th grade math class, and it wasn't an honors class or anything.

ararslan avatar Aug 30 '22 00:08 ararslan

The package manager has improved greatly but it doesn't actually solve the simplest case here, which is that a new user will fire up the REPL and try to compute a basic quantity like the mean, only to be faced with:

We already have ways to hook into errors though:

Base.Experimental.register_error_hint(UndefVarError) do io, exc
    if exc.var == :mean
        print(io, """\nmean is provided by the Statistics.jl package, run \
                     `using Pkg; Pkg.add(\"Statistics\"); using Statistics` \
                     to install and load it.""")
    end
end

julia> mean
ERROR: UndefVarError: mean not defined
mean is provided by the Statistics.jl package, run `using Pkg; Pkg.add("Statistics"); using Statistics` to install and load it.

KristofferC avatar Aug 30 '22 10:08 KristofferC

That's cool! Would it make sense to open an issue and think about which functions this would be a good idea for, to soften the landing for newcomers? For example I, which often shows up in formulas when using LinearAlgebra has not been run in that context.

mschauer avatar Aug 30 '22 10:08 mschauer

We already have ways to hook into errors though

TIL, that's very cool. We could even go a step further and do

Base.Experimental.register_error_hint(UndefVarError) do io, ex
    for package in packages  # just some predefined list of (ex-)stdlibs
        # only suggest a package that both defines and exports the missing name
        if isdefined(package, ex.var) && Base.isexported(package, ex.var)
            print(io, "\n", ex.var, " is provided by the ", package, " package, run ",
                  "`using Pkg; Pkg.add(\"", package, "\"); using ", package,
                  "` to install and load it.")
        end
    end
end

which avoids needing to determine some list of identifiers to provide this for, though it would require determining a list of packages. From the user's perspective, it'd be somewhat like the experience for @deprecate_moved, which was employed heavily back when stuff was first moving out of Base. What I've described here is kind of an aside though; at least using this as you noted for mean would address my first point.

That said, I still think Milan's approach in this PR makes sense and would be a net improvement.

I'd be happier with variance instead of var

FWIW this seems reasonable if people are worried about Base "stealing" short names (and we could rename std to stddev analogously) but it's also worth noting that the functions have been around forever under these names, so keeping them as-is seems most convenient.

ararslan avatar Aug 30 '22 14:08 ararslan

I think concerns about having to manually install Statistics.jl as a package are overblown. Multiple extremely popular packages directly or transitively depend on Statistics, so it will be installed for most users automatically. So if mean, std, and var are not added to Base, the user experience will be basically unchanged:

julia> using Statistics # currently/already necessary
 │ Package Statistics not found, but a package named Statistics is available from a registry.  # ╮
 │ Install package?                                                                            # │─ Very unlikely
 │   (@v1.8) pkg> add Statistics                                                               # │
 └ (y/n/o) [y]:                                                                                # ╯

julia> mean(x)

I don't think that adding mean, std, and var to Base is necessary to de-stdlib-ify Statistics. (And it would be easy to add them after the fact if there is a large enough outcry.)

halleysfifthinc avatar Aug 31 '22 16:08 halleysfifthinc