StatsBase.jl icon indicating copy to clipboard operation
StatsBase.jl copied to clipboard

Proposal: Move histograms to separate package

Open oschulz opened this issue 3 years ago • 14 comments

Histograms are kind of a standalone thing in StatsBase right now - they're not used anywhere in StatsBase, and they use little to no functionality from StatsBase themselves. So I propose to factor them out into a separate package "JuliaStats/Histograms.jl".

StatsBase.jl could reexport Histograms.jl for a while, to keep compatibility, and then stop reexporting it with the next breaking release of StatsBase.

@andreasnoack , @simonbyrne , what do you think?

oschulz avatar Jan 30 '21 12:01 oschulz

I'd be willing to take care of the complete transition.

oschulz avatar Jan 30 '21 12:01 oschulz

What would be the advantage of doing this? I mean, most parts of StatsBase could live independently from each other, but that makes it hard for users to find where the functionality lives.

nalimilan avatar Jan 31 '21 11:01 nalimilan

Hm, I had the impression that other parts of StatsBase were coupled more tightly. And it would make it easier to evolve Histograms since StatsBase is such a central package that 0.x updates with breaking changes take ages to propagate though the ecosystem - so it takes a long time before higher-level packages can access the new version. Far, far fewer packages use histograms than use StatsBase. And just in general I thought more modularity would be a good thing. Also, thought (I speak under correction) that there was a longer term plan to restructure/modularize StatsBase anyhow (move parts to Statistics, other structural changes)?

oschulz avatar Jan 31 '21 12:01 oschulz

Yes the plan has been for a long time to move essential features to Statistics and other parts to specialized packages. So I assume you think histograms belong to the latter rather than the former?

And it would make it easier to evolve Histograms since StatsBase is such a central package that 0.x updates with breaking changes take ages to propagate though the ecosystem - so it takes a long time before higher-level packages can access the new version.

Do you mean that you would like to make breaking changes to histogram code? Otherwise we can tag bugfix releases of StatsBase. We could also tag 1.0 so that we can make minor releases when we add new features.

nalimilan avatar Jan 31 '21 17:01 nalimilan

So I assume you think histograms belong to the latter rather than the former?

Yes, I think both StatsBase and a new "Histograms" package could both depends on Statistics without depending on each other.

Do you mean that you would like to make breaking changes to histogram code?

Well, #513 and #514 (dragged them on too long, but I really want to work on those now) will require changes to the Histogram struct that may be too substantial for a bugfix release. If there are breaking will depend on how I do it exactly - and also on whether addition of new type parameters (array type, for #513) would be considered breaking or not.

This can all be done within StateBase, of course, I just thought that if I'm going to do a bit of an overhaul on Histogram, it would be a good time to factor it out. StatsBase is a bit of a monster by now ... and maybe package load times would profit as well from increased modularity?

oschulz avatar Jan 31 '21 20:01 oschulz

I can't remember why exactly, but KernelDensity.jl has its own implementation: https://github.com/JuliaStats/KernelDensity.jl/blob/2c91f3f815900fc0dbcb2aef638b89a200604a16/src/univariate.jl#L99-L119 it would be good to make them able to use the same underlying functions.

simonbyrne avatar Jan 31 '21 22:01 simonbyrne

I agree.

oschulz avatar Feb 01 '21 02:02 oschulz

Well, #513 and #514 (dragged them on too long, but I really want to work on those now) will require changes to the Histogram struct that may be too substantial for a bugfix release. If there are breaking will depend on how I do it exactly - and also on whether addition of new type parameters (array type, for #513) would be considered breaking or not.

Adding type parameters or changing the internal fields of the struct isn't considered as breaking. Anyway regarding #514 it seems that @andreasnoack wanted to see a proposal first and that didn't happen yet. So better decide whether we want to make changes before considering them as breaking. Also since we want to move to 1.0, being breaking isn't the end of the world, that's a good occasion to stop using 0.x.

This can all be done within StateBase, of course, I just thought that if I'm going to do a bit of an overhaul on Histogram, it would be a good time to factor it out. StatsBase is a bit of a monster by now ... and maybe package load times would profit as well from increased modularity?

"Monster" sounds like an exaggeration. StatsBase seems to load quite quickly, and it's much smaller than many packages. Yes it's a collection of many small functions in different areas, but that makes them easier to discover.

I can't remember why exactly, but KernelDensity.jl has its own implementation: https://github.com/JuliaStats/KernelDensity.jl/blob/2c91f3f815900fc0dbcb2aef638b89a200604a16/src/univariate.jl#L99-L119 it would be good to make them able to use the same underlying functions.

Sure. But KernelDensity already depends on StatsBase so that can perfectly be done now by implementing the relevant functions there.

Overall I think the only relevant discussion is about the general organization of the stats packages. If we are certain we don't want histograms to live in Statistics at some point then it could make sense to move them to a separate package. However that's a quite limited set of features so I'm not sure that makes sense. We're not going to create one package for each page in the manual, that would be a nightmare for users and maintainers.

nalimilan avatar Feb 01 '21 08:02 nalimilan

Adding type parameters or changing the internal fields of the struct isn't considered as breaking.

Thanks!

Anyway regarding #514 it seems that @andreasnoack wanted to see a proposal first and that didn't happen yet.

Sure, I'll have to flesh that out, of course - was planning to do that in the form of a PR.

"Monster" sounds like an exaggeration. StatsBase seems to load quite quickly

Yes, I admit that may have been a bit strong. :-) Package load time is about 0.35 s on my system (on v1.6-beta1), which is certainly not horrible (but also not negligible, IMHO).

Overall I think the only relevant discussion is about the general organization of the stats packages.

I agree. I wasn't trying to argue that splitting off histograms is in any way necessary for impromevments like #513 and #514. It's just while I was finally finding some time to get on that, that I realized that a separate Histograms package might be a good idea in general. But this proposal is indeed not coupled to #513 and #514.

oschulz avatar Feb 01 '21 12:02 oschulz

@Moelf , you're also interested in histograms, what's your take on this?

oschulz avatar Feb 01 '21 18:02 oschulz

@oschulz thx for pinning me! I'm leaning towards "yes". I think a few good things can spin out of it:

There are quite a few things we want to have in a comprehensive/"SOTA" histogram pkg:

  • More functionality: error handling and propagation
  • More computing backend: GPU
  • Niche things: sampling from a histogram, sophisticated auto binning algorithms, underflow/overflow bin
  • Specialization: faster algorithms for specific kind of histogram (manual, special heuristics)

These can be part of the StatsBase but it also may feel too heavy to maintain maybe.


An alternative would be to keep some basic things in StatsBase but separate and greatly re-write / extend a new Histograms.jl package. This is how I ~always envisioned. (Abstract)Histogram as an interface if you want to call it.

(I actually have a very crude fast 1D histogram implementation w/ error that is precisely based on AbstractHistograms in StatsBase.)

I have strong interests in contributing to this, I will keep an eye on this thread.

Moelf avatar Feb 03 '21 03:02 Moelf

Thanks for your input, @Moelf !

This would actually be a good motivation for a standalone Histograms package - some of the stuff you want could be done within StatsBase, but the rest would IMHO better fit into a separate package. Maybe your "nice things" could be included too, without making it too heavy. Otherwise more special histogram types could of course go into separate packages, as you suggest.

oschulz avatar Feb 03 '21 04:02 oschulz

I dont know if its necessary to get move histograms into its own package. If it's really self contained than that makes sense. But I don't think it should be a moving target of a package.

However I wouldn't mind an overhaul for how histograms support missings. See here.

pdeffebach avatar Feb 17 '21 00:02 pdeffebach

But I don't think it should be a moving target of a package.

Oh, no. But histograms are really pretty much self-contained in StatsBase.

However I wouldn't mind an overhaul for how histograms support missings. See here.

Good point!

oschulz avatar Feb 17 '21 03:02 oschulz