ndarray-stats Roadmap

In terms of functionality, the mid-term goal is to achieve feature parity with the statistics routines in NumPy and Julia's StatsBase.

For the next version:

Order statistics:
- [ ] partialord version for quantiles methods;
Histograms:
- [ ] merge method;

For version 0.2.0:

Order statistics:
- [x] optimized computation of multiple quantiles when requested all at once (#26);
- [x] argmin / argmax (#30);
Summary statistics:
- [x] harmonic mean (#20);
- [x] geometric mean (#20);
- [x] higher order central moments (#23);
- [x] standardized moments (they include kurtosis and skewness) (#23);
Histograms:
- [x] Fix error handling (Issue: https://github.com/jturner314/ndarray-stats/issues/16 - PR: #25)
Entropy:
- [x] Feature parity with StatsBase.jl (#24)

For version 0.1.0:

- [x] max / nanmax (@jturner314)
- [x] min / nanmin (@jturner314)
- [x] quantile / nanquantile (it includes percentile / nanpercentile as a special case) (@LukeMathWalker & @jturner314)
- [x] correlation-methods:
  - [x] cov (@LukeMathWalker) - ~One last fix to be made (#3)~ [On hold for now]
  - [x] corrcoef (@LukeMathWalker - #5)
- [x] histogram-methods (@LukeMathWalker - #9)

Sep 16 '18 15:09 LukeMathWalker

> With respect to mean, average, std: var is implemented in the main ndarray crate - would it make sense to port it here?

I think it makes sense for ndarray-stats to provide *_skipnan variants (or whatever you want to call them) of those methods. However, it would make sense to add std_axis to ndarray since ndarray already has var_axis.

For methods that are already in ndarray, we could duplicate these methods as a trait in ndarray-stats for people who want to write generic code (where the implementations just call the instance methods). I'm ambivalent on this.

> I'll slowly start working on this next week and then I'll get serious the week afterwards. Could you please give me commit/PR permissions to the repository @jturner314?

Okay, that sounds good. I've given you push access. Alternatively, if you'd like to have your repo be the main one instead of this one, that would be fine with me.

Sep 16 '18 21:09 jturner314

Once #9 gets merged I think we are in a good position to officially release version 0.1.0 on crates.io - what do you think? @jturner314

Nov 08 '18 08:11 LukeMathWalker

I agree.

By the way, I recently came across Julia's StatsBase.jl library. It's a good source of ideas in addition to NumPy/SciPy.

Nov 11 '18 22:11 jturner314

Added a bunch of tests to #9 and merged 🎉 It feels like ages since I started to work on it :sweat_smile: Your contribution was extremely helpful to get it in the shape it is right now, thanks a lot @jturner314!

What do we need to do in order to release on crates.io? I am going to open a small PR to add crate-level documentation - a couple of lines, nothing major.

Nov 18 '18 16:11 LukeMathWalker

Yay! :tada: That was a big job; great work.

> What do we need to do in order to release on crates.io?

Ideally, we'd eliminate the [patch.crates-io] section from the Cargo.toml before we can release on crates.io. (This might even be required, I'm not sure.) #11 removed the patch for noisy_float, but a new version of ndarray will need to be released for us to remove its patch. It would be nice to merge a couple more ndarray PRs before release; I'll take a look.

It would also be good to merge #12 and #13 before releasing.

Nov 18 '18 21:11 jturner314

Merged #12 and #13 - looking around it seems we can publish with [patch.crates-io] section in Cargo.toml, but I agree it is much nicer to point to ndarray 0.12.1 as a dependency instead of a revision on master.

Let's wait for that release and then we are good to go.

Nov 19 '18 08:11 LukeMathWalker

ndarray-stats 0.1.0 is now on crates.io. :tada: Thanks for all your hard work @LukeMathWalker!

Nov 21 '18 21:11 jturner314

💯 💯 I think it's safe to say it would have never got there without your help 😛 I'll drop a post on r/rust as well 👍

Nov 21 '18 22:11 LukeMathWalker

I have drafted a tentative roadmap with the features I'd like to add in the next release - please edit it with your comments and suggestions @jturner314

Nov 24 '18 17:11 LukeMathWalker

The roadmap looks good to me. I'm not familiar with the applications of higher order central moments (I'd usually use a histogram instead), but I don't mind adding them if people find them useful.

By the way, I invited you as an owner for the ndarray-stats crate, but I just realized that crates.io may not have sent the invitation if you haven't logged in before. Please let me know if you need me to re-send it.

Nov 27 '18 03:11 jturner314

Somehow I didn't receive an email notification, but the invite was on my dashboard - accepted it!

The main objective in that area is getting kurtosis and skewness, and given the kind of computation required to achieve that it makes sense to also roll out higher order central moments I'd say :)
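To illustrate why skewness and kurtosis come almost for free once central moments exist, here is a minimal sketch over plain `f64` slices. The function names are hypothetical and are not the crate's API; the kurtosis shown is the non-excess variant (a normal distribution gives 3).

```rust
/// p-th central moment: the mean of (x - mean)^p.
/// Returns `None` for an empty slice.
fn central_moment(data: &[f64], order: i32) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    Some(data.iter().map(|x| (x - mean).powi(order)).sum::<f64>() / n)
}

/// Third standardized moment: m3 / m2^(3/2).
/// The result is NaN when the variance is zero.
fn skewness(data: &[f64]) -> Option<f64> {
    let m2 = central_moment(data, 2)?;
    let m3 = central_moment(data, 3)?;
    Some(m3 / m2.powf(1.5))
}

/// Fourth standardized moment: m4 / m2^2 (non-excess kurtosis).
/// The result is NaN when the variance is zero.
fn kurtosis(data: &[f64]) -> Option<f64> {
    let m2 = central_moment(data, 2)?;
    let m4 = central_moment(data, 4)?;
    Some(m4 / (m2 * m2))
}
```

Both statistics reuse the same `central_moment` helper, which is the reason for exposing higher order moments alongside them.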

Nov 29 '18 09:11 LukeMathWalker

Hey mate, argmin / argmax looks simple enough to look into, do you have any suggestions of where to start?

Mar 09 '19 00:03 phungleson

Thanks for your interest! You'll want to add argmin and argmax methods to the QuantileExt trait and implement them. Please include documentation for the methods and some tests (in tests/quantile.rs).

I'd suggest starting with the existing implementation for min as a basis, but using .indexed_iter().fold() or .indexed_iter().try_fold() instead of .fold().

It would also be good to add argmin_skipnan and argmax_skipnan methods (analogous to min_skipnan and max_skipnan), but that's not necessary for the first PR.
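A minimal sketch of the fold-based approach, not the crate's actual implementation. It assumes a plain slice, with `.iter().enumerate()` standing in for ndarray's `.indexed_iter()`, and the free function `argmin` is a hypothetical stand-in for the trait method.

```rust
use std::cmp::Ordering;

/// Index of the first minimum element.
/// Returns `None` if the slice is empty or if any comparison is
/// undefined (for example when a NaN is involved).
fn argmin<A: PartialOrd>(data: &[A]) -> Option<usize> {
    let mut iter = data.iter().enumerate();
    // Seed the fold with the first (index, element) pair.
    let first = iter.next()?;
    iter.try_fold(first, |(best_i, best), (i, x)| {
        // `?` aborts the whole fold as soon as a comparison fails.
        match x.partial_cmp(best)? {
            Ordering::Less => Some((i, x)),
            _ => Some((best_i, best)),
        }
    })
    .map(|(i, _)| i)
}
```

Using `try_fold` rather than `fold` lets the function stop at the first undefined comparison instead of scanning the rest of the data.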

Please feel free to ask if you have any questions.

Mar 09 '19 01:03 jturner314

Hey mates, I have added argmin_skipnan and argmax_skipnan. I wonder why you use PartialOrd for min, but Ord for min_skipnan?

And what is meant by this? partialord version for quantiles

Mar 11 '19 23:03 phungleson

It's because we require the data type to be MaybeNan: it basically means that, apart from a subset of elements (e.g. NaN for floats), we are dealing with a data type that is totally ordered (all pairs of elements can be compared, Ord).

This reduces the failure scope:

- min can return None if a comparison fails (as can happen with PartialOrd) or if there is no element in the array.
- min_skipnan returns None if and only if the array has no non-NaN element (because no comparison will be undefined).

This can be useful when you are dealing with floats or arrays with potentially missing values (e.g. Option<A>, where A: Ord).
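The two failure modes can be sketched on plain slices as follows. This is an illustration under simplifying assumptions: the names `min_partial` and `min_skipnan_f64` are hypothetical, and the second function is specialized to `f64` instead of going through the MaybeNan trait.

```rust
use std::cmp::Ordering;

/// Minimum under `PartialOrd`.
/// `None` if the slice is empty or any comparison is undefined.
fn min_partial<A: PartialOrd>(data: &[A]) -> Option<&A> {
    let mut iter = data.iter();
    let first = iter.next()?;
    iter.try_fold(first, |acc, x| match x.partial_cmp(acc)? {
        Ordering::Less => Some(x),
        _ => Some(acc),
    })
}

/// Minimum that ignores NaN values.
/// `None` only if there is no non-NaN element, because the remaining
/// values are totally ordered and every comparison is defined.
fn min_skipnan_f64(data: &[f64]) -> Option<f64> {
    data.iter()
        .copied()
        .filter(|x| !x.is_nan())
        .fold(None, |acc, x| match acc {
            Some(m) if m <= x => Some(m),
            _ => Some(x),
        })
}
```

With the input `[2.0, NaN, 1.0]`, the first function gives up while the second one still finds `1.0`.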

Re: quantiles - the current implementation requires A to implement Ord. We'd like to relax it to allow A to be PartialOrd instead of Ord.

Mar 12 '19 10:03 LukeMathWalker

Thanks @LukeMathWalker for the last point, if we change A: Ord to A: PartialOrd and refactor the code + test to allow that change, it would complete the task right?

Mar 16 '19 01:03 phungleson

Exactly! @phungleson I'd suggest you wait until #26 is merged before tackling this task, otherwise you are in for some nasty merge conflicts :stuck_out_tongue: I am almost there, I am just investigating some stack overflow errors in the revised version I have been writing.

Mar 16 '19 11:03 LukeMathWalker

Cool thanks @LukeMathWalker so seems like everything is more or less complete? Let me know if there are any doable features, cheers.

BTW the merge method seems to be straightforward, but do you have any thoughts yet about the implementation?

Mar 21 '19 11:03 phungleson

For merge I read quickly, so basically just adding the weights?

for h in others
  target.weights .+= h.weights
end

Mar 21 '19 20:03 phungleson

Yes @phungleson, it basically boils down to summing together the weight matrices (plus or minus checking that their dimensions/bins are compatible, I haven't looked into it). If you want to give it a try, please go ahead!
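A rough sketch of that idea, including the compatibility check. The `Histogram` struct here is a hypothetical one-dimensional stand-in with explicit bin edges, not the crate's histogram type.

```rust
/// Hypothetical 1-D histogram: `edges` has one more entry than `counts`.
#[derive(Debug, Clone, PartialEq)]
struct Histogram {
    edges: Vec<f64>,
    counts: Vec<u64>,
}

impl Histogram {
    /// Adds the counts of `other` into `self`.
    /// Fails without modifying `self` if the bins are not identical.
    fn merge(&mut self, other: &Histogram) -> Result<(), String> {
        if self.edges != other.edges || self.counts.len() != other.counts.len() {
            return Err("histograms have incompatible bins".to_string());
        }
        for (c, o) in self.counts.iter_mut().zip(&other.counts) {
            *c += *o;
        }
        Ok(())
    }
}
```

Checking the bins before touching any count keeps the target unchanged when the merge is rejected.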

Mar 21 '19 20:03 LukeMathWalker

I'd like to close existing work streams and cut a release - what does your bandwidth look like @jturner314 to review open PRs?

Mar 29 '19 14:03 LukeMathWalker

I've been meaning to look over the open PRs but haven't had a chance. I'll reserve time on Sunday to review them.

Mar 30 '19 00:03 jturner314

It seems I managed to publish 0.2.0 without making a mess :muscle: Thanks @jturner314 @phungleson and @munckymagik for all the work done on this release :heart:

I'd say we have done a major leap forward in terms of features - there are things that can be polished, the API design can be further improved and we can optimize the existing code, but ndarray-stats is definitely a viable solution right now :rocket:

I'll clean up the parent post to move items that we didn't manage to include in this release to the roadmap for the next one. I am not sure what we should be covering next in terms of major new functionality :thinking:

Apr 13 '19 10:04 LukeMathWalker

Well done all 👏

Apr 13 '19 13:04 munckymagik

Great job on 0.2.0 everyone!

> I am not sure what we should be covering next in terms of major new functionality

A couple of ideas from StatsBase.jl:

Deviation functions
Weighted calculations (mean/std/etc.)
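As a rough idea of what a weighted calculation involves, here is a minimal weighted mean over plain slices. The name and signature are assumptions for illustration and are taken neither from StatsBase.jl nor from this crate.

```rust
/// Weighted mean: sum(w_i * x_i) / sum(w_i).
/// Returns `None` if the inputs are empty, differ in length,
/// or the weights sum to zero.
fn weighted_mean(data: &[f64], weights: &[f64]) -> Option<f64> {
    if data.is_empty() || data.len() != weights.len() {
        return None;
    }
    let total: f64 = weights.iter().sum();
    if total == 0.0 {
        return None;
    }
    let acc: f64 = data.iter().zip(weights).map(|(x, w)| x * w).sum();
    Some(acc / total)
}
```

For example, `weighted_mean(&[1.0, 2.0, 3.0], &[1.0, 1.0, 2.0])` evaluates (1 + 2 + 6) / 4.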

We could also add statistical models (e.g. linear regression), but that might be best put in a separate crate.

Apr 13 '19 16:04 jturner314

Well done! cheers!

Apr 15 '19 23:04 phungleson

> A couple of ideas from StatsBase.jl:
>
> Deviation functions
>
> Weighted calculations (mean/std/etc.)

Unless any of you have made a start on these, I'd be interested in having a go at either, or contributing. I'll try to spend some time in the next couple of days looking at what is involved with the Deviation functions.
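For a sense of scope, a few deviation functions could look roughly like this on plain slices. This is a sketch with hypothetical names, loosely modeled on the squared L2 distance, mean absolute deviation and root mean squared deviation found in StatsBase.jl, and not a port of its code.

```rust
/// Squared L2 distance: sum of (a_i - b_i)^2.
/// Returns `None` if the slices differ in length.
fn sq_l2_dist(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum())
}

/// Mean absolute deviation: mean of |a_i - b_i|.
/// Returns `None` for empty or mismatched inputs.
fn mean_abs_dev(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let total: f64 = a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum();
    Some(total / a.len() as f64)
}

/// Root mean squared deviation: sqrt(mean of (a_i - b_i)^2).
/// Returns `None` for empty or mismatched inputs.
fn root_mean_sq_dev(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.is_empty() {
        return None;
    }
    Some((sq_l2_dist(a, b)? / a.len() as f64).sqrt())
}
```

Each function is a single pass over a pair of equally shaped inputs, so the main design question is how to report a shape mismatch.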

❓ Does anyone have any implementation suggestions other than just trying to port from StatsBase.jl?

If anyone wants to collaborate on the code then let me know.

Apr 17 '19 09:04 munckymagik

Ok I made a start: https://github.com/jturner314/ndarray-stats/pull/41

Any advice for choosing trait bounds for the A element types? Is it ok to use Copy or do we need to support any types that would be Clone?

Apr 17 '19 21:04 munckymagik

I'd say to use Clone @munckymagik
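To illustrate the difference, a function bounded by `Clone` also accepts element types that are not `Copy`, such as `String`. The helper below is a hypothetical example and is not taken from the crate.

```rust
/// Returns an owned copy of the largest element.
/// Only `Clone` is required, so non-`Copy` element types work too.
/// If a comparison is undefined (e.g. NaN), the current best is kept.
fn max_cloned<A: Clone + PartialOrd>(data: &[A]) -> Option<A> {
    let mut iter = data.iter();
    let first = iter.next()?;
    // Work with references throughout and clone once at the end.
    let best = iter.fold(first, |acc, x| if x > acc { x } else { acc });
    Some(best.clone())
}
```

With an `A: Copy` bound the same function would reject a `&[String]`, even though the body never needs more than one clone.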

Apr 18 '19 07:04 LukeMathWalker

@LukeMathWalker thanks. What led you to that decision? Is there a particular data type you've seen used in ndarrays that would need this? If so I'm thinking I might use it in the test fixtures to make sure all methods have the same bounds.

Apr 18 '19 16:04 munckymagik
