ndarray-stats
Roadmap
In terms of functionality, the mid-term goal is to achieve feature parity with the statistics routines in NumPy and Julia's StatsBase.
For the next version:
- Order statistics:
  - [ ] partialord version for quantiles methods;
- Histograms:
  - [ ] merge method;
For version 0.2.0:
- Order statistics:
- Summary statistics:
- [x] harmonic mean (#20);
- [x] geometric mean (#20);
- [x] higher order central moments (#23);
- [x] standardized moments (they include kurtosis and skewness) (#23);
- Histograms:
- [x] Fix error handling (Issue: https://github.com/jturner314/ndarray-stats/issues/16 - PR: #25 )
- Entropy:
- [x] Feature parity with StatsBase.jl (#24)
For version 0.1.0:
- [x] max/nanmax (@jturner314)
- [x] min/nanmin (@jturner314)
- [x] quantile/nanquantile (it includes percentile/nanpercentile as a special case) (@LukeMathWalker & @jturner314)
- [x] correlation-methods:
  - [x] cov (@LukeMathWalker) - ~One last fix to be made (#3)~ [On hold for now]
  - [x] corrcoef (@LukeMathWalker - #5)
- [x] histogram-methods (@LukeMathWalker - #9)
With respect to mean, average, std: var is implemented in the main ndarray crate - would it make sense to port it here?
I think it makes sense for ndarray-stats to provide *_skipnan variants (or whatever you want to call them) of those methods. However, it would make sense to add std_axis to ndarray since ndarray already has var_axis.
For methods that are already in ndarray, we could duplicate these methods as a trait in ndarray-stats for people who want to write generic code (where the implementations just call the instance methods). I'm ambivalent on this.
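A minimal sketch of that "duplicate as a trait" idea, using only the standard library. `Array1`, `MeanExt` and `mean_or_zero` are hypothetical names made up for illustration; they are not the ndarray or ndarray-stats API.

```rust
/// Hypothetical stand-in for an array type that already has an
/// inherent `mean` method (as ndarray's arrays do for some statistics).
pub struct Array1(pub Vec<f64>);

impl Array1 {
    /// Inherent method: arithmetic mean, `None` for an empty array.
    pub fn mean(&self) -> Option<f64> {
        if self.0.is_empty() {
            None
        } else {
            Some(self.0.iter().sum::<f64>() / self.0.len() as f64)
        }
    }
}

/// Hypothetical trait that mirrors the inherent method so that
/// generic code can be written against it.
pub trait MeanExt {
    fn mean(&self) -> Option<f64>;
}

impl MeanExt for Array1 {
    fn mean(&self) -> Option<f64> {
        // `Array1::mean` resolves to the inherent method, which takes
        // priority over the trait method, so this simply delegates.
        Array1::mean(self)
    }
}

/// Generic code that only knows about the trait.
pub fn mean_or_zero<T: MeanExt>(x: &T) -> f64 {
    x.mean().unwrap_or(0.0)
}
```

The trait adds no new behaviour; its only purpose is to let callers write functions such as `mean_or_zero` that accept any type providing the method.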
I'll slowly start working on this next week and then I'll get serious the week afterwards. Could you please give me commit/PR permissions to the repository @jturner314?
Okay, that sounds good. I've given you push access. Alternatively, if you'd like to have your repo be the main one instead of this one, that would be fine with me.
Once #9 gets merged I think we are in a good position to officially release version 0.1.0 on crates.io - what do you think? @jturner314
I agree.
By the way, I recently came across Julia's StatsBase.jl library. It's a good source of ideas in addition to NumPy/SciPy.
Added a bunch of tests to #9 and merged 🎉 It feels like ages since I started to work on it :sweat_smile: Your contribution was extremely helpful to get it in the shape it is right now, thanks a lot @jturner314!
What do we need to do in order to release on crates.io? I am going to open a small PR to add crate-level documentation - a couple of lines, nothing major.
Yay! :tada: That was a big job; great work.
> What do we need to do in order to release on crates.io?
Ideally, we'd eliminate the [patch.crates-io] section from the Cargo.toml before we can release on crates.io. (This might even be required, I'm not sure.) #11 removed the patch for noisy_float, but a new version of ndarray will need to be released for us to remove its patch. It would be nice to merge a couple more ndarray PRs before release; I'll take a look.
It would also be good to merge #12 and #13 before releasing.
Merged #12 and #13 - looking around it seems we can publish with [patch.crates-io] section in Cargo.toml, but I agree it is much nicer to point to ndarray 0.12.1 as a dependency instead of a revision on master.
Let's wait for that release and then we are good to go.
ndarray-stats 0.1.0 is now on crates.io. :tada: Thanks for all your hard work @LukeMathWalker!
💯 💯 I think it's safe to say it would have never got there without your help 😛 I'll drop a post on r/rust as well 👍
I have drafted a tentative roadmap with the features I'd like to add in the next release - please edit it with your comments and suggestions @jturner314
The roadmap looks good to me. I'm not familiar with the applications of higher order central moments (I'd usually use a histogram instead), but I don't mind adding them if people find them useful.
By the way, I invited you as an owner for the ndarray-stats crate, but I just realized that crates.io may not have sent the invitation if you haven't logged in before. Please let me know if you need me to re-send it.
Somehow I didn't receive an email notification, but the invite was on my dashboard - accepted it!
The main objective in that area is getting kurtosis and skewness, and given the kind of computation required to achieve that it makes sense to also roll out higher order central moments I'd say :)
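To illustrate why the two come together, here is a rough sketch of skewness and kurtosis built on a shared central-moment helper. It works on plain `f64` slices rather than ndarray arrays, uses the naive two-pass formula, and all function names are assumptions for illustration, not the crate's API.

```rust
/// Central moment of the given order: mean of (x - mean)^order.
/// Returns `None` for an empty slice.
pub fn central_moment(data: &[f64], order: i32) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    Some(data.iter().map(|x| (x - mean).powi(order)).sum::<f64>() / n)
}

/// Skewness as the third standardized moment: m3 / m2^(3/2).
/// The result is NaN when the variance is zero.
pub fn skewness(data: &[f64]) -> Option<f64> {
    let m2 = central_moment(data, 2)?;
    let m3 = central_moment(data, 3)?;
    Some(m3 / m2.powf(1.5))
}

/// Kurtosis as the fourth standardized moment: m4 / m2^2
/// (not the "excess" kurtosis, which subtracts 3).
pub fn kurtosis(data: &[f64]) -> Option<f64> {
    let m2 = central_moment(data, 2)?;
    let m4 = central_moment(data, 4)?;
    Some(m4 / (m2 * m2))
}
```

Once `central_moment` exists, both statistics are a couple of lines each, which is the reason for exposing the general moments as well.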
Hey mate, argmin / argmax look simple enough to look into, do you have any suggestions on where to start?
Thanks for your interest! You'll want to add argmin and argmax methods to the QuantileExt trait and implement them. Please include documentation for the methods and some tests (in tests/quantile.rs).
I'd suggest starting with the existing implementation for min as a basis, but using .indexed_iter().fold() or .indexed_iter().try_fold() instead of .fold().
It would also be good to add argmin_skipnan and argmax_skipnan methods (analogous to min_skipnan and max_skipnan), but that's not necessary for the first PR.
Please feel free to ask if you have any questions.
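As a rough illustration of the indexed fold suggested above, here is a sketch on a plain slice, where `.iter().enumerate()` stands in for ndarray's `.indexed_iter()`. The free function `argmin` and its signature are assumptions for illustration, not the method that ended up in QuantileExt.

```rust
use std::cmp::Ordering;

/// Index of the smallest element. Returns `None` if the slice is empty
/// or if any comparison is undefined (e.g. a NaN is involved).
pub fn argmin<A: PartialOrd>(data: &[A]) -> Option<usize> {
    let mut iter = data.iter().enumerate();
    // Seed the fold with the first (index, element) pair.
    let first = iter.next()?;
    iter.try_fold(first, |(best_i, best), (i, x)| {
        // `?` aborts the whole fold as soon as a comparison fails.
        match x.partial_cmp(best)? {
            Ordering::Less => Some((i, x)),
            // On ties, keep the earliest index.
            _ => Some((best_i, best)),
        }
    })
    .map(|(i, _)| i)
}
```

Using `try_fold` rather than `fold` lets the search stop at the first undefined comparison instead of carrying an error flag through the rest of the data.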
Hey mates, I have added argmin_skipnan and argmax_skipnan. I wonder why you use PartialOrd for min, but Ord for min_skipnan?
And what is meant by this: partialord version for quantiles?
It's because we require the data type to be MaybeNan: it basically means that, apart from a subset of elements (e.g. NaN for floats), we are dealing with a data type that is totally ordered (all pairs of elements can be compared, Ord).
This reduces the failure scope:
- min can return None if a comparison fails (as can happen with PartialOrd) or if there is no element in the array.
- min_skipnan returns None if and only if the array has no non-NaN element (because no comparison will be undefined).
This can be useful when you are dealing with floats or arrays with potentially missing values (e.g. Option<A>, where A: Ord).
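The difference in failure modes can be sketched on `f64` slices as follows. This is only an illustration of the semantics described in this comment; it does not use the crate's MaybeNan trait, and the signatures are assumptions.

```rust
use std::cmp::Ordering;

/// Minimum under a partial order: `None` if the slice is empty
/// or if a comparison is undefined (a NaN is compared).
pub fn min(data: &[f64]) -> Option<f64> {
    let mut iter = data.iter().copied();
    let first = iter.next()?;
    iter.try_fold(first, |acc, x| match x.partial_cmp(&acc)? {
        Ordering::Less => Some(x),
        _ => Some(acc),
    })
}

/// Minimum ignoring NaNs: `None` only if no non-NaN element exists,
/// because the remaining values are totally ordered.
pub fn min_skipnan(data: &[f64]) -> Option<f64> {
    data.iter()
        .copied()
        .filter(|x| !x.is_nan())
        .fold(None, |acc, x| match acc {
            Some(m) if m <= x => Some(m),
            _ => Some(x),
        })
}
```

With the NaNs filtered out up front, `min_skipnan` has a single failure case (nothing left to compare), whereas `min` has two.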
Re: quantiles - the current implementation requires A to implement Ord. We'd like to relax it to allow A to be PartialOrd instead of Ord.
Thanks @LukeMathWalker for the last point, if we change A: Ord to A: PartialOrd and refactor the code + test to allow that change, it would complete the task right?
Exactly! @phungleson I'd suggest you wait until #26 is merged before tackling this task, otherwise you are in for some nasty merge conflicts :stuck_out_tongue: I am almost there, I am just investigating some stack overflow errors in the revised version I have been writing.
Cool thanks @LukeMathWalker so seems like everything is more or less complete? Let me know if there are any doable features, cheers.
BTW the merge method seems to be straightforward, but do you have any thoughts yet about the implementation?
For merge I read quickly, so basically just adding the weights?
for h in others
    target.weights .+= h.weights
end
Yes @phungleson, it basically boils down to summing together the weight matrices (plus or minus checking that their dimensions/bins are compatible, I haven't looked into it). If you want to give it a try, please go ahead!
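A one-dimensional sketch of that idea in Rust, with the compatibility check done up front. The `Histogram` struct here (explicit bin edges plus a count per bin) is a hypothetical simplification for illustration and does not match the crate's actual histogram types.

```rust
/// Hypothetical 1-D histogram: `edges` has one more entry than `counts`.
#[derive(Debug, Clone, PartialEq)]
pub struct Histogram {
    pub edges: Vec<f64>,
    pub counts: Vec<u64>,
}

impl Histogram {
    /// Adds the counts of every histogram in `others` to `self`.
    /// Fails, leaving `self` untouched, if any of them uses different bins.
    pub fn merge(&mut self, others: &[Histogram]) -> Result<(), String> {
        // Validate everything first so a failure cannot leave a partial merge.
        for (i, h) in others.iter().enumerate() {
            if h.edges != self.edges || h.counts.len() != self.counts.len() {
                return Err(format!("histogram {} has incompatible bins", i));
            }
        }
        // Bins match, so merging is an element-wise sum of the counts.
        for h in others {
            for (acc, c) in self.counts.iter_mut().zip(&h.counts) {
                *acc += *c;
            }
        }
        Ok(())
    }
}
```

Checking all inputs before mutating is one possible design choice; an alternative would be to return a new histogram and leave every input unchanged.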
I'd like to close existing work streams and cut a release - what does your bandwidth look like @jturner314 to review open PRs?
I've been meaning to look over the open PRs but haven't had a chance. I'll reserve time on Sunday to review them.
It seems I managed to publish 0.2.0 without making a mess :muscle:
Thanks @jturner314 @phungleson and @munckymagik for all the work done on this release :heart:
I'd say we have done a major leap forward in terms of features - there are things that can be polished, the API design can be further improved and we can optimize the existing code, but ndarray-stats is definitely a viable solution right now :rocket:
I'll clean up the parent post to move items that we didn't manage to include in this release to the roadmap for the next one. I am not sure what we should be covering next in terms of major new functionality :thinking:
Well done all 👏
Great job on 0.2.0 everyone!
> I am not sure what we should be covering next in terms of major new functionality
A couple of ideas from StatsBase.jl:
- Deviation functions
- Weighted calculations (mean/std/etc.)
We could also add statistical models (e.g. linear regression), but that might be best put in a separate crate.
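For a sense of scale, a weighted mean could look roughly like the sketch below. It operates on plain slices, and the function name, signature and error handling are assumptions for illustration rather than a proposed API.

```rust
/// Weighted mean: sum(w_i * x_i) / sum(w_i).
/// Returns `None` if the lengths differ or the weights sum to zero
/// (which also covers empty input).
pub fn weighted_mean(data: &[f64], weights: &[f64]) -> Option<f64> {
    if data.len() != weights.len() {
        return None;
    }
    let total: f64 = weights.iter().sum();
    if total == 0.0 {
        return None;
    }
    let weighted_sum: f64 = data.iter().zip(weights).map(|(x, w)| x * w).sum();
    Some(weighted_sum / total)
}
```

A weighted standard deviation would follow the same pattern, with the extra question of which normalization to use.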
Well done! cheers!
> A couple of ideas from StatsBase.jl:
> - Deviation functions
> - Weighted calculations (mean/std/etc.)
Unless any of you have made a start on these, I'd be interested in having a go at either, or contributing. I'll try to spend some time in the next couple of days looking at what is involved with the Deviation functions.
❓ Does anyone have any implementation suggestions other than just trying to port from StatsBase.jl?
If anyone wants to collaborate on the code then let me know.
Ok I made a start: https://github.com/jturner314/ndarray-stats/pull/41
Any advice for choosing traits bounds for the A element types? Is it ok to use Copy or do we need to support any types that would be Clone?
I'd say use Clone @munckymagik
@LukeMathWalker thanks. What led you to that decision? Is there a particular data type you've seen used in ndarrays that would need this? If so I'm thinking I might use it in the test fixtures to make sure all methods have the same bounds.