Roadmap

Open LukeMathWalker opened this issue 4 years ago • 81 comments

In terms of functionality, the mid-term end goal is to achieve an offering of ML algorithms and pre-processing routines comparable to what is currently available in Python's scikit-learn.

These algorithms can either be:

  • re-implemented in Rust;
  • re-exported from an existing Rust crate, if available on crates.io with a compatible interface.

In no particular order, focusing on the main gaps:

  • Clustering:

    • [x] DBSCAN
    • [x] Spectral clustering;
    • [x] Hierarchical clustering;
    • [x] OPTICS.
  • Preprocessing:

    • [x] PCA
    • [x] ICA
    • [x] Normalisation
    • [x] CountVectoriser
    • [x] TFIDF
    • [x] t-SNE
  • Supervised Learning:

    • [x] Linear regression;
    • [x] Ridge regression;
    • [x] LASSO;
    • [x] ElasticNet;
    • [x] Support vector machines;
    • [x] Nearest Neighbours;
    • [ ] Gaussian processes; (integrating friedrich - tracking issue https://github.com/nestordemeure/friedrich/issues/1)
    • [x] Decision trees;
    • [ ] Random Forest
    • [x] Naive Bayes
    • [x] Logistic Regression
    • [ ] Ensemble Learning
    • [ ] Least Angle Regression
    • [x] PLS

The collection is deliberately loose and non-exhaustive, and it will evolve over time. If there is an ML algorithm that you find yourself using often in your day-to-day work, please feel free to contribute it :100:

LukeMathWalker avatar Dec 01 '19 23:12 LukeMathWalker

Hi, I'm eager to help. I'll take linear regression, LASSO and ridge regression.

Nimpruda avatar Dec 02 '19 12:12 Nimpruda

Cool! I worked a bit on linear regression a while ago - you can find a very vanilla implementation of it here: https://github.com/rust-ndarray/ndarray-examples/tree/master/linear_regression @Nimpruda

LukeMathWalker avatar Dec 02 '19 12:12 LukeMathWalker

What does Normalization mean? Is it like sklearn's StandardScaler or something else?

InCogNiTo124 avatar Dec 02 '19 13:12 InCogNiTo124

Exactly @InCogNiTo124.
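
For readers unfamiliar with StandardScaler, the idea is to shift each feature column to zero mean and rescale it to unit variance. The block below is only a minimal, dependency-free sketch of that idea: the `StandardScaler` struct and its `fit`/`transform` methods are hypothetical names chosen for illustration, not linfa's API, and it assumes the population standard deviation with constant columns left unscaled.

```rust
/// Per-column statistics learned from a dataset.
/// Hypothetical type for illustration only, not part of linfa.
struct StandardScaler {
    means: Vec<f64>,
    stds: Vec<f64>,
}

impl StandardScaler {
    /// Learn the mean and population standard deviation of every column.
    /// Each inner `Vec` in `rows` is one sample.
    fn fit(rows: &[Vec<f64>]) -> Self {
        let n = rows.len() as f64;
        let d = rows.first().map_or(0, |r| r.len());

        let mut means = vec![0.0; d];
        for row in rows {
            for (m, x) in means.iter_mut().zip(row) {
                *m += x / n;
            }
        }

        // Accumulate the variance first, then take the square root.
        let mut stds = vec![0.0; d];
        for row in rows {
            for ((s, m), x) in stds.iter_mut().zip(&means).zip(row) {
                *s += (x - m).powi(2) / n;
            }
        }
        for s in stds.iter_mut() {
            // Assumption: constant columns use 1.0 to avoid dividing by zero.
            *s = if *s > 0.0 { s.sqrt() } else { 1.0 };
        }

        StandardScaler { means, stds }
    }

    /// Apply `(x - mean) / std` to every entry, column by column.
    fn transform(&self, rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
        rows.iter()
            .map(|row| {
                row.iter()
                    .zip(self.means.iter().zip(&self.stds))
                    .map(|(x, (m, s))| (x - m) / s)
                    .collect()
            })
            .collect()
    }
}
```

Keeping `fit` and `transform` separate means the statistics learned on training data can be reused unchanged on held-out data.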

LukeMathWalker avatar Dec 02 '19 15:12 LukeMathWalker

This is an interesting project and I will work on the PCA implementation

ADMoreau avatar Dec 02 '19 17:12 ADMoreau

I am the author of the friedrich crate which implements Gaussian Processes.

While it is still a work in progress, it is fully featured and I would be happy to help integrate it into the project if you have directions to do so.

nestordemeure avatar Dec 02 '19 20:12 nestordemeure

That would be awesome @nestordemeure - I'll have a look at the project and I'll get back to you! Should I open an issue on friedrich's repository when I am ready? Or would you prefer it to be tracked here on the linfa repository?

LukeMathWalker avatar Dec 02 '19 20:12 LukeMathWalker

Both are ok with me.

An issue in friedrich's repository might help avoid overcrowding linfa with issues, but do as you prefer.

nestordemeure avatar Dec 02 '19 20:12 nestordemeure

I'd love to take the Nearest Neighbors implementation

mstallmo avatar Dec 02 '19 20:12 mstallmo

I think this is really great. I just started on an sklearn-like implementation of their pipelines here, though it is more or less for experimentation and nothing serious. I'll be sure to keep my eye on issues/goals here and help out where I can. Thanks for the initiative! :clap:

milesgranger avatar Dec 03 '19 07:12 milesgranger

Hi there! First off, I don't have any experience in ML, but I read a lot about it (and listen to way too many podcasts on the topic). I'm interested in jumping in. I have quite some experience developing in Rust, and specifically high fidelity simulation tools (cf nyx and hifitime).

I wrote an Ant Colony Optimizer in Rust. ACOs are great for traversing graphs which represent a solution space, a problem which is considered NP hard if I'm not mistaken. Is that something used at all in ML? If so, would it be of interest to this library, or is there a greater interest (for now) to focus on the problems listed in the first post?

Cheers

ChristopherRabotin avatar Dec 03 '19 22:12 ChristopherRabotin

Hi @ChristopherRabotin, I've never heard of ACOs, but as they relate to graphs you should check whether they have any uses with Markov Chains.

Nimpruda avatar Dec 04 '19 10:12 Nimpruda

So far, I haven't found how the two can be used together. The closest I found was several papers that use Markov Chains to analyze ACOs.

ChristopherRabotin avatar Dec 04 '19 18:12 ChristopherRabotin

I would like to take the Naive Bayes one.

onehr avatar Dec 06 '19 22:12 onehr

I'll take on Gaussian Processes.

tyfarnan avatar Dec 08 '19 03:12 tyfarnan

I'll put some work towards the text tokenization algorithms (CountVectorizer and TFIDF). I'm also extremely interested in a good SVM implementation in Rust. Whoever is working on that, let me know if you'd like some help or anything.

bplevin36 avatar Dec 08 '19 12:12 bplevin36

Please take a look at what is already out there before diving head down into a reimplementation @tyfarnan - I haven't had the time to look at friedrich by @nestordemeure yet (taking a break after the final push to release the blog post and related code 😅) but we should definitely start from there as well as the GP sub-module in rusty-machine.

LukeMathWalker avatar Dec 08 '19 18:12 LukeMathWalker

@tyfarnan, don't hesitate to contact me via an issue on friedrich's repository once @LukeMathWalker has spelled out what is expected of code that is integrated into Linfa and how this integration will be done.

nestordemeure avatar Dec 08 '19 19:12 nestordemeure

I did a quick round up of crates that implement the algorithms listed on the roadmap. Probably missed quite a few too but this can be a good starting point.

It was just a quick search, so I don't know how relevant each crate is, but I tried to make a note if the crate was old and unmaintained. Hopefully this can be useful for helping with algorithm design or saving us from having to reimplement something that is already there.

Algo ecosystem gist

DallasC avatar Dec 11 '19 05:12 DallasC

Tracking friedrich<>linfa integration here: https://github.com/nestordemeure/friedrich/issues/1

LukeMathWalker avatar Dec 12 '19 08:12 LukeMathWalker

I have updated the Issue to make sure it's immediately clear who is working on what and what items are still looking for an owner 👍

LukeMathWalker avatar Dec 15 '19 18:12 LukeMathWalker

hey @LukeMathWalker could you add me next to the normalization? I plan to do it by New Year's as I'm still not very experienced with Rust, but I have an idea how to implement it

InCogNiTo124 avatar Dec 18 '19 20:12 InCogNiTo124

Done @InCogNiTo124 :pray:

LukeMathWalker avatar Dec 19 '19 21:12 LukeMathWalker

Started implementing DBSCAN in #12.

Also, if suggestions are welcome, Gaussian Mixture Models would be cool.

xd009642 avatar Dec 24 '19 15:12 xd009642

Implementation of DBSCAN merged to master - thanks @xd009642 :pray:

LukeMathWalker avatar Dec 27 '19 09:12 LukeMathWalker

Hi, really cool project! I have a question concerning the scope: do you eventually want to have deep learning and reinforcement learning algorithms too? I guess I'm curious to know whether adding them is the plan eventually and you just want to start with the easier stuff, or whether you think along the lines of the scikit-learn devs themselves: here.

Either way, I'll be glad to help spread the Rust gospel. Right now I'm going through the Reinforcement Learning book, and I will implement some of the algorithms; if that's in the scope of linfa, I'll be glad to try adding them to it. If not, I plan to read through Understanding Machine Learning afterwards, and thus will eventually reach some of the algorithms in the roadmap. Then I will help by implementing them. :)

adamShimi avatar Jan 09 '20 10:01 adamShimi

From previous discussions, deep learning etc. is out of scope for the same reasons as it is for scikit-learn. @LukeMathWalker might have more to say about it or reinforcement learning :smile:

xd009642 avatar Jan 09 '20 10:01 xd009642

Ok, thanks.

adamShimi avatar Jan 10 '20 12:01 adamShimi

I would consider both of them to be out of scope for this project - it's already incredibly broad as it is right now :sweat_smile: I'd love to see something spawn up for reinforcement learning, especially gym environments!

LukeMathWalker avatar Jan 10 '20 19:01 LukeMathWalker

Can you also include Non-Negative Matrix Factorization (NMF) in the list of pre-processing steps? It's a standard algorithm in NLP/audio enhancement and decomposes a matrix into the product of two non-negative matrices. (https://en.wikipedia.org/wiki/Non-negative_matrix_factorization)

One of the nice properties is that there is a simple incremental algorithm for solving the problem, with a simple modification for sparsity constraints.
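
As a rough illustration of such an incremental scheme, here is a minimal sketch that assumes the classic multiplicative update rules for the Frobenius-norm objective (the comment above does not name a specific algorithm, so this choice is an assumption). All names (`Matrix`, `nmf`, `multiplicative_step`, `init`) are hypothetical, the initialisation is a deterministic placeholder used only to keep the example dependency-free, and sparsity constraints are left out.

```rust
type Matrix = Vec<Vec<f64>>;

fn matmul(a: &Matrix, b: &Matrix) -> Matrix {
    let (n, k, m) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0; m]; n];
    for i in 0..n {
        for p in 0..k {
            for j in 0..m {
                out[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    out
}

fn transpose(a: &Matrix) -> Matrix {
    let (n, m) = (a.len(), a[0].len());
    (0..m).map(|j| (0..n).map(|i| a[i][j]).collect()).collect()
}

/// Multiply every entry of `target` by `numer / (denom + EPS)`.
/// Non-negative inputs therefore stay non-negative.
fn multiplicative_step(target: &mut Matrix, numer: &Matrix, denom: &Matrix) {
    const EPS: f64 = 1e-12; // guards against division by zero
    for (t_row, (n_row, d_row)) in target.iter_mut().zip(numer.iter().zip(denom)) {
        for (t, (n, d)) in t_row.iter_mut().zip(n_row.iter().zip(d_row)) {
            *t *= n / (d + EPS);
        }
    }
}

/// Deterministic strictly positive start values (placeholder for random init).
fn init(rows: usize, cols: usize) -> Matrix {
    (0..rows)
        .map(|i| (0..cols).map(|j| 0.5 + ((i * cols + j) % 7) as f64 / 10.0).collect())
        .collect()
}

/// Factorise a non-negative `v` (m x n) into `w` (m x rank) and `h` (rank x n).
fn nmf(v: &Matrix, rank: usize, iterations: usize) -> (Matrix, Matrix) {
    let mut w = init(v.len(), rank);
    let mut h = init(rank, v[0].len());
    for _ in 0..iterations {
        // H <- H * (W^T V) / (W^T W H), element-wise
        let wt = transpose(&w);
        let numer = matmul(&wt, v);
        let denom = matmul(&matmul(&wt, &w), &h);
        multiplicative_step(&mut h, &numer, &denom);

        // W <- W * (V H^T) / (W H H^T), element-wise
        let ht = transpose(&h);
        let numer = matmul(v, &ht);
        let denom = matmul(&matmul(&w, &h), &ht);
        multiplicative_step(&mut w, &numer, &denom);
    }
    (w, h)
}

/// Squared Frobenius norm of `v - w * h`.
fn reconstruction_error(v: &Matrix, w: &Matrix, h: &Matrix) -> f64 {
    let wh = matmul(w, h);
    v.iter()
        .zip(&wh)
        .flat_map(|(a, b)| a.iter().zip(b).map(|(x, y)| (x - y).powi(2)))
        .sum()
}
```

Calling `nmf(&v, rank, iterations)` on a non-negative matrix returns the two factors, and `reconstruction_error` gives a simple way to monitor how closely their product matches the input.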

bytesnake avatar Mar 05 '20 09:03 bytesnake