What do we want to build?

LukeMathWalker opened this issue 5 years ago • 108 comments

Welcome!

I created this repository as a discussion hub for the ML ecosystem in Rust, following a talk I gave at the Rust meetup in London (slides).

I do believe that Rust has great potential in this area, but to fully realize that potential we need to provide building blocks: we need to tackle those shared challenges that, once solved, will enable more and more people to just come to Rust and build what they want to build.

The three building blocks I see as fundamental for an ML ecosystem are:

  • n-dimensional arrays;
  • dataframes;
  • an ML model interface.

I have spent the last year, when it comes to open-source contributions, enhancing n-dimensional arrays: direct contributions to ndarray, statistical routines on top of it (ndarray-stats) and tutorials to help people get into the Rust scientific ecosystem from Python, Julia or R. I do believe that ndarray is in more than good shape to fulfil NumPy's role in the Rust ecosystem.

There is now movement as well when it comes to dataframes - a discussion is taking place at https://github.com/rust-dataframe/discussion/issues/1 to explore use cases and potential designs. (The idea of opening this repository comes directly from this experiment of community-led design for dataframes).

Given that one of the two data structures usually consumed by ML models is ready (n-dimensional arrays) and the other one is baking (dataframes), I think it's time to start thinking about what to do with the ML-specific piece.

I don't want to steer the debate too much with the opening post (I'll chip in once the discussion starts), but the questions I'd like to see tackled are:

  • what use-cases could make Rust shine in the ML ecosystem?
  • what are the basic capabilities that have to be built to enable the usage of Rust for ML workloads?
  • how should we structure such a project? A core library with a few traits and a set of separate crates tackling different aspects? A large batteries-included scikit-learn equivalent?
  • why do you want to use Rust for ML?

LukeMathWalker avatar Apr 28 '19 09:04 LukeMathWalker

I want to note that, while it works great, https://github.com/twistedfall/opencv-rust is not particularly user-friendly or 'clean' in Rust terms.

Maybe we could have a look at it?

Kibouo avatar May 01 '19 12:05 Kibouo

I think the use case that could make Rust shine is deployment. Currently the de-facto "mainstream" stack is Python-based (scikit-learn, np, pandas + your DL framework of choice: TensorFlow, Torch...). It shines for fast prototyping, because Python, but it sucks for industrialization (and deployment), because... Python. I really think Rust would do great in that area. I kinda like TensorFlow Serving, but it forces you to have a separate service (that you call with their protobuf/RPC). So:

  • nice conventions for training/inference
  • standard ways of serializing and loading models, and exposing them to more "enterprisey" stacks, either with some kind of FFI (e.g. JVM <-> JNI...) or RPC.
  • with all the goodies required for industrial setups (monitoring, robustness, ease of deployment...).

flo-dhalluin avatar May 01 '19 13:05 flo-dhalluin

I'm currently building a large project with Rust (I mention it here: https://users.rust-lang.org/t/interest-for-nlp-in-rust/15331/9), where I am doing the data engineering in Rust (lots of string metrics). [tl;dr: I found lots of disparate projects with 50% of what I needed for string metrics, but instead rolled my own, trying to incorporate previous work and give credit.] I want to feed the feature vectors to Julia to experiment with what I want to use for classification and modelling, and then I'll want to be able to use Rust for inference/classification etc. I had to pause development for business reasons, but I'm starting again: one of my biggest issues was not ML related but finding a nice pattern for parallel file download (seems like it should be simple, but maybe I'm spoiled by Go's simplicity lol).

From this real-world project point of view, as well from my time spent thinking in the abstract and surveying the ML ecosystem in rust (about a year), I would think that a focus on data engineering in general and serving models is the way to go (this also seems to be a widely shared sentiment). In a practical sense I would like to see rust jobs for data engineer and machine learning engineer... that is, the bookends of a typical data science project; serving the data and serving the model.

That is, targeting software developers, infrastructure, computational math, and data people. Trying to convince research scientists to use Rust would be wasted effort; for most of these people software is a secondary skill, so they need something easy to learn, dynamically typed, with a REPL. I've watched this play out in the Python/R/Matlab versus Julia world, and while IMO Julia has a lot to offer current Python/R/Matlab devs and is similar enough to those languages, getting that group of people to use Julia is not easy; I can't imagine what it'd be like proposing Rust.

Here are some challenges I see:

  • Dataframes: figuring out what to do with missing data is a challenge (I watched the Julia community struggle with that this last year).
  • LinearAlgebra: ndarray, nalgebra are both active projects... is there duplicated effort? (there are others as well).
  • Rust types more friendly for math: I've seen the power in Julia of being able to specify AbstractArray or Real as a type, which allows you to build generic functions that accept a vector of float32 or float64 (see the sketch after this list).
  • Swift: Google and numerous well-known people (Chris Lattner [LLVM, Swift], Jeremy Howard [fast.ai]) have put their support behind Swift for TensorFlow. IMO Swift has a really long way to go, but for Rust, tackling areas the swift-for-tf project is not focusing on would be good.
  • Support for Julia: integration with Python is a necessity, but if there is a competitor for the research scientist in the Python world it is Julia, and I'd imagine keeping an eye on playing well with Julia could be a benefit. Competition here is hard to forecast, and Julia and Rust are on really different ends of the spectrum; while Julia pushes solving the "two language" problem, I see no problem using Rust and Julia in a project. I doubt competition is an issue, not like with Swift.
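To illustrate the "generic over float width" point above, here is a minimal sketch assuming the num-traits crate (the mean function itself is just a hypothetical example):

use num_traits::Float;

// One function body that accepts &[f32] and &[f64] alike.
fn mean<T: Float>(xs: &[T]) -> T {
    let n = T::from(xs.len()).unwrap(); // Float implies NumCast, so usize converts to T
    xs.iter().fold(T::zero(), |acc, &x| acc + x) / n
}

fn main() {
    let a: Vec<f32> = vec![1.0, 2.0, 3.0];
    let b: Vec<f64> = vec![1.0, 2.0, 3.0];
    println!("{} {}", mean(&a), mean(&b));
}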

jbowles avatar May 01 '19 14:05 jbowles

I do believe that ndarray is in more than good shape to fulfil NumPy's role in the Rust ecosystem.

Really looking forward to digging into ndarray. Though I've had a slight delay, I'm writing up ndarray examples for the Grokking Deep Learning book, where Andrew Trask introduces deep learning with only NumPy. He's expressed interest and welcomed the examples... :)

jbowles avatar May 01 '19 14:05 jbowles

A standardized tokenization implementation!

Tokenization fills the role of "turn the text into fixed vectors" that you'd feed into standard models. As an NLP practitioner and Rust user, tokenization is an incredibly important step in the pipeline, a big barrier to new people trying to apply NLP, and a place where lots of small bugs creep in due to non-standard implementations that take forever to find. Having a standard implementation for the simpler tokenization methods (like regex matching) would make NLP problems much more approachable in Rust.
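To make that concrete, a minimal sketch of a regex-based tokenizer, assuming the regex crate (the pattern shown is just one common baseline rule, not a proposed standard):

use regex::Regex;

// Split into word runs or single punctuation marks.
fn tokenize(text: &str) -> Vec<&str> {
    let re = Regex::new(r"\w+|[^\w\s]").unwrap();
    re.find_iter(text).map(|m| m.as_str()).collect()
}

fn main() {
    assert_eq!(tokenize("Don't panic!"), vec!["Don", "'", "t", "panic", "!"]);
}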

soaxelbrooke avatar May 01 '19 16:05 soaxelbrooke

One part of machine learning where Rust could shine right now is simulation for reinforcement learning.

For instance, if I train an agent to play blackjack, the biggest bottleneck here is "playing" blackjack over and over by the agent to collect enough data for training.

Rayon and Actix could be used to create fast and performant game "environments" now, without the need for an established ML ecosystem.
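A minimal sketch of that pattern, assuming the rayon crate (the episode logic is a stand-in, not a real blackjack simulator):

use rayon::prelude::*;

struct Episode {
    reward: f64,
}

// Stand-in rollout; a real blackjack simulation would go here.
fn play_episode(seed: u64) -> Episode {
    Episode { reward: (seed % 3) as f64 - 1.0 } // pretend outcome: -1, 0, or +1
}

fn main() {
    // One adapter change (into_par_iter) spreads the rollouts across all cores.
    let episodes: Vec<Episode> = (0..1_000_000u64).into_par_iter().map(play_episode).collect();
    let total: f64 = episodes.iter().map(|e| e.reward).sum();
    println!("average reward: {}", total / episodes.len() as f64);
}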

DhruvDh avatar May 01 '19 16:05 DhruvDh

I agree with @DhruvDh, using Rust to simulate environments for RL agents would be great.

Having something akin to OpenAI's gym interface would be really nice. Many RL researchers are going to still want to use Python and all the associated deep learning libraries. So, I would love to see RL environments rendered in Rust that could be interfaced with both Python and Rust for agents.

Edit: I imagine that algorithms like Monte Carlo Tree Search would be really useful if they were written in Rust. I would not want to wait on Python to handle that bit.
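A hypothetical sketch (all names are assumptions, not an existing crate) of what a gym-style interface could look like in Rust:

/// A gym-style environment: something an agent can reset and step through.
pub trait Environment {
    type Action;
    type Observation;

    /// Reset the environment and return the initial observation.
    fn reset(&mut self) -> Self::Observation;

    /// Apply an action; return (next observation, reward, episode done).
    fn step(&mut self, action: Self::Action) -> (Self::Observation, f64, bool);
}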

yngtodd avatar May 01 '19 17:05 yngtodd

if I train an agent to play blackjack, the biggest bottleneck here is "playing" blackjack over and over by the agent to collect enough data for training.

Along these lines, I am working on a hobby project (link), which does this. It isn't quite ready for even an alpha release yet, but I am in the final stages of cleaning up the API with the intent to publish it.

masonk avatar May 01 '19 19:05 masonk

Things Rust definitely needs:

  • const generics
  • 16-bit floats
  • GATs (for efficient, non-copying iterators)

Things that we might want but I'm not sure:

  • Standard Inference + Train traits
  • Standard data frames trait

masonk avatar May 01 '19 19:05 masonk

I've been thinking of building a rust deep learning / GPU compute library on top of the TVM framework for a while now. I think it could address a lot of the things @flo-dhalluin is talking about. TVM's an amazing project that's currently flying a bit under the radar. It's an open source deep learning compiler - it compiles deep neural nets / large array operations to run on the GPU (or on OpenCL, or FPGA, or TPU, or WebGL...). You define an AST of computations via its API, and it spits out a small (<5mb) shared library containing just the operations you wanted, on whatever acceleration framework and target platform you want.

It currently has a working Rust runtime library, which lets you call a compiled model from Rust. It integrates with ndarray, and will let you e.g. take in an ndarray::Array, move it to a GPU, run whatever numerical operations you want on it, and get the result back as an ndarray::Array again.

That's pretty neat, and I don't think it would be too hard to build some really cool tools on top of it. My dream is something like:

lib.rs:

// a crate based on tvm
// `cargo build` will (by default) download + checksum a prebuilt TVM library
// that this links to, so that you don't have to wait for a whole compiler to compile.
// The download will only be ~50mb -- way smaller and easier than lots of other deep
// learning frameworks. It will also support running code on things besides cuda!
// The output binary won't need to link the compiler (by default) and will therefore be
// only a few megabytes.
extern crate tvmrs;

// a procedural macro that converts Rust code to Relay IR.
// Relay IR is TVM's high-level IR for defining neural networks / computation chains,
// sorta like a tensorflow Graph. It's also not too dissimilar to Rust.
// The macro will compile the IR with TVM at build-time, and link the resulting artifacts
// to this rust library.
tvmrs::accelerate! {

  // stateless operation
  fn relu_downsample(x: Tensor[c, n, h, w]) -> Tensor[c, n, h/2, w/2] {
     relu(downsample(x))
  }

  // stateful operation
  struct Block<oc> {
    conv: Conv2d<3,3,oc>,
    elu: Elu
  }
  impl Op for Block<oc> {
    fn run(self, input: Tensor[c, n, h, w]) -> Tensor[oc, n, h, w] {
       self.elu(self.conv(input))
    }
  }

  fn swap_channels(x: Tensor[2, n, h, w]) -> Tensor[2, n, h, w] {
    // a low-level tensor operation defined as a TVM Tensor expression.
     let out = compute!(x.shape, |cc, nn, hh, ww| x[((cc + 1) % 2, nn, hh, ww)]);
    out
  }
 
  // a sequential network container.
  sequential! Network {
     #[opencl] Block<5>, // run on opencl
     #[opencl] relu_downsample,
     #[opencl] Conv2d::new(1,1,2),
     #[rust] debug,    // call a normal rust function
     #[cpu] swap_channels // run this part on CPU to maximize throughput
  }

  // Compute a derivative of the network.
  // Relay IR is designed to be differentiable.
  derivative! NetworkDerivative (Network);
}

// a normal rust function
fn debug(x: Tensor) {
  ...
}

train.rs:

fn main() {
  tvmrs::training_loop! {
    net: Network,
    dnet: NetworkDerivative,
    epochs: 37,
    training_data: dataset! {...},
    valid_data: dataset! {...},
    ...
  }
}

run.rs:

fn main() {
   let input = tvmrs::ndarray_from_stdin();
   let output = Network::load_params("params.bin").run(input);
   println!("{:?}", output);
}

(Further reading: Introduction to Relay, TVM Tensor expressions)

All of this is of course pending mountains of bikeshedding; I have no idea what the final API will look like.

One of the nifty things here is that this isn't limited to deep learning models. TVM can handle pretty much any algorithm made of large array operations. So if you wanted to run your SVM on GPU, you can do that pretty easily!

Steps to take here:

  • Talk to the TVM people and see what they think of all this. We could do this work under their umbrella or in a fresh project.
  • Write Rust bindings to the TVM compiler (instead of just the runtime). TVM is written in C++ but is designed to be easy to bind, a lot of the work has already been done here.
  • Design an API like my sketch above that wraps the bindings in some way that makes them easy to use for training + deployment.
  • Build up cargo tooling to allow e.g. prebuilt binary downloads, TVM's auto-tuner support, etc.
  • Beef up TVM's autodifferentiation support. TVM can differentiate Relay IR, but a lot of derivatives aren't actually implemented yet. We could also roll our own autodifferentiation system and just use TVM for compilation; I'd prefer to avoid duplicating work tho.
  • Start writing non-deep-learning algorithms with this system as well, to kick the tires.

If people are interested in this implementation path we could throw a repo together and start work.

I mainly want this because I don't want to be stuck using Python and CUDA all the time for my deep learning research :)))

kazimuth avatar May 01 '19 20:05 kazimuth

A few months ago I have started a crate of my own for deep learning. My goal is to have a library which:

  • Supports both inference and training.
  • Supports the most common deep neural network architectures.
  • Is GPU accelerated.
  • Doesn't use CUDA.
  • Supports every mainstream platform (Linux, MacOS, iOS, Android, Windows, WebAssembly) and hardware (AMD, NVIDIA, Intel GPUs) with a single codebase, and uses the same kernels for consistent results.
  • Is written in pure Rust so that it's trivial to cross-compile.
  • Has a simple to use Keras-like API.
  • Is small and simple enough that it can be reasonably understood and tested end-to-end. (Otherwise you risk situations like e.g. with TensorFlow where for two whole versions their dropout layer was completely broken.)

It's currently totally useless. Right now I'm in the process of adding a Vulkan backend (I have a few thousand lines of work-in-progress code on my disk which I've not pushed yet); once I finish that in a few weeks I plan to further build it up so that I can train CIFAR-10 up to at least ~90% accuracy, add some model import/export functionality (probably from/to the ONNX format), and only then will it be actually usable for something practical.

Some people would call this a waste of time and effort, and, well, I do agree that it would probably be more productive not to do this completely from scratch as I'm doing (e.g. by using TVM as kazimuth said), but I don't really care - I'm just trying to scratch my own itch.

koute avatar May 01 '19 20:05 koute

@kazimuth while I love the snippets you've shown here, a lot of my love for Rust exists because of all the compile-time checks the compiler does and the wonderfully easy-to-comprehend error messages. I feel that if one is using Rust just as a way to compose and run functionality defined in other languages, then there isn't much to gain here. Might as well just use Python.

And TVM looks more like a tool for deploying neural nets rather than training them, which is very useful, but I would prefer to do both in Rust.

There's also tch-rs - bindings to PyTorch's libtorch.

Something else that is also interesting is dual_num, which as best I understand it is some fancy math that might eventually give us automatic differentiation.
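For intuition, a tiny self-contained sketch of the dual-number idea (this is not dual_num's actual API): each value carries its derivative, and arithmetic propagates both:

#[derive(Clone, Copy, Debug)]
struct Dual {
    val: f64, // f(x)
    der: f64, // f'(x)
}

impl Dual {
    fn var(x: f64) -> Self { Dual { val: x, der: 1.0 } }
    fn mul(self, rhs: Dual) -> Self {
        // product rule: (fg)' = f'g + fg'
        Dual { val: self.val * rhs.val, der: self.der * rhs.val + self.val * rhs.der }
    }
    fn add_const(self, c: f64) -> Self { Dual { val: self.val + c, der: self.der } }
}

fn main() {
    // d/dx (x*x + 3) at x = 2 is 4.
    let x = Dual::var(2.0);
    let y = x.mul(x).add_const(3.0);
    println!("f(2) = {}, f'(2) = {}", y.val, y.der); // 7, 4
}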

DhruvDh avatar May 01 '19 20:05 DhruvDh

@koute the long-term road-map is amazing, but I don't get why you'd bother putting effort into the TensorFlow backend. Admittedly I don't have enough know-how to imagine what a native backend would look like and the kind of work it would need.

DhruvDh avatar May 01 '19 20:05 DhruvDh

@DhruvDh The TensorFlow backend will be most likely removed in the future. Currently it is there for a few reasons:

  • I wanted to quickly get something working to experiment with, and to be able to first work on the general interface of the library (e.g. defining the neural network graph, getting data in and out, etc.)
  • I can use it to write a comprehensive test suite and then cross-check that with my own backend. ML algorithms are very hard to write correctly, so I want the extra insurance not only that my algorithms match what I have on paper, but also that they match another widely used framework. (Although from the amount of bugs I've encountered when dealing with TensorFlow, it'd probably have been better to pick a different framework...)

koute avatar May 01 '19 20:05 koute

Some cool stuff coming to light. Is anyone familiar with the work presented at c4ML? https://www.c4ml.org/ I don't think any of the presentations were using Rust... but this is certainly a space Rust could be competitive in. With that in mind, are any of the Rust compiler team interested in ML?

Here are some references to work being done in Swift and Julia (note: Rust, Swift, and Julia were all top of the list for Google's TensorFlow project that eventually became swift-for-tf), e.g. automatic differentiation and differentiable programming: https://github.com/tensorflow/swift/blob/master/docs/AutomaticDifferentiation.md, https://juliacomputing.com/blog/2019/02/19/growing-a-compiler.html. See also Swift MLIR (https://drive.google.com/file/d/1hUeAJXcAXwz82RXA5VtO5ZoH8cVQhrOK/view) and Julia Zygote (https://www.julialang.org/blog/2018/12/ml-language-compiler).

I don't know of any projects in Rust along these lines ^^ ... of course, they are also all funded (google, and julia computing).

jbowles avatar May 01 '19 21:05 jbowles

@koute yeah makes sense.

@jbowles There was this internals thread about Automatic Differentiation here.

DhruvDh avatar May 01 '19 21:05 DhruvDh

@ehsanmok may be interested in this discussion ^^

thanks @DhruvDh

jbowles avatar May 01 '19 21:05 jbowles

@DhruvDh that's a fair criticism, but really that's a problem whenever you want to use a hardware accelerator. You're always going to be calling into a language with different semantics from the host. Using Rust for glue gives you type-safety, performance, and lovely tooling. e.g. it's dead-simple to write a parallel image preprocessing pipeline in Rust, whereas with python you need a load of hacks (FFI, multiprocessing) to get acceptable performance. Also, you're free to define new low-level operations in Rust; users shouldn't ever need to use another language :)

And yeah, currently TVM's publicity is oriented around deployment, because that's where there's a gap in the python ecosystem. There's no reason their compiler wouldn't work for training too, though.

@jbowles I've worked with some of those projects; see my comment, I think we can borrow some of that work.

also CC @nhynes

kazimuth avatar May 01 '19 21:05 kazimuth

Other thought: I wonder what interactive scientific programming would look like in Rust? There's a Jupyter kernel but I'm not sure how usable it is.

It might be that Rust should just be used for high-performance kernels and the like, and be easy to call from other languages, like you lay out in your presentation @LukeMathWalker.

kazimuth avatar May 01 '19 21:05 kazimuth

Wow, there really is a lurking interest 😛 This is just great.

The discussion has explored several different directions; I'd like to give more detail on what I envision (and where that need comes from).

I strongly align with @flo-dhalluin: I think Rust can really shine in delivering an end-to-end production workflow. Rust has incredible potential when it comes to the beginning (data pipelines, preprocessing) and the end (performant web servers, speaking multiple protocols) of the ML workflow. Establishing early on a way to cover the whole workflow is going to be a key prerequisite for adoption - filling a painful gap in the ML ecosystem at large while delivering a top-notch experience with great tooling.

Tackling this challenge requires the building blocks I mentioned (n-dimensional arrays, dataframes) and some others that have been brought up (e.g. running code on different types of hardware, easy interop, reading/writing to a lot of different formats).

Certain capabilities can be borrowed from other languages; others we should probably port and develop natively in Rust (a sufficiently large zoo of preprocessing techniques and standard models).

While I do understand the interest in the Deep Learning area, I don't think it's realistic to kickstart an effort to make Rust a primary language for NN development: we should definitely be able to deploy and run NN models (the TVM project is an excellent example here), but I don't think we would be adding a lot of value by chasing huge projects like TensorFlow or PyTorch. There are, instead, a lot of things in the TensorFlow ecosystem that are extremely interesting (e.g. TensorFlow Serving), but they do end up locking you into TensorFlow itself: if we could replicate those conveniences in a framework-agnostic fashion, we could definitely capture a need in that space.

Summing it up, the minimum working prototype that I have in mind to show off what Rust can do goes along these lines:

  • Huge datasets as input;
  • Heavy-weight, massively parallel data preprocessing pipeline (e.g. NLP or images would be good candidates);
  • Very simple model to be trained on top of the pipeline output;
  • Configuration-based deployment of the serialized model using Rocket: you just define very basic things in a YAML file (e.g. HTTP vs gRPC, monitoring, logging, etc.) and you get a fully working web server that serves your model. This will have to rely on a sufficiently general Model trait (see the sketch after this list).
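As a starting point for that trait, a purely hypothetical sketch (every name here is an assumption; serde is assumed for request/response encoding):

use serde::{de::DeserializeOwned, Serialize};

pub trait Model {
    type Input: DeserializeOwned; // decoded from the HTTP/gRPC request body
    type Output: Serialize;       // encoded into the response
    type Error: std::error::Error;

    /// Load trained parameters from a serialized artifact.
    fn load(path: &std::path::Path) -> Result<Self, Self::Error>
    where
        Self: Sized;

    /// Run inference on one input.
    fn predict(&self, input: Self::Input) -> Result<Self::Output, Self::Error>;
}

The config-driven server would then be generic over any M: Model, which is what would keep it framework-agnostic.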

If we manage to get the experience right, I am quite sure that interest in Rust for this kind of use case would skyrocket.

LukeMathWalker avatar May 01 '19 22:05 LukeMathWalker

While I do understand the interest in the Deep Learning area, I don't think it's realistic to kickstart an effort to make Rust a primary language for NN development: we should definitely be able to deploy and run NN models (the TVM project is an excellent example here), but I don't think we would be adding a lot of value by chasing huge projects like TensorFlow or PyTorch.

I agree, however, you're looking at it from a perspective of a data scientist who wants to fill in the gaps of their existing workflow and augment their ML pipeline with Rust. I'm looking at it from a perspective of a Rust developer who just wants to augment their existing application with a little ML without going through the hoops of exporting their data, processing it through a mainstream ML framework, and serializing it back so that it can be used by the application again.

In other words - my personal interest lies in not filling a gap in the existing ML ecosystem (although that's also most certainly worthwhile!), but in filling a gap in the Rust ecosystem by creating value for existing Rust users (and perhaps the users of other languages) so that they could take advantage of ML in a plug-and-play fashion with minimal amount of fuss. (Which is why things like wide hardware and platform support, simplicity, lack of non-Rust dependencies so it's easy to build and cross-compile, etc. is important.)

koute avatar May 01 '19 22:05 koute

I can volunteer work to rust-ml for tokenizers, string distance metrics, and/or a one-hot encoding package. I've already been working on the first two, as I have real-world projects that need them, so I can double up. As for a one-hot package, I'm interested to learn more about how efficient one-hot encoding is done under the hood, and I have a use for the package as well.

  • string distance metrics (jaro, jaro-winkler, ngram, qgram, ratcliff-obershelp; a small q-gram sketch follows the test code below)

  • tokenizers: for one, Rust is awesome for writing tokenizers. But IME it's kinda hard to write general tokenizers, since their use is often highly dependent on per-project needs (for example I wrote this [https://github.com/jbowles/nlpt-tkz] and used it for a project, and it's not found much use since). Or if there were consensus on using something like the NLTK tokenizers as a guide, I don't mind working on those either. If there is a need for things like the examples below, I can cherry-pick these out of my current project (a hotel and product matching thing) for a rust-ml package... these were written specifically for string comparison and not the typical tokenization found in NLP pipelines, but it would not be too hard to adapt them to accept and return a specific data type...

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn on_word_splitter() {
        fn word_split(c: char) -> bool {
            match c {
                '\n' | '|' | '-' => true,
                _ => false,
            }
        }
        let res = TokenizerNaive::word_splitter("HelLo|tHere", &word_split);
        assert_eq!(res, vec!["HelLo", "tHere"])
    }
    #[test]
    fn on_tokens_lower_filter() {
        fn tokens_filter(c: char) -> bool {
            match c {
                '-' | '|' | '*' | ')' | '(' | '&' => true,
                _ => false,
            }
        }
        let res = TokenizerNaive::tokens_lower_with_filter("|HelLo tHere", &tokens_filter);
        assert_eq!(res, " hello there");

        let res1 = TokenizerNaive::tokens_lower_with_filter("HelLo|tHere", &tokens_filter);
        assert_eq!(res1, "hello there");

        let res2 = TokenizerNaive::tokens_lower_with_filter("HelLo tHere", &tokens_filter);
        assert_eq!(res2, "hello there");

        let res6 =
            TokenizerNaive::tokens_lower_with_filter("****HelLo *() $& )(tH*ere", &tokens_filter);
        assert_eq!(res6, "    hello     $    th ere");
    }

    #[test]
    fn on_pre_process() {
        let res = TokenizerNaive::pre_process("Hotel & Ristorante Bellora");
        assert_eq!(res, "hotel ristorante bellora");

        let res1 = TokenizerNaive::pre_process("Auténtico Hotel");
        assert_eq!(res1, "auténtico hotel");

        let res2 = TokenizerNaive::pre_process("Residence Chalet de l'Adonis");
        assert_eq!(res2, "residence chalet de l adonis");

        let res6 = TokenizerNaive::pre_process("HOTEL EXCELSIOR");
        assert_eq!(res6, "hotel excelsior");

        let res6 = TokenizerNaive::pre_process("Kotedzai Trys pusys,Pylimo ");
        assert_eq!(res6, "kotedzai trys pusys pylimo");

        let res6 = TokenizerNaive::pre_process("Inbursa Cancún Las Américas");
        assert_eq!(res6, "inbursa cancún las américas");
    }

    #[test]
    fn on_tokens_alphanumeric() {
        let res3 = TokenizerNaive::tokens_alphanumeric("|HelLo tHere");
        assert_eq!(res3, " HelLo tHere");

        let res4 = TokenizerNaive::tokens_alphanumeric("HelLo|tHere");
        assert_eq!(res4, "HelLo tHere");

        let res5 = TokenizerNaive::tokens_alphanumeric("HelLo * & )(tHere");
        assert_eq!(res5, "HelLo       tHere");
    }

    #[test]
    fn on_tokens_lower() {
        let res = TokenizerNaive::tokens_lower_str("HelLo tHerE");
        assert_eq!(res, "hello there")
    }

    #[test]
    fn on_tokens_simple() {
        assert_eq!(
            TokenizerNaive::chars("hello there"),
            ["h", "e", "l", "l", "o", " ", "t", "h", "e", "r", "e"]
        );
        assert_eq!(
            TokenizerNaive::chars("hello there").concat(),
            String::from("hello there")
        )
    }

    #[test]
    fn on_similarity_identity() {
        assert_eq!(TokenCmp::new_from_str("hello", "hello").similarity(), 100);
    }

    #[test]
    fn on_similarity_high() {
        assert_eq!(TokenCmp::new_from_str("hello b", "hello").similarity(), 83);
        assert_eq!(
            TokenCmp::new_from_str("this is a test", "this is a test!").similarity(),
            97
        );
        assert_eq!(
            TokenCmp::new_from_str("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear").similarity(),
            91
        );
    }
    #[test]
    fn on_token_sequencer() {
        let an = AlphaNumericTokenizer;
        let one = an.sequencer("Marriot &Beaches Resort|").join(" ");
        let two = an.sequencer("Marriot& Beaches^ Resort").join(" ");
        assert_eq!(one, two);
    }
    #[test]
    fn on_token_sort() {
        let s1 = "Marriot Beaches Resort foo";
        let s2 = "Beaches Resort Marriot bar";
        assert_eq!(TokenCmp::new_from_str(s1, s2).similarity(), 62);
        let sim = token_sort(s1, s2, &TokenCmp::new_sort, &TokenCmp::similarity);
        assert_eq!(sim, 87);
    }
    #[test]
    fn on_token_sort_again() {
        let s1 = "great is scala";
        let s2 = "java is great";
        assert_eq!(TokenCmp::new_from_str(s1, s2).similarity(), 37);
        let sim = token_sort(s1, s2, &TokenCmp::new_sort_join, &TokenCmp::similarity);
        assert_eq!(sim, 81);
    }
    #[test]
    fn on_amstel_match_for_nate() {
        let sabre = "INTERCONTINENTAL AMSTEL AMS";
        let ean = "InterContinental Amstel Amsterdam";
        assert_eq!(TokenCmp::new_from_str(sabre, ean).similarity(), 20);
        assert_eq!(TokenCmp::new_from_str(sabre, ean).partial_similarity(), 14);
        assert_eq!(
            token_sort(sabre, ean, &TokenCmp::new_sort, &TokenCmp::similarity),
            79
        );

        assert_eq!(
            token_sort(
                sabre,
                ean,
                &TokenCmp::new_sort,
                &TokenCmp::partial_similarity
            ),
            78
        );
    }

    #[test]
    fn on_partial_similarity_identity() {
        let t = TokenCmp::new_from_str("hello", "hello");
        assert_eq!(t.partial_similarity(), 100);
    }

    #[test]
    fn on_partial_similarity_high() {
        let t = TokenCmp::new_from_str("hello b", "hello");
        assert_eq!(t.partial_similarity(), 100);
    }

    #[test]
    fn on_similarity_and_whitespace_difference() {
        let t1 = TokenCmp::new_from_str("hello bar", "hello");
        let t2 = TokenCmp::new_from_str("hellobar", "hello");
        let sim1 = t1.similarity();
        let sim2 = t2.similarity();
        assert_ne!(sim1, sim2);
        assert!(sim1 < sim2);
        assert_eq!(sim1, 71);
        assert_eq!(sim2, 77);
    }
}
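On the string distance side, a minimal sketch (hypothetical, not the project's actual code) of one of the metrics listed above - a Dice coefficient over q-grams with q = 2:

use std::collections::HashSet;

// Collect the q-grams of a string (byte-indexed; `get` safely skips any
// slice that would split a multi-byte character).
fn qgrams(s: &str, q: usize) -> HashSet<&str> {
    if s.len() < q { return HashSet::new(); }
    (0..=s.len() - q).filter_map(|i| s.get(i..i + q)).collect()
}

// Dice coefficient: 2 * |shared| / (|a| + |b|), in [0, 1].
fn dice_bigram(a: &str, b: &str) -> f64 {
    let (ga, gb) = (qgrams(a, 2), qgrams(b, 2));
    if ga.is_empty() && gb.is_empty() { return 1.0; }
    let shared = ga.intersection(&gb).count() as f64;
    2.0 * shared / (ga.len() + gb.len()) as f64
}

fn main() {
    println!("{:.2}", dice_bigram("marriot beaches", "beaches marriot")); // high overlap
    println!("{:.2}", dice_bigram("rust", "go"));                        // no overlap
}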

jbowles avatar May 01 '19 23:05 jbowles

Summing it up, the minimum working prototype that I have in mind to show off what Rust can do goes along these lines:

This is a very cool idea :)

Question: what would a general Model trait look like? I think the challenge is striking a balance between generality and specificity; you don't want to tie people down too much, but you need some sort of understanding of what you're doing to be able to use it in a general context.

We might want to brainstorm a list of goals / requirements for the design, before we start writing code. Maybe in another issue?

@jbowles

But IME it's kinda hard to write general tokenizers since their use is often highly dependent on per-project needs

Do you think it would be possible to do something with a trait-based approach here? Like, in the Rust pattern of building up a stack of combinators, you get Parallel<Lower<UnicodeSplitter<...>>> and it ends up with near-handwritten performance? I don't know much about NLP, so forgive me if I'm missing stuff here.

kazimuth avatar May 02 '19 00:05 kazimuth

@kazimuth yes, I think that would be the way: allow the user to compose a tokenizer.

The TokenizerNaive I showed above is naive specifically because it is not trait-based; it does some text normalization for the user, allowing the user to build and pass in a function for char matching/filtering.

I do have a trait-based approach (ideas I got from this: Text-Analysis-in-Rust-Tokenization) in my current project, but those are in service of tokenizing for comparing token similarity.

With full-blown tokenization, an API should let a user compose the various things they need (e.g., a char filter, normalizing text, etc.), like your example. The hard part I'm really referring to is the output of the tokenization. For example,

I have a function sequencer that returns a Vec of tokens:

Vec<std::borrow::Cow<'a, str>>;

First, I'm new enough to Rust to still not totally understand all the consequences of using Cow :) ... and also, instead of a Vec<> it likely needs to return a different kind of vector that plays well with one-hot encoding or word embeddings, etc. If you are familiar with Python's scikit-learn, think of the "Vectorizers" it has for turning arrays of strings into arrays of numbers [IMO this is always the hardest part of NLP]:

import pandas as pd
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer
)

texts = ["foo bar", "bar foo zaz", "did bar", "zaz bar jazz", "good jazz zaxx"]

tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
pd.DataFrame(features.todense(), columns=tfidf.get_feature_names())

d_vtz = CountVectorizer()
print(d_vtz.fit_transform(texts))

h_vtz = HashingVectorizer()
print(h_vtz.fit_transform(texts))

It seems what one would want in Rust is a tokenizer that returns vectors of tokens that can just be "plugged in" to lots of different ways of turning text into numbers.

jbowles avatar May 02 '19 02:05 jbowles

I'd like to see more individual specialised components that form part of an ML pipeline, rather than anything monolithic attempting to implement too much at once.

This gives Rust a chance to build up its ML strengths, slowly replacing individual parts of a mature ML pipeline. Having e.g. python bindings to those components would also allow them to start getting used and proving benefit, without needing a 100% switch to Rust.

Modules/components I'd love to see:

  • text vectorisation (e.g. fast/parallel versions of count/tfidf vectorisers; a small sketch follows this list)
  • dimensionality reduction (e.g. PCA, tSNE)
  • scaling/normalisation
  • hyperparameter optimisation
  • data structure interop (e.g. to/from pandas/arrow/parquet etc.)
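Picking up the text-vectorisation bullet above, a minimal sketch (all names hypothetical) of a count vectoriser: a shared vocabulary plus sparse per-document counts:

use std::collections::HashMap;

// Returns (term -> column id, one sparse row of (id, count) pairs per document).
fn count_vectorize<'a>(docs: &[Vec<&'a str>]) -> (HashMap<&'a str, usize>, Vec<Vec<(usize, usize)>>) {
    let mut vocab: HashMap<&str, usize> = HashMap::new();
    let mut rows = Vec::with_capacity(docs.len());
    for doc in docs {
        let mut counts: HashMap<usize, usize> = HashMap::new();
        for &tok in doc {
            let next_id = vocab.len();
            let id = *vocab.entry(tok).or_insert(next_id);
            *counts.entry(id).or_insert(0) += 1;
        }
        let mut row: Vec<(usize, usize)> = counts.into_iter().collect();
        row.sort(); // sparse row ordered by term id
        rows.push(row);
    }
    (vocab, rows)
}

fn main() {
    let docs = vec![vec!["foo", "bar"], vec!["bar", "foo", "zaz"]];
    let (vocab, rows) = count_vectorize(&docs);
    println!("{} terms, first row: {:?}", vocab.len(), rows[0]);
}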

davechallis avatar May 02 '19 09:05 davechallis

Question: what would a general Model trait look like? I think the challenge is striking a balance between generality and specificity; you don't want to tie people down too much, but you need some sort of understanding of what you're doing to be able to use it in a general context.

We might want to brainstorm a list of goals / requirements for the design, before we start writing code. Maybe in another issue?

An article I found very interesting, from 2 years ago, is this one: http://athemathmo.github.io/2016/09/07/typesystem-machine-learning.html It's from the author of rusty-machine, if I am not mistaken. We should definitely brainstorm a list of goals and requirements here before starting to write code. It would also be worthwhile to see what features in the lang team pipeline could be useful for us.

I agree, however, you're looking at it from a perspective of a data scientist who wants to fill in the gaps of their existing workflow and augment their ML pipeline with Rust. I'm looking at it from a perspective of a Rust developer who just wants to augment their existing application with a little ML without going through the hoops of exporting their data, processing it through a mainstream ML framework, and serializing it back so that it can be used by the application again.

In other words - my personal interest lies in not filling a gap in the existing ML ecosystem (although that's also most certainly worthwhile!), but in filling a gap in the Rust ecosystem by creating value for existing Rust users (and perhaps the users of other languages) so that they could take advantage of ML in a plug-and-play fashion with minimal amount of fuss. (Which is why things like wide hardware and platform support, simplicity, lack of non-Rust dependencies so it's easy to build and cross-compile, etc. is important.)

My loyalty is divided, to say the least: I'd love to be able to host 100% of my workflow in Rust, because I strongly believe in the language's potential and in the potential of the tooling around it. I wouldn't say though that our goals are at odds @koute: it's just a matter of deciding in which order we should be tackling challenges. A good set of crates for preprocessing and deployment is going to be just as necessary for a purely Rust-based workflow as for a mixed-language workflow. Once they are established, we can then shift focus to porting more and more models and algorithms to Rust. I wholeheartedly agree with @davechallis:

I'd like to see more individual specialised components that form part of an ML pipeline, rather than anything monolithic attempting to implement too much at once. This gives Rust a chance to build up its ML strengths, slowly replacing individual parts of a mature ML pipeline. Having e.g. python bindings to those components would also allow them to start getting used and proving benefit, without needing a 100% switch to Rust.

Thanks to the strong packaging and distribution story provided by Rust, the effort of fleshing out algorithms and preprocessing tools can be extremely distributed: once there is a set of agreed-upon traits as interfaces, we can leverage the influx of people who are fascinated and allow them to be productive and develop new crates without having to worry about the fundamentals. That's why I think it's strategic to have a pure Rust implementation of DataFrames and n-dimensional arrays, for instance. We don't need a huge monolith like SciPy or Scikit-learn.

LukeMathWalker avatar May 02 '19 09:05 LukeMathWalker

@kazimuth that Jupyter kernel is usable; I'm starting to learn AI with it here: https://github.com/swfsql/deep-learning-coursera (by oxidizing Python code). (Currently, only the first assignment is in Rust.)

swfsql avatar May 02 '19 13:05 swfsql

This gives Rust a chance to build up its ML strengths, slowly replacing individual parts of a mature ML pipeline. Having e.g. python bindings to those components would also allow them to start getting used and proving benefit, without needing a 100% switch to Rust. 💯

Seems to me one of the more difficult problems doing this in Rust is getting common traits and types defined for different packages to interface with. If I'm not mistaken @LukeMathWalker, you seem to point towards using ndarray as basically NumPy. I'm all on board with that.

What if there were something like a core package that defined some of the core traits and structs and types? I can see lots of pros/cons for doing that.

jbowles avatar May 02 '19 13:05 jbowles

@jbowles RE: tokenizer API. Hm, I see the challenge there. Well, for one thing you should probably use Iterators between operations instead of Vecs, or design a trait similar to Iterator; that should reduce the problem of having to keep big buffers between each transformation. Then I think the path would be to pick and choose input requirements for each operation, and let operations output whatever they want. E.g. HashVectorizer takes impl Iterator<Item = impl Deref<Target = str>>, and then users can pass in Iterator<&str>, Iterator<String>, Iterator<Cow>, whatever.
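A minimal sketch of that combinator style (names are assumptions): each stage is lazy, so nothing is buffered until the final collect:

// One pipeline stage: lowercasing. It borrows the previous stage lazily.
fn lowercase<'a, I>(tokens: I) -> impl Iterator<Item = String> + 'a
where
    I: Iterator<Item = &'a str> + 'a,
{
    tokens.map(|t| t.to_lowercase())
}

fn main() {
    let text = "HelLo ML World";
    let tokens = text.split_whitespace(); // a trivial whitespace tokenizer stage
    let normalized: Vec<String> = lowercase(tokens).collect();
    assert_eq!(normalized, vec!["hello", "ml", "world"]);
}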

This gets at a broader problem with a simple function-y Model(Input) -> Output trait; it works for in-memory datasets, but once your dataset is large enough that you want to start streaming / distributing work over multiple machines, the abstraction sorta breaks down. We could instead do something graphy, where you just have nodes that ingest and spit out streams of data... but then we'll have to work with something graphy, with nodes that ingest and spit out streams of data :P

It might make sense to just start implementing without a core crate of traits, and once we've smacked into enough walls in the design space, we can figure out what the interfaces to our systems tend to look like, and retrofit a core design around that.

kazimuth avatar May 02 '19 15:05 kazimuth

Although I'm not sure that Rust is going to usurp Python and C++ as the de-facto ML programming model, it's definitely a worthy goal. Along those lines, I think that flashlight (and the underlying arrayfire library) has an interface that we might want to emulate.

In any case, the real key feature of PyTorch and JAX is the expressivity of Python backed by a high-performance JIT tensor compiler. I'm pretty sure it's possible to do something similar in Rust by writing a compiler plugin that tracks the types+ops of ndarrays and provides the data to a JIT compiler.

Maybe something like

#[jit]
fn mlp(
    data: &Array<2, f32>,
    weights: Vec<&Array<2, f32>>,
    labels: &Array<1, u8>
) -> f32 {
    let fc1 = data.dot(weights[0]); // fn dot -> Array<D, T, Op=gemm>
    Array::pointwise_max(0, fc1) // Array<D, T, Op=Max<0, fc1>> 
}

This is just a sketch and depends on how const generics actually pan out, but the idea is that a compiler plugin can find the #[jit] functions and either pre-compile them or add them to a runtime cache, replacing the original definition with a call into the cache. This is not too dissimilar to TVM's hybrid mode. We probably don't want to write a tensor compiler, so we could offload that to TVM and link in the static library.

nhynes avatar May 02 '19 16:05 nhynes