
`minbpe-rs`: A pure Rust implementation of `minbpe`

Open shubham0204 opened this issue 10 months ago • 2 comments

Gregor Purdy (@gnp) is working on a Rust version of minbpe: minbpe-rs.

The Rust crate (similar to a package in Python) contains the three tokenizers currently included in the Python version of minbpe: BasicTokenizer, RegexTokenizer, and GPT4Tokenizer. Here's an example, similar to the one in the README of this project, but using minbpe-rs:

use std::path::Path;
use minbpe::{BasicTokenizer, Saveable, Tokenizer, Trainable};

fn main() {
    let text = "aaabdaaabac";
    let mut tokenizer = BasicTokenizer::new();
    // Vocabulary size: 256 byte tokens + 3 merges.
    tokenizer.train(text, 256 + 3, false);
    println!("{:?}", tokenizer.encode(text));
    println!("{:?}", tokenizer.decode(&[258, 100, 258, 97, 99]));
    // Save the trained tokenizer in the current directory under the name "toy".
    tokenizer.save(Path::new("./"), "toy");
}

which, when run, prints:

$> cargo run

   ...
   Compiling minbpe-test v0.1.0 (~/minbpe-test)
    Finished dev [unoptimized + debuginfo] target(s) in 15.71s
     Running `target/debug/minbpe-test`
[258, 100, 258, 97, 99]
"aaabdaaabac"
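
To show where the ids `[258, 100, 258, 97, 99]` come from, here is a minimal, dependency-free sketch of byte-level BPE on the same toy string. It does not use the minbpe-rs API: the function names `most_frequent_pair`, `merge`, `train`, `encode`, and `decode` are hypothetical, and the tie-breaking rule (the pair that appears first in the sequence wins) is an assumption of this sketch that may differ from what minbpe-rs does internally.

```rust
use std::collections::HashMap;

/// Return the most frequent adjacent pair in `ids`.
/// Assumption of this sketch: ties go to the pair that appears first.
fn most_frequent_pair(ids: &[u32]) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for w in ids.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    // Scan in sequence order so the result is deterministic.
    let mut best: Option<((u32, u32), usize)> = None;
    for w in ids.windows(2) {
        let pair = (w[0], w[1]);
        let count = counts[&pair];
        if best.map_or(true, |(_, c)| count > c) {
            best = Some((pair, count));
        }
    }
    best.map(|(pair, _)| pair)
}

/// Replace every non-overlapping occurrence of `pair` (left to right) with `new_id`.
fn merge(ids: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Learn up to `vocab_size - 256` merges; merge number i creates token 256 + i.
fn train(text: &str, vocab_size: u32) -> Vec<(u32, u32)> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    let mut merges = Vec::new();
    for new_id in 256..vocab_size {
        match most_frequent_pair(&ids) {
            Some(pair) => {
                ids = merge(&ids, pair, new_id);
                merges.push(pair);
            }
            None => break, // fewer than two tokens left
        }
    }
    merges
}

/// Encode by applying the learned merges in the order they were learned.
fn encode(text: &str, merges: &[(u32, u32)]) -> Vec<u32> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    for (i, &pair) in merges.iter().enumerate() {
        ids = merge(&ids, pair, 256 + i as u32);
    }
    ids
}

/// Decode by expanding each token id back into its byte sequence.
fn decode(ids: &[u32], merges: &[(u32, u32)]) -> String {
    let mut vocab: Vec<Vec<u8>> = (0..=255u8).map(|b| vec![b]).collect();
    for &(a, b) in merges {
        let mut bytes = vocab[a as usize].clone();
        bytes.extend_from_slice(&vocab[b as usize]);
        vocab.push(bytes);
    }
    let bytes: Vec<u8> = ids.iter().flat_map(|&id| vocab[id as usize].clone()).collect();
    String::from_utf8_lossy(&bytes).into_owned()
}

fn main() {
    let text = "aaabdaaabac";
    let merges = train(text, 256 + 3);
    println!("{:?}", encode(text, &merges)); // [258, 100, 258, 97, 99]
    println!("{:?}", decode(&[258, 100, 258, 97, 99], &merges)); // "aaabdaaabac"
}
```

With this tie-breaking rule the three merges are `(97, 97) -> 256`, `(256, 97) -> 257`, and `(257, 98) -> 258`, so token 258 stands for the bytes of "aaab", which is why the toy string encodes to five ids.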

@gnp is the lead developer, and I (@shubham0204) am working on the docs, examples, and README of the project.

  • minbpe-rs will be a good start for the 2nd point in the todo section of the README: write an even more optimized C or Rust version (think through).
  • The project also contains a test comparing RegexTokenizer with the GPT-4 tokenizer from tiktoken-rs (the Rust version of tiktoken), similar to inference: GPT-4 comparison from the README. See the test here.
  • Currently, the project has a base level of documentation, which can be enriched by adding more docstrings and examples for the tokenizers.

It would be great if minbpe-rs could be added as a community extension in the README of this repository, encouraging more developers to work on this Rust implementation and build more features into it (e.g., Python bindings, multi-threading support, or wrappers for Java/C). We would like the community to review minbpe-rs and provide feedback or contributions.

shubham0204, Apr 21 '24 15:04

Submit a PR, happy to merge.

karpathy, Apr 21 '24 17:04

@karpathy Thanks! Here's the PR #67

shubham0204, Apr 22 '24 01:04