`minbpe-rs`: A pure Rust implementation of `minbpe`
Gregor Purdy (@gnp) is working on a Rust version of minbpe: `minbpe-rs`.

The Rust crate (similar to a package in Python) contains the three tokenizers currently included in the Python version of minbpe: `BasicTokenizer`, `RegexTokenizer` and `GPT4Tokenizer`. Here's an example, similar to the one in the README of this project, but using `minbpe-rs`:

    use std::path::Path;

    use minbpe::{BasicTokenizer, Saveable, Tokenizer, Trainable};

    fn main() {
        let text = "aaabdaaabac";
        let mut tokenizer = BasicTokenizer::new();
        // Vocabulary size: 256 byte tokens + 3 merges.
        tokenizer.train(text, 256 + 3, false);
        println!("{:?}", tokenizer.encode(text));
        println!("{:?}", tokenizer.decode(&[258, 100, 258, 97, 99]));
        tokenizer.save(Path::new("./"), "toy");
    }

which, on execution, prints:

    $> cargo run
    ...
    Compiling minbpe-test v0.1.0 (~/minbpe-test)
    Finished dev [unoptimized + debuginfo] target(s) in 15.71s
    Running `target/debug/minbpe-test`
    [258, 100, 258, 97, 99]
    "aaabdaaabac"

@gnp is the lead developer, and I (@shubham0204) am working on the docs, examples and README of the project.
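
For anyone curious where `[258, 100, 258, 97, 99]` in the toy example above comes from, here is a rough, std-only sketch of three byte-pair merges over `"aaabdaaabac"`. This is not the code from `minbpe-rs`: the names `pair_counts`, `merge` and `toy_bpe` are made up for illustration, and breaking ties by earliest occurrence is my assumption.

```rust
use std::collections::HashMap;

/// Count how often each adjacent pair of token ids occurs.
fn pair_counts(ids: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for w in ids.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
}

/// Replace each non-overlapping occurrence of `pair`, scanning left to right,
/// with the single token `new_id`.
fn merge(ids: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Hypothetical helper (not part of minbpe-rs): start from the raw UTF-8
/// bytes of `text` and apply up to `num_merges` merges, assigning new token
/// ids from 256 upwards.
fn toy_bpe(text: &str, num_merges: u32) -> Vec<u32> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    for n in 0..num_merges {
        let counts = pair_counts(&ids);
        // Pick the most frequent pair. Assumption: ties go to the pair
        // that appears earliest in the sequence.
        let mut best: Option<((u32, u32), usize)> = None;
        for w in ids.windows(2) {
            let pair = (w[0], w[1]);
            let count = counts[&pair];
            if best.map_or(true, |(_, best_count)| count > best_count) {
                best = Some((pair, count));
            }
        }
        // Fewer than two tokens left: nothing more to merge.
        let Some((pair, _)) = best else { break };
        ids = merge(&ids, pair, 256 + n);
    }
    ids
}

fn main() {
    // 3 merges, matching the `256 + 3` vocabulary size in the example.
    println!("{:?}", toy_bpe("aaabdaaabac", 3)); // prints [258, 100, 258, 97, 99]
}
```

The merges it learns are `(97, 97) -> 256`, `(256, 97) -> 257` and `(257, 98) -> 258`, so `aaab` collapses into the single token 258.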

- `minbpe-rs` will be a good start for the 2nd point in the `todo` section of the `README`: "write an even more optimized C or Rust version (think through)".
- The project also contains a test comparing `RegexTokenizer` with the GPT-4 tokenizer from `tiktoken-rs` (the Rust version of `tiktoken`), similar to `inference: GPT-4 comparison` from the `README`. See the test here.
- Currently, the project has a base level of documentation, which can be enriched by adding more docstrings and examples for the tokenizers.

It would be great if `minbpe-rs` could be added as a community extension in the `README` of this repository, encouraging more developers to work on this Rust implementation and build more features into it (e.g. Python bindings, multi-threading support, or wrappers for Java/C). We would like the community to review `minbpe-rs` and provide feedback or contributions.

Submit a PR, happy to merge.

@karpathy Thanks! Here's the PR #67