minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

Results 61 minbpe issues

Add error handling to the `load` method in `base.py`, specifically handling cases where the model file format might be incorrect or corrupted. Modifications include: 1. Checking the model file version...
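A minimal sketch of the kind of version check this issue describes, assuming a text model file whose first line is a version string. The constant `EXPECTED_VERSION`, the exception `ModelFormatError`, and the helper `read_model_header` are hypothetical names used for illustration, not the repository's actual code.

```python
EXPECTED_VERSION = "minbpe v1"  # assumption: first line of a valid model file


class ModelFormatError(ValueError):
    """Raised when a model file is missing its header or looks corrupted."""


def read_model_header(path):
    """Return the version line of a model file, or raise ModelFormatError."""
    if not path.endswith(".model"):
        raise ModelFormatError(f"expected a .model file, got: {path}")
    try:
        with open(path, "r", encoding="utf-8") as f:
            version = f.readline().strip()
    except UnicodeDecodeError as exc:
        # Bytes that are not valid UTF-8 suggest a corrupted or wrong file.
        raise ModelFormatError(f"{path} is not valid UTF-8 text") from exc
    if version != EXPECTED_VERSION:
        raise ModelFormatError(
            f"unsupported model version {version!r}, expected {EXPECTED_VERSION!r}"
        )
    return version
```

Raising a dedicated exception instead of relying on a bare `assert` lets callers tell a bad file apart from other failures, and the check still runs when Python is started with `-O`.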

When Karpathy claimed an efficient implementation of the BPE optimizer doesn't exist, I did some research and found this on Hugging Face: https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/bpe/trainer.rs Isn't this exactly what Karpathy was creating?

Without the padding, the sentences end up being different sizes and we get stacking errors at data loading time.
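To illustrate the stacking problem, here is a small sketch that right-pads tokenized sentences to a common length so they can form a rectangular batch. The function `pad_batch` and the `pad_id` value are hypothetical; the issue does not say which padding token or data loader is involved.

```python
def pad_batch(sequences, pad_id):
    """Right-pad each token id list with pad_id up to the longest length."""
    max_len = max((len(s) for s in sequences), default=0)
    return [list(s) + [pad_id] * (max_len - len(s)) for s in sequences]


batch = pad_batch([[5, 6, 7], [8]], pad_id=0)
# Every row now has the same length, so the batch can be stacked.
```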

I've been experimenting with my own little BPE implementations in other programming languages. It seems that major bottlenecks are counting the frequencies of the pairs each iteration and merging the...
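The two steps named here, counting adjacent pairs and merging the chosen pair, can be sketched in plain Python as follows. The names `count_pairs` and `merge_pair` are illustrative and not taken from any particular implementation. Both functions scan the whole sequence, which is why repeating them on every training iteration gets expensive.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of pair, left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out


ids = list("aaabdaaabac".encode("utf-8"))
pair, count = count_pairs(ids).most_common(1)[0]
merged = merge_pair(ids, pair, 256)  # 256 is the first id after raw bytes
```

One training iteration is a full pass for counting plus a full pass for merging. Faster implementations typically avoid recounting from scratch by updating only the counts around each merged position.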

πŸ‘‹πŸ»πŸ™‹πŸ»β€β™‚οΈ **Hello @karpathy ,** Firstly, I apologize for creating an "issue," but it seems to be an effective way to reach you. I have been following your lectures since last...

I implemented a much faster training/tokenization algorithm in C++ and called the functions using ctypes, so that they can be used conveniently in Python. The performance gain is huge, i...

[Gregor Purdy (@gnp)](https://github.com/gnp) is working on a Rust version of `minbpe`: [minbpe-rs](https://github.com/gnp/minbpe-rs) The Rust crate (similar to a package in Python) contains the three tokenizers currently included in the Python...

A little more reliability I guess?

It appears that the decode() method in the GPT4Tokenizer class does not handle special tokens. I submitted a pull request (#63) with some updated code, but also wanted to post...
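As a rough sketch of what special-token handling in a decoder can look like, the function below checks the regular vocabulary first and then falls back to a table of special tokens. The standalone `decode` function, its parameters, and the example token id are assumptions for illustration. This is not the code from the pull request, and it does not cover any byte reordering specific to `GPT4Tokenizer`.

```python
def decode(ids, vocab, special_tokens):
    """Decode token ids to text.

    vocab maps id -> bytes; special_tokens maps string -> id.
    """
    inverse_special = {idx: text for text, idx in special_tokens.items()}
    parts = []
    for idx in ids:
        if idx in vocab:
            parts.append(vocab[idx])
        elif idx in inverse_special:
            parts.append(inverse_special[idx].encode("utf-8"))
        else:
            raise ValueError(f"invalid token id: {idx}")
    # errors="replace" keeps decoding from failing on partial UTF-8 sequences.
    return b"".join(parts).decode("utf-8", errors="replace")


vocab = {i: bytes([i]) for i in range(256)}      # toy byte-level vocabulary
special = {"<|endoftext|>": 100257}              # illustrative id
text = decode(list(b"hi") + [100257], vocab, special)
```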

…okens. Copy-pasted decode() method from class RegexTokenizer to allow handling of special tokens; and added two lines to "unshuffle" bytes objects.