minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

Results 50 minbpe issues

I thoroughly enjoy Karpathy's YouTube content; it's consistently top-notch. I've been wondering whether this could potentially evolve into a concise course on platforms like Coursera or edX. The content he...

**Thanks for your nice work! I have a question after reading basic.py, and I want to figure out why...** - In the save function implementation of basic.py, the BBPE vocab...

Thanks for the nice repo! 🙂 Not sure if it's welcome here given that the goal of this repo is to be for educational purposes but you call the `get_stats`...

Hello! I don't remember if I'd shown you https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , but consider stealing the token visualisation code from here in some form: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186 I've found it can be quite useful...

Training the tokenizer is memory-intensive; it needs hundreds of GB of RAM. What about using memmap to load only the required portion of the data...
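
As a rough illustration of the memory-mapping idea raised in this issue, the sketch below counts adjacent byte pairs in a file without reading it all into RAM. It is not how minbpe trains; the function name `count_byte_pairs` and the `chunk_size` parameter are hypothetical, and it covers only the initial pair-counting pass, not the merge loop.

```python
import mmap
import os
from collections import Counter


def count_byte_pairs(path: str, chunk_size: int = 1 << 20) -> Counter:
    """Count adjacent byte pairs in a file through a read-only memory map.

    Hypothetical helper, not part of minbpe.
    """
    counts: Counter = Counter()
    if os.path.getsize(path) == 0:
        return counts  # mmap cannot map an empty file
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            n = len(mm)
            for start in range(0, n, chunk_size):
                # Read one byte past the chunk so the pair that straddles
                # the chunk boundary is counted exactly once.
                chunk = mm[start:min(start + chunk_size + 1, n)]
                counts.update(zip(chunk, chunk[1:]))
    return counts
```

Only one chunk is copied into memory at a time, so peak usage is governed by `chunk_size` plus the size of the counter rather than by the file size.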

Use the `.isprintable()` method to improve the readability of the function (the intent of the check becomes clearer) and to drop the explicit `unicodedata` import.
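
A minimal sketch of what such a change might look like, assuming the function in question escapes characters that should not be rendered verbatim. The name `escape_nonprintable` is hypothetical. Note that `str.isprintable()` is not a drop-in replacement for a Unicode category check: it also rejects separator characters other than the ASCII space, such as the no-break space U+00A0.

```python
def escape_nonprintable(s: str) -> str:
    """Replace every non-printable character with a \\uXXXX style escape.

    Hypothetical helper; str.isprintable() treats Unicode "Other" and
    "Separator" characters (except ASCII space) as non-printable.
    """
    return "".join(
        ch if ch.isprintable() else f"\\u{ord(ch):04x}"
        for ch in s
    )
```

Whether the broader definition of "non-printable" is acceptable depends on how the escaped output is used, so the behavioral difference would be worth checking before swapping one test for the other.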

Hi! I'm not sure if this is the appropriate place for posting this; I'm sorry if it is not. I think there is a way to make the training of...

Batch encoding and decoding:

```python
from minbpe import BasicTokenizer

tokenizer = BasicTokenizer()
tokenizer.train(very_long_training_string, vocab_size=4096)
tokenizer.encode_batch(["hello world", "bye world"])  # list[string] -> list[tokens]
tokenizer.decode_batch([[1000, 2000, 3000], [1000, 2000, 3000]])  # list[tokens]...
```
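
One way the requested batch methods could be approximated today is with thin wrappers around a tokenizer's existing single-item methods. The sketch below assumes only that the tokenizer exposes `encode(text)` and `decode(ids)`; the free functions `encode_batch` and `decode_batch` and the `ByteStandIn` class are hypothetical and exist purely so the example runs without a trained model.

```python
from typing import Protocol


class Tokenizer(Protocol):
    """Assumed interface: any object with encode() and decode()."""

    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...


def encode_batch(tokenizer: Tokenizer, texts: list[str]) -> list[list[int]]:
    # Encode each string independently; output order matches input order.
    return [tokenizer.encode(text) for text in texts]


def decode_batch(tokenizer: Tokenizer, batches: list[list[int]]) -> list[str]:
    # Decode each token list independently.
    return [tokenizer.decode(ids) for ids in batches]


class ByteStandIn:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8", errors="replace")
```

Because the wrappers only loop over single-item calls, they add convenience rather than speed; a real batch implementation might share work across inputs.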

Implement a "token-free" (tokenization-free) encoder that works at the Unicode/UTF-8 character level. Examples:
- CANINE, Unicode code points: [HF](https://huggingface.co/docs/transformers/model_doc/canine), [arxiv](https://arxiv.org/abs/2103.06874)
- ByT5, UTF-8 bytes: [HF](https://huggingface.co/google/byt5-small), [arxiv](https://arxiv.org/abs/2105.13626)
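
To make the two input representations concrete, here is a small sketch that maps text to Unicode code points and to UTF-8 bytes with no learned vocabulary. The function names are hypothetical, and the raw IDs shown are an assumption for illustration: real models typically reserve additional IDs for special tokens.

```python
def codepoint_ids(text: str) -> list[int]:
    # One ID per Unicode code point (character-level input).
    return [ord(ch) for ch in text]


def codepoints_to_text(ids: list[int]) -> str:
    return "".join(chr(i) for i in ids)


def utf8_ids(text: str) -> list[int]:
    # One ID per UTF-8 byte, so every ID is in range(256).
    return list(text.encode("utf-8"))


def utf8_to_text(ids: list[int]) -> str:
    # errors="replace" keeps decoding from raising on invalid byte runs.
    return bytes(ids).decode("utf-8", errors="replace")
```

The trade-off is sequence length against vocabulary size: non-ASCII characters take several bytes each, so byte-level sequences are longer, while code-point IDs range over the whole Unicode space.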