minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
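The core of BPE training is short: count adjacent token-id pairs, then replace the most frequent pair with a newly minted id. A minimal sketch of that loop (illustrative, not necessarily the repo's exact code):

```python
from collections import Counter

def get_stats(ids):
    # Count how often each adjacent pair of token ids occurs.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# One training step on a toy byte sequence: find the most frequent
# pair and mint a new token id (256, the first id beyond raw bytes).
ids = list("aaabdaaabac".encode("utf-8"))
stats = get_stats(ids)
top_pair = max(stats, key=stats.get)  # (97, 97), i.e. b"aa"
ids = merge(ids, top_pair, 256)
```

Repeating this step `vocab_size - 256` times yields the merge table that the tokenizer later replays at encode time.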
I thoroughly enjoy Karpathy's YouTube content; it's consistently top-notch. I've been wondering whether this could potentially evolve into a concise course on platforms like Coursera or edX. The content he...
**Thanks for your nice work! I have a question after reading basic.py, and I want to figure out why...** - In the save function implementation of basic.py, the BBPE vocab...
Faster BPE
Thanks for the nice repo! 🙂 Not sure if this is welcome here, given that this repo's goal is educational, but you call the `get_stats`...
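Since `get_stats` is rescanned over the whole id sequence on every merge, it tends to dominate training time. One common speed-up (a sketch of the general idea, not necessarily what this issue proposes) is to deduplicate pre-split chunks and weight pair counts by chunk frequency, so a word that occurs a thousand times is only scanned once:

```python
from collections import Counter

def get_stats_weighted(chunks):
    # `chunks` maps a tuple of token ids to how many times that chunk
    # occurs in the corpus; each pair is counted once per unique chunk
    # and weighted by the chunk's frequency.
    stats = Counter()
    for ids, freq in chunks.items():
        for pair in zip(ids, ids[1:]):
            stats[pair] += freq
    return stats

# The chunk (1, 2, 3) occurs 1000 times in the corpus, but the inner
# loop visits it only once.
stats = get_stats_weighted({(1, 2, 3): 1000, (2, 3): 5})
```

The counts are identical to scanning the full corpus, so the resulting merges are unchanged.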
Hello! I don't remember if I'd shown you https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , but consider stealing the token visualisation code from here in some form: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186 I've found it can be quite useful...
Training the tokenizer is memory-intensive: it can need hundreds of GB of RAM. What about using memmap to load only the required portion of the data...
Using the `.isprintable()` method improves the readability of the function (the intent of the check becomes clearer) and removes the explicit `unicodedata` import.
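Assuming this targets the helper that escapes control characters before printing tokens (minbpe's `replace_control_characters`, which checks `unicodedata` categories), the `.isprintable()` version might look like:

```python
def replace_control_characters(s: str) -> str:
    # Escape characters that would distort the output (e.g. newlines),
    # using str.isprintable() instead of unicodedata category checks.
    parts = []
    for ch in s:
        if ch.isprintable():
            parts.append(ch)
        else:
            parts.append(f"\\u{ord(ch):04x}")  # escape as \uXXXX
    return "".join(parts)
```

One behavioral difference worth noting: `.isprintable()` also rejects non-ASCII separators such as `\u00a0`, so it escapes slightly more characters than a pure category-"C" check.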
Hi! I'm not sure if this is the appropriate place for posting this, I'm sorry if it is not. I think there is a way to make the training of...
Batch encoding and decoding:

```python
from minbpe import BasicTokenizer

tokenizer = BasicTokenizer()
tokenizer.train(very_long_training_string, vocab_size=4096)
tokenizer.encode_batch(["hello world", "bye world"])  # list[string] -> list[tokens]
tokenizer.decode_batch([[1000, 2000, 3000], [1000, 2000, 3000]])  # list[tokens]...
```
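These `encode_batch`/`decode_batch` methods are the proposal here, not part of minbpe's current API. One way they could be layered on top of any tokenizer that already provides `encode()` and `decode()` (a hypothetical sketch):

```python
class BatchMixin:
    # Hypothetical mixin: batch methods are just mapped single calls.
    def encode_batch(self, texts):
        return [self.encode(t) for t in texts]

    def decode_batch(self, ids_batch):
        return [self.decode(ids) for ids in ids_batch]

class ByteTokenizer(BatchMixin):
    # Stand-in tokenizer so the mixin can be exercised standalone.
    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")
```

A real implementation might add multiprocessing for large batches, since pure-Python threads would not speed up CPU-bound encoding.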
Implement a "token-free" or tokenization-free encoder that works at the Unicode/UTF-8 character level. Examples:
- CANINE, Unicode code points: [HF](https://huggingface.co/docs/transformers/model_doc/canine), [arxiv](https://arxiv.org/abs/2103.06874)
- ByT5, UTF-8 bytes: [HF](https://huggingface.co/google/byt5-small), [arxiv](https://arxiv.org/abs/2105.13626)
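In the simplest ByT5-style form, the token ids are just the UTF-8 bytes themselves, so the vocabulary is fixed at 256 entries and no merge training is needed. A minimal sketch (the real ByT5 additionally offsets ids to reserve special tokens, which is omitted here):

```python
def byte_encode(text: str) -> list[int]:
    # Token ids are the raw UTF-8 bytes: fixed 256-entry "vocab".
    return list(text.encode("utf-8"))

def byte_decode(ids: list[int]) -> str:
    # Invalid byte sequences are replaced rather than raising.
    return bytes(ids).decode("utf-8", errors="replace")
```

The trade-off versus BPE is longer sequences (one token per byte) in exchange for zero out-of-vocabulary issues and no tokenizer to train.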