minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
Background: I developed a rudimentary way to reduce token count for long prompts by concatenating words of a certain length, which has the potential to reduce API token costs by...
@karpathy Thanks for the great lecture and implementation! As always, it was a pleasure. I have tried to implement LlamaTokenizer (without using sentencepiece backend) staying as close to minbpe implementation...
Locally, the file that contains the RegexTokenizer is named "regex.py", and it shadows the third-party regex module. This caused silly errors, and it took me a couple of minutes to figure...
The length of the input ids does not change inside the `merge()` function. Instead of calling `len(ids)` on every iteration of the while loop, storing it in a variable at the...
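A minimal sketch of the suggested micro-optimization, assuming a `merge()` shaped like the one in the lecture (building a new list rather than mutating `ids` in place):

```python
def merge(ids, pair, idx):
    """Replace every consecutive occurrence of pair in ids with the new token idx."""
    newids = []
    i = 0
    n = len(ids)  # cached once: ids is never mutated, so its length is constant
    while i < n:
        # match pair at position i (guarding against running off the end)
        if i < n - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([1, 2, 3, 1, 2], (1, 2), 4))  # → [4, 3, 4]
```

Since `newids` is built up separately and `ids` is only read, hoisting `len(ids)` out of the loop is safe; the behavior is unchanged.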
I've added a link to a Mojo port of minbpe. While this port functionally mirrors minbpe, it has a different design due to the current language constraints of Mojo, which...
In the Colab notebook here: https://colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L, refer to the sentencepiece section. Defining the options dictionary using `dict` and then unpacking it results in `TypeError: 'Dictionary' object is not callable`...
```
>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> str = r"""हहिन्दी विकिपीडिया"""
>>> print(re.findall(gpt2pat, str))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', ...
```
Fixing the problem that all tokenizers have with combining marks: diacritics, Indic matras (vowel signs after consonants), the Indic halant, Arabic, Hebrew, etc. This was probably breaking most...
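A sketch of why the split pattern breaks these scripts: Devanagari matras and the halant are Unicode combining marks (categories Mc/Mn), not letters (Lo), so a `\p{L}+` run in the regex ends right before each mark instead of keeping the syllable together. This can be checked with the standard library alone:

```python
import unicodedata

# Inspect each code point of "हिन्दी" ("Hindi"):
# Lo = letter; Mc/Mn = combining marks, which \p{L} does NOT match,
# so a \p{L}+ run in the split regex terminates before every mark.
for ch in "हिन्दी":
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
```

A commonly suggested remedy (an assumption here, not the repo's chosen fix) is to allow marks to follow letters in the pattern, e.g. `\p{L}\p{M}*` or a combined class like `[\p{L}\p{M}]`, so each base letter carries its combining marks.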
For example, you want to count 2 (not 4) occurrences of the pair 'aa' in text 'aaaaa', because merge() can replace it just 2 times. In other words the counted...
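A sketch of non-overlapping pair counting under that proposal, assuming a `get_stats`-style helper over a list of token ids (the function name is illustrative): when the two elements of a pair are identical, both are consumed before counting continues, so runs like 'aaaaa' yield the number of replacements `merge()` can actually make.

```python
def get_stats_nonoverlapping(ids):
    """Count consecutive pairs, skipping overlaps inside runs of a
    repeated token so counts match what merge() can replace."""
    counts = {}
    i = 0
    while i < len(ids) - 1:
        pair = (ids[i], ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
        # in a run like 'aaaaa', consume both elements of an identical
        # pair so the next candidate starts after it (2 counts, not 4)
        i += 2 if ids[i] == ids[i + 1] else 1
    return counts

print(get_stats_nonoverlapping(list("aaaaa")))  # → {('a', 'a'): 2}
```

Note the trade-off: skipping ahead also hides the pair straddling the run boundary (e.g. ('a', 'b') in 'aab'), which an overlapping count would include; whether that matters depends on how the stats feed merge selection.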