minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
Background: I developed a rudimentary way to reduce token count for long prompts by concatenating words of a certain length, which has the potential to reduce API token costs by...
@karpathy Thanks for the great lecture and implementation! As always, it was a pleasure. I have tried to implement LlamaTokenizer (without using sentencepiece backend) staying as close to minbpe implementation...
Locally, the file that contains the RegexTokenizer is named "regex.py", and it shadows the third-party regex module. This caused silly errors, and it took me a couple of minutes to figure...
The length of the input ids does not change inside the `merge()` function. Instead of calling `len(ids)` on every iteration of the while loop, storing it in a variable at the...
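A minimal sketch of the suggested micro-optimization, assuming a `merge()` shaped like the one in the lecture (building a new list rather than mutating `ids` in place):

```python
def merge(ids, pair, idx):
    """Replace every consecutive occurrence of pair in ids with the new token idx."""
    newids = []
    i = 0
    n = len(ids)  # cached once: ids is never mutated, so its length is constant
    while i < n:
        # match pair at position i (guarding against running off the end)
        if i < n - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([1, 2, 3, 1, 2], (1, 2), 4))  # → [4, 3, 4]
```

Since `newids` is built up separately and `ids` is only read, hoisting `len(ids)` out of the loop is safe; the behavior is unchanged.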
I've added a link to a Mojo port of minbpe. While this port functionally mirrors minbpe, it has a different design due to the current language constraints of Mojo, which...
In the Colab notebook here: https://colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L, refer to the sentencepiece section. Defining the options dictionary using `dict` and then unpacking it results in `TypeError: 'Dictionary' object is not callable`...
```
>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> str = r"""हहिन्दी विकिपीडिया"""
>>> print(re.findall(gpt2pat, str))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', ...
```
Fixing the problem that all tokenizers have with combining marks: diacritics, Indic matras (vowel signs after consonants), the Indic halant, Arabic, Hebrew, etc. This was probably breaking most...
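A sketch of why the split pattern breaks these scripts: Devanagari matras and the halant are Unicode combining marks (categories Mc/Mn), not letters (Lo), so a `\p{L}+` run in the regex ends right before each mark instead of keeping the syllable together. This can be checked with the standard library alone:

```python
import unicodedata

# Inspect each code point of "हिन्दी" ("Hindi"):
# Lo = letter; Mc/Mn = combining marks, which \p{L} does NOT match,
# so a \p{L}+ run in the split regex terminates before every mark.
for ch in "हिन्दी":
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
```

A commonly suggested remedy (an assumption here, not the repo's chosen fix) is to allow marks to follow letters in the pattern, e.g. `\p{L}\p{M}*` or a combined class like `[\p{L}\p{M}]`, so each base letter carries its combining marks.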
For example, you want to count 2 (not 4) occurrences of the pair 'aa' in text 'aaaaa', because merge() can replace it just 2 times. In other words the counted...
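A sketch of non-overlapping pair counting under that proposal, assuming a `get_stats`-style helper over a list of token ids (the function name is illustrative): when the two elements of a pair are identical, both are consumed before counting continues, so runs like 'aaaaa' yield the number of replacements `merge()` can actually make.

```python
def get_stats_nonoverlapping(ids):
    """Count consecutive pairs, skipping overlaps inside runs of a
    repeated token so counts match what merge() can replace."""
    counts = {}
    i = 0
    while i < len(ids) - 1:
        pair = (ids[i], ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
        # in a run like 'aaaaa', consume both elements of an identical
        # pair so the next candidate starts after it (2 counts, not 4)
        i += 2 if ids[i] == ids[i + 1] else 1
    return counts

print(get_stats_nonoverlapping(list("aaaaa")))  # → {('a', 'a'): 2}
```

Note the trade-off: skipping ahead also hides the pair straddling the run boundary (e.g. ('a', 'b') in 'aab'), which an overlapping count would include; whether that matters depends on how the stats feed merge selection.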