Faster BPE
Thanks for the nice repo! 🙂
Not sure if it's welcome here given that the goal of this repo is to be educational, but you call the `get_stats` function, which is computationally heavy, V times, yielding O(|S| · V), where V is the vocab size and |S| is the text size.
Huggingface tokenizers (tokenizers-rs) and other libraries use a live data structure (a max-heap with pointers) that makes it possible to avoid recomputing the stats after each merge. The improved runtime is O(|S| log V).
It is described in Algorithm 2 here: https://aclanthology.org/2023.findings-acl.38.pdf
Both algorithms are written out in the paper; a rough sketch of the heap-based idea follows below.
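Roughly: keep the pair counts in a max-heap (with lazy handling of stale entries) and keep the token sequence as a doubly linked list, so each merge only updates the handful of pairs adjacent to the merge sites instead of recounting everything. Here is a self-contained Python sketch of that idea; it is my own simplification, not the paper's pseudocode or tokenizers-rs's actual code:

```python
import heapq
from collections import defaultdict

def train_bpe_heap(ids, num_merges):
    """Sketch of heap-based BPE training on a list of integer token ids."""
    n = len(ids)
    if n < 2:
        return {}
    # doubly linked list over positions, so a merge is an O(1) splice
    nxt = list(range(1, n)) + [-1]
    prv = [-1] + list(range(n - 1))
    alive = [True] * n

    counts = defaultdict(int)     # pair -> current count
    positions = defaultdict(set)  # pair -> left positions of its occurrences
    for i in range(n - 1):
        p = (ids[i], ids[i + 1])
        counts[p] += 1
        positions[p].add(i)

    # max-heap via negated counts; entries can go stale and are fixed lazily
    heap = [(-c, p) for p, c in counts.items()]
    heapq.heapify(heap)

    def dec(pair, pos):  # an occurrence of `pair` starting at `pos` disappeared
        if pos in positions[pair]:
            positions[pair].discard(pos)
            counts[pair] -= 1

    def inc(pair, pos):  # a new occurrence of `pair` appeared, starting at `pos`
        positions[pair].add(pos)
        counts[pair] += 1
        heapq.heappush(heap, (-counts[pair], pair))

    merges = {}
    new_id = max(ids) + 1  # simplification; a byte-level vocab would start at 256
    for _ in range(num_merges):
        top = None
        while heap:
            c, pair = heapq.heappop(heap)
            if counts[pair] == 0:
                continue  # pair no longer occurs at all, drop the entry
            if -c != counts[pair]:
                heapq.heappush(heap, (-counts[pair], pair))  # stale, refresh
                continue
            top = pair
            break
        if top is None:
            break
        merges[top] = new_id
        for i in sorted(positions[top]):
            j = nxt[i]
            # this occurrence may have been consumed by an overlapping merge
            if j == -1 or not alive[i] or not alive[j] or (ids[i], ids[j]) != top:
                continue
            left, right = prv[i], nxt[j]
            if left != -1:
                dec((ids[left], ids[i]), left)
            if right != -1:
                dec((ids[j], ids[right]), j)
            dec(top, i)
            ids[i] = new_id   # position i now holds the merged token
            alive[j] = False  # position j is spliced out
            nxt[i] = right
            if right != -1:
                prv[right] = i
            if left != -1:
                inc((ids[left], new_id), left)
            if right != -1:
                inc((new_id, ids[right]), i)
        new_id += 1
    return merges
```

For example, `train_bpe_heap([1, 2, 1, 2, 1, 2], 2)` returns `{(1, 2): 3, (3, 3): 4}`. The re-check before each merge (the position must still be alive and must still hold the pair) is what keeps overlapping occurrences in repetitive input from being merged twice.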
Btw, it also has a mini version of BPE, if anyone is interested in code-golfing BPE (it would be cool to see the limits and hacks; I'm quite bad at it).
Thank you for the pointers!
So I think we should retain the nice and clean and minimal and inefficient version as an algorithmic (and unit test) reference, but in addition to that we could have a version that is optimized for speed (e.g. `bpe_regex_fast.py`), and then probably even further lower that, e.g. to C/Rust/other.
So I'm not opposed to optimizations; we'd just keep that functionality separate. I'll either get around to this at some point, or I'd welcome PRs.
One additional model I'm thinking about is what I did with llama2.c, where instead of reviewing tons of PRs for the root repo, it is treated more as a reference and people fork it into their own languages/repos optimized for different purposes, and then I'm happy to link to them from the README here. This path is advisable for 1) larger re-writes, 2) cases where people want to iterate fast themselves, and 3) people who don't want me personally to become a PR bottleneck.
I implemented heap-based token merging for my simplified TinyLlama implementation. Feel free to copy what you need. Here is the function:
https://github.com/99991/SimpleTinyLlama/blob/main/tokenizer.py#L70
This function might be easier to understand after reading the code of the slow implementation first. It can be found below this function in a comment:
https://github.com/99991/SimpleTinyLlama/blob/9af6f7df6e12d8478a90d3cd5c8e8c1a95fce0fe/tokenizer.py#L136
I've actually implemented the above optimizations in Rust: https://github.com/gautierdag/bpeasy/tree/main
It mostly follows what's done in Huggingface's tokenizers library, but in 400 lines, so it might be an easier guide for anyone trying to port the BPE optimizations.
I haven't seen the papers; it seems I've reinvented the linearithmic byte-pair merge, which I've contributed back to tiktoken and JTokkit, see:
- https://github.com/openai/tiktoken/pulls?q=is%3Apr+author%3Apaplorinc
- https://github.com/knuddelsgmbh/jtokkit/pulls?q=+is%3Apr+author%3Apaplorinc
I think I found a small bug in your algorithm. For example, if the input sequence is 1,2,1,2,1,2, then the pair (1,2) is first merged into, let's say, token 3. When merging into token 3, we actually call AddPosition(3,3) twice at two different positions, but it is only possible to substitute the pair (3,3) once. In a later iteration this will throw an error because we try to access a deleted node (depending on the implementation of AddPosition).
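A small illustration of that case (purely schematic; `recorded` below is just a list, not the linked implementation's actual AddPosition bookkeeping):

```python
# Purely schematic, not the linked implementation's actual bookkeeping.
seq = [1, 2, 1, 2, 1, 2]
# Merging every (1, 2) into token 3 turns the sequence into:
seq = [3, 3, 3]
# A naive update records the new pair (3, 3) once per merge site,
# i.e. at left indices 0 and 1, but those two occurrences overlap:
recorded = [0, 1]
# Only one (3, 3) merge is actually possible, so the second recorded
# position has to be re-checked (or never recorded in the first place):
valid = []
taken = -1  # right end of the last accepted merge
for i in recorded:
    if i > taken and seq[i] == 3 and seq[i + 1] == 3:
        valid.append(i)
        taken = i + 1
print(valid)  # [0] -- the occurrence at index 1 overlaps and is dropped
```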
I used your algorithm and implemented it in C++/ctypes to speed up the regex tokenizer by a lot; see https://github.com/NiceGuySaysHi/QuickBPE. Right now it takes ~60 seconds on my 12th Gen Intel(R) Core(TM) i5-12400F @ 2.50 GHz to train on 100 MB of text. Tokenizing 100 MB takes ~20 seconds, but most of that time is spent on the regex splitting; the C++ call itself only takes ~3 seconds.
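For anyone wondering what that kind of split looks like (Python does the regex pre-splitting, the compiled code does the merging), here is a schematic ctypes sketch; the library name and the `encode_chunk` signature are invented for illustration and are not QuickBPE's actual API:

```python
import ctypes

# Hypothetical library name and function signature, purely for illustration;
# the real interface of a compiled BPE backend will differ.
lib = ctypes.CDLL("./libfastbpe.so")
lib.encode_chunk.restype = ctypes.c_int
lib.encode_chunk.argtypes = [
    ctypes.c_char_p,               # UTF-8 bytes of one pre-split chunk
    ctypes.c_int,                  # number of bytes in the chunk
    ctypes.POINTER(ctypes.c_int),  # output buffer for token ids
]

def encode(chunks):
    """Encode byte chunks that were already regex-split on the Python side."""
    out = []
    buf = (ctypes.c_int * 4096)()  # assumes fewer than 4096 tokens per chunk
    for chunk in chunks:
        n = lib.encode_chunk(chunk, len(chunk), buf)
        out.extend(buf[:n])
    return out
```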
You can avoid the regex completely; see my parser for JTokkit: https://github.com/knuddelsgmbh/jtokkit/pull/77/files#diff-4836953d74c2e15ea9fb4c739ec54dfa0709682d04c237e46f7ca9af6630cdaa
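The rough shape of such a hand-written splitter, in Python for illustration; the rules below (a run of letters with an optional leading space, digit runs, whitespace runs, punctuation runs) are a deliberate simplification, not the exact GPT-2/cl100k rules that the linked parser reproduces:

```python
def split_chunks(text):
    """Hand-rolled pre-tokenizer sketch: letters (with an optional leading
    space), digit runs, whitespace runs, and punctuation runs."""
    chunks = []
    i, n = 0, len(text)
    while i < n:
        start = i
        c = text[i]
        if c.isalpha() or (c == " " and i + 1 < n and text[i + 1].isalpha()):
            if c == " ":
                i += 1  # keep a single leading space attached to the word
            while i < n and text[i].isalpha():
                i += 1
        elif c.isdigit():
            while i < n and text[i].isdigit():
                i += 1
        elif c.isspace():
            while i < n and text[i].isspace():
                i += 1
        else:
            while i < n and not (text[i].isalnum() or text[i].isspace()):
                i += 1
        chunks.append(text[start:i])
    return chunks

print(split_chunks("Hello world 123!"))  # ['Hello', ' world', ' ', '123', '!']
```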
Does this give better compression rates? @paplorinc
No, it's just orders of magnitude faster than the implementation here, but it produces the exact same output as tiktoken.
Ah, I see. At some point I had a similar implementation as well. I think I dropped it because it was too memory-intensive. I will take a look at it again; maybe I can improve it.