Faster BPE
Thanks for the nice repo! 🙂
Not sure if this is welcome here, given that the goal of this repo is educational, but: you call the `get_stats` function, which is computationally heavy, V times, yielding O(|S| · V), where V is the vocab size and |S| is the text size.

Hugging Face tokenizers (tokenizers-rs) and other libraries use a live data structure (a max-heap with pointers) that makes it possible to avoid recomputing the stats after each merge. The improved runtime is O(|S| log V).
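For concreteness, here is a minimal sketch of that quadratic training loop. The names `get_stats` and `merge` mirror minbpe's, but this is my own simplified version (training text and vocab size are made up), not the repo's code — the point is just that `get_stats` rescans the whole sequence on every one of the V merge steps:

```python
from collections import Counter

def get_stats(ids):
    # Count every adjacent pair in the token sequence: O(|S|).
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace each left-to-right occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        stats = get_stats(ids)  # recomputed from scratch each iteration
        if not stats:
            break
        pair = max(stats, key=stats.get)  # most frequent pair wins
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges
```

Each iteration costs O(|S|) for the rescan, and there are up to V iterations, hence O(|S| · V) overall.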
It is described in Algorithm 2 here: https://aclanthology.org/2023.findings-acl.38.pdf
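Here is a rough Python sketch of that incremental scheme as I understand Algorithm 2 — a max-heap over pair counts with lazy invalidation, plus a doubly linked list over token slots so a merge only touches its two neighbor pairs. This is my own illustration (not tokenizers-rs code), and the example text is made up:

```python
import heapq
from collections import defaultdict

def train_heap(text, vocab_size):
    tok = list(text.encode("utf-8"))
    n = len(tok)
    nxt = list(range(1, n)) + [-1]   # doubly linked list over token slots
    prv = [-1] + list(range(n - 1))

    count = defaultdict(int)   # pair -> current frequency
    pos = defaultdict(set)     # pair -> left indices of its occurrences
    for i in range(n - 1):
        p = (tok[i], tok[i + 1])
        count[p] += 1
        pos[p].add(i)

    heap = [(-c, p) for p, c in count.items()]
    heapq.heapify(heap)

    def dec(p, i):             # an occurrence of p at slot i disappeared
        count[p] -= 1
        pos[p].discard(i)

    def inc(p, i):             # a new occurrence of p appeared at slot i
        count[p] += 1
        pos[p].add(i)
        heapq.heappush(heap, (-count[p], p))  # fresh heap entry

    merges = {}
    new_id = 256
    while new_id < vocab_size and heap:
        negc, p = heapq.heappop(heap)
        if count[p] != -negc:              # stale entry: lazy deletion
            if count[p] > 0:
                heapq.heappush(heap, (-count[p], p))
            continue
        a, b = p
        for i in sorted(pos[p]):           # left-to-right, like minbpe
            if i not in pos[p]:            # consumed by an overlapping merge
                continue
            j, k = nxt[i], nxt[nxt[i]]
            dec(p, i)                      # retire the merged pair itself
            if prv[i] != -1:
                dec((tok[prv[i]], a), prv[i])   # old left-neighbor pair
            if k != -1:
                dec((b, tok[k]), j)             # old right-neighbor pair
            tok[i] = new_id                # splice: (a, b) -> new_id
            nxt[i] = k
            if k != -1:
                prv[k] = i
            if prv[i] != -1:
                inc((tok[prv[i]], new_id), prv[i])
            if k != -1:
                inc((new_id, tok[k]), i)
        merges[p] = new_id
        new_id += 1
    return merges
```

Each merge step pops the max pair in O(log) time and does constant bookkeeping per occurrence, so the stats are never rebuilt from scratch — giving the O(|S| log V) total the paper describes. (Ties can resolve differently than in the naive version, since the heap orders equal counts by pair value.)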
Here are the two algorithms:
Btw the paper also has a mini version of BPE, if anyone is interested in code-golfing BPE (would be cool to see the limits and the hacks; I'm quite bad at it):