minbpe icon indicating copy to clipboard operation
minbpe copied to clipboard

Faster BPE

Open zouharvi opened this issue 4 months ago • 11 comments

Thanks for the nice repo! 🙂

Not sure if it's welcome here given that the goal of this repo is to be for educational purposes but you call the get_stats function, which is computationally heavy, V times, yielding O(|S| V) where V is the vocab size and |S| is the text size.

Huggingface tokenizers (tokenizers-rs) and other libraries use a live data structure (max heap with pointers) that makes it possible to not recompute the stats after each merge. The improved runtime is O(|S| log V).

It is described in Algorithm 2 here: https://aclanthology.org/2023.findings-acl.38.pdf

Here are the two algorithms: Screenshot_20240217-105550~2 Screenshot_20240217-105614~2

Btw it also has a mini version of BPE if anyone is interested in code-golfing BPE (would be cool to see the limits and hacks, I'm quite bad at it): Screenshot_20240217-105759~2

zouharvi avatar Feb 17 '24 09:02 zouharvi