
Faster BPE

Open zouharvi opened this issue 1 year ago • 11 comments

Thanks for the nice repo! 🙂

Not sure if this is welcome here, given that the goal of this repo is to be educational, but: the training loop calls the get_stats function, which is computationally heavy, once per merge, i.e. V times, yielding O(|S| · V) runtime, where V is the vocab size and |S| is the text size.
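To make the quadratic behavior concrete, here is a simplified sketch in the spirit of minbpe's get_stats/merge (not its exact code): every one of the V merge steps rescans the entire sequence twice, once to count pairs and once to apply the merge.

```python
from collections import Counter

def get_stats(ids):
    """Count every adjacent pair (overlapping occurrences included)."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_naive(ids, num_merges):
    """O(|S| * V): full stats recount on every merge step."""
    merges = {}
    for step in range(num_merges):
        stats = get_stats(ids)          # O(|S|) scan, every iteration
        if not stats:
            break
        pair = max(stats, key=stats.get)
        new_id = 256 + step             # new tokens start after the 256 bytes
        ids = merge(ids, pair, new_id)  # another O(|S|) scan
        merges[pair] = new_id
    return ids, merges
```

Each iteration is O(|S|) and there are V iterations, hence the O(|S| · V) total.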

Huggingface's tokenizers library (tokenizers-rs) and other libraries use a live data structure (a max-heap with pointers into the sequence) that makes it possible to avoid recomputing the stats after each merge. The improved runtime is O(|S| log V).

It is described in Algorithm 2 here: https://aclanthology.org/2023.findings-acl.38.pdf

Here are the two algorithms: [screenshots of the two algorithms from the paper]

Btw, the paper also has a mini version of BPE, in case anyone is interested in code-golfing BPE (would be cool to see the limits and hacks; I'm quite bad at it): [screenshot from the paper]

zouharvi avatar Feb 17 '24 09:02 zouharvi

Thank you for the pointers!

So I think we should retain the nice and clean and minimal and inefficient version as an algorithmic (and unit test) reference, but in addition to that we could have a version that is optimized for speed (e.g. bpe_regex_fast.py), and then probably even further lower that, e.g. to C/Rust/other.

So I'm not opposed to optimizations, we'd just keep that functionality separate. I'll either get around to this at some point, or I'd welcome PRs.

karpathy avatar Feb 17 '24 15:02 karpathy

One additional model I'm thinking about is what I did with llama2.c, where instead of reviewing tons of PRs for the root repo, it is treated more as a reference and people fork it into their own languages/repos optimized for different purposes, and then I'm happy to link to those from the Readme here. This path is advisable 1) for larger re-writes, 2) where people want to iterate fast themselves, and 3) where people don't want me personally to become a PR bottleneck.

karpathy avatar Feb 17 '24 15:02 karpathy

I implemented heap-based token merging for my simplified TinyLlama implementation. Feel free to copy what you need. Here is the function:

https://github.com/99991/SimpleTinyLlama/blob/main/tokenizer.py#L70

This function might be easier to understand after first reading the code of the slow implementation, which can be found in a comment below the function:

https://github.com/99991/SimpleTinyLlama/blob/9af6f7df6e12d8478a90d3cd5c8e8c1a95fce0fe/tokenizer.py#L136

99991 avatar Feb 18 '24 13:02 99991

I've actually implemented the above optimizations in Rust: https://github.com/gautierdag/bpeasy/tree/main

It mostly follows what's done in Huggingface's tokenizers library, but in ~400 lines, so it might be an easier guide for anyone trying to port these BPE optimizations.

gautierdag avatar Feb 19 '24 10:02 gautierdag

I hadn't seen the papers; it seems I've independently reinvented the linearithmic byte pair merge, which I've contributed back to tiktoken and jtokkit, see:

  • https://github.com/openai/tiktoken/pulls?q=is%3Apr+author%3Apaplorinc
  • https://github.com/knuddelsgmbh/jtokkit/pulls?q=+is%3Apr+author%3Apaplorinc

l0rinc avatar Feb 21 '24 13:02 l0rinc

I think I found a small bug in your algorithm. For example, if the input sequence is 1,2,1,2,1,2, then the pair (1,2) is merged first, say into token 3. While merging into token 3, we actually call AddPosition(3,3) twice, at two different positions, but it is only possible to substitute the pair (3,3) once. In a later iteration this will throw an error because we try to access a deleted node (depending on the implementation of AddPosition). [screenshot]
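The failure mode is easy to reproduce with plain minbpe-style merging (a sketch; AddPosition above refers to the screenshot's algorithm, not this code): after (1,2) → 3 the sequence becomes 3,3,3, which contains two overlapping occurrences of (3,3) but admits only one substitution.

```python
def merge(ids, pair, new_id):
    """Replace non-overlapping occurrences of `pair`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

seq = merge([1, 2, 1, 2, 1, 2], (1, 2), 3)   # -> [3, 3, 3]

# A position store that recorded both occurrences of (3, 3) would now
# hold positions 0 and 1, yet only ONE substitution is possible:
positions = [i for i in range(len(seq) - 1) if (seq[i], seq[i + 1]) == (3, 3)]
print(positions)                # [0, 1]: overlapping occurrences
print(merge(seq, (3, 3), 4))    # [4, 3]: only one merge actually happens
```

So an incremental implementation must re-validate each stored position before merging at it, rather than trusting the recorded occurrence count.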

JohannesVod avatar Mar 05 '24 15:03 JohannesVod

I used your algorithm and implemented it in C++/ctypes to speed up the regex tokenizer by a lot, see https://github.com/NiceGuySaysHi/QuickBPE. Right now it takes ~60 seconds on my 12th Gen Intel(R) Core(TM) i5-12400F @ 2.50 GHz to train on 100 MB of text. Tokenizing 100 MB takes ~20 seconds, but most of that time is spent on the regex splitting; the C++ call itself only takes ~3 seconds.

JohannesVod avatar Mar 26 '24 13:03 JohannesVod

You can avoid the regex completely; see my parser for JTokkit: https://github.com/knuddelsgmbh/jtokkit/pull/77/files#diff-4836953d74c2e15ea9fb4c739ec54dfa0709682d04c237e46f7ca9af6630cdaa

l0rinc avatar Mar 26 '24 13:03 l0rinc

Does this give better compression rates? @paplorinc

JohannesVod avatar Mar 26 '24 13:03 JohannesVod

No, it's just orders of magnitude faster than the implementation here, while producing the exact same output as tiktoken.

l0rinc avatar Mar 26 '24 13:03 l0rinc

Ah, I see. At some point I had a similar implementation as well; I think I dropped it because it was too memory-intensive. I'll take another look at it, maybe I can improve it.

JohannesVod avatar Mar 26 '24 15:03 JohannesVod