tiktoken
Uses Regex instead of fancy-regex - 6x speedup
This PR realizes the wish expressed in the current code to use the faster Regex.
Before tokenization, the text is split into pieces according to regular expression patterns. This PR drops the lookahead part of the pattern, which catches whitespace, and handles whitespace in code instead, producing provably identical output.
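As a rough illustration of the idea (not the PR's actual Rust code), the Python sketch below uses simplified stand-in patterns rather than the real o200k_base pattern; the function names and the `ws` group name are hypothetical. It splits text once with the `\s+(?!\S)` lookahead and once with a lookahead-free pattern plus a little scripting: when a whitespace run of two or more characters is followed by non-whitespace, the last whitespace character is handed back so that the next match can start with it.

```python
import re

# Simplified stand-in patterns (assumption): NOT the real o200k_base pattern.
WITH_LOOKAHEAD = re.compile(r" ?\w+| ?[^\s\w]+|\s+(?!\S)|\s+")
# Same alternatives minus the lookahead; the whitespace branch is named
# so the code can tell when it was the branch that matched.
WITHOUT_LOOKAHEAD = re.compile(r" ?\w+| ?[^\s\w]+|(?P<ws>\s+)")


def split_with_lookahead(text):
    """Reference splitter that relies on the lookahead."""
    return WITH_LOOKAHEAD.findall(text)


def split_without_lookahead(text):
    """Lookahead-free splitter; whitespace runs are trimmed in code."""
    pieces = []
    pos = 0
    while pos < len(text):
        # Every character is a word char, whitespace, or something else,
        # so one of the alternatives always matches at pos.
        m = WITHOUT_LOOKAHEAD.match(text, pos)
        end = m.end()
        is_ws_run = m.lastgroup == "ws"
        # A whitespace run of length >= 2 that is followed by more text
        # gives its last character back to the next match.
        if is_ws_run and end - pos > 1 and end < len(text):
            end -= 1
        pieces.append(text[pos:end])
        pos = end
    return pieces


# Both splitters should agree on these hand-picked samples.
for sample in ["a  b", "a \n b", "a\n\nb", "x,  y\t\tz \n", "trailing   ", ""]:
    assert split_with_lookahead(sample) == split_without_lookahead(sample)
```

Python's `re` is a backtracking engine, so this sketch only illustrates the output equivalence on a few inputs; it says nothing about the speedup, which comes from the linear-time Regex engine.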
This makes it possible to use the linear-time Regex instead of fancy-regex (Regex does not support lookahead), which speeds up pattern matching 14x. Since pattern matching currently accounts for 90% of the encoding runtime, encoding overall becomes about 6x faster.
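The overall figure follows from the two numbers above by Amdahl's law: only the 90% spent in pattern matching gets 14x faster, while the remaining 10% is unchanged. A small Python check, where `overall_speedup` is a hypothetical helper name:

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: speed up only `fraction` of the runtime by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# 90% of the runtime is pattern matching, which becomes 14x faster.
print(round(overall_speedup(0.90, 14.0), 2))  # 6.09
```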
Although fancy_regex delegates to Regex when the pattern uses no special features, it was still about 10% slower in testing, so Regex is used directly.
This improvement applies to pattern matching on ordinary text. Special tokens are still caught with fancy_regex.
Tests
For the o200k_base encoding (used by the GPT-4o model):
| Text | Number of tokens | Current Runtime | PR Runtime |
|---|---|---|---|
| wikitext-103 (100 MB) | 22138325 | 18.94s | 4.94s |
| Linux code (100 MB) | 36119543 | 30.28s | 4.59s |
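
The per-row speedup can be read off the table by dividing the two runtime columns. A minimal Python sketch with the table values copied in by hand (the variable names are illustrative):

```python
# (current runtime, PR runtime) in seconds, copied from the table above.
runtimes = {
    "wikitext-103 (100 MB)": (18.94, 4.94),
    "Linux code (100 MB)": (30.28, 4.59),
}

# Speedup = current runtime divided by PR runtime.
speedups = {name: current / pr for name, (current, pr) in runtimes.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
# wikitext-103 (100 MB): 3.83x
# Linux code (100 MB): 6.60x
```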