tiktoken Uses Regex instead of fancy-regex

Uses Regex instead of fancy-regex - 6x speedup

Open Majdoddin opened this issue 1 year ago • 3 comments

This PR fulfills the wish expressed in the current code to use the faster Regex.

Before tokenization, the text is split into pieces according to regular expression patterns. This PR drops the lookahead part of the pattern, which catches whitespace, and handles whitespace in code instead, with provably identical output. Because Regex does not support lookahead, removing it makes it possible to use the linear-time Regex instead of fancy-regex, resulting in a 14x speedup of pattern matching. As pattern matching currently accounts for 90% of the encoding runtime, total encoding becomes about 6x faster.
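The idea can be sketched in Python, whose `re` module supports lookahead and so can serve as a reference. This is only an illustration, not the PR's Rust code: the pattern is a deliberately simplified stand-in for the real tiktoken pattern, and the function names are hypothetical. The lookahead alternative `\s+(?!\S)` is dropped, and a whitespace run that is followed by more text is shortened by one character in code.

```python
import re

# Simplified stand-in for the real pattern (assumption, not tiktoken's pattern).
WITH_LOOKAHEAD = re.compile(r" ?\S+|\s+(?!\S)|\s+")
# Same pattern with the lookahead alternative removed.
NO_LOOKAHEAD = re.compile(r" ?\S+|(?P<ws>\s+)")


def split_with_lookahead(text):
    """Reference splitter that relies on the lookahead alternative."""
    return [m.group() for m in WITH_LOOKAHEAD.finditer(text)]


def split_without_lookahead(text):
    """Lookahead-free splitter: whitespace runs are adjusted in code."""
    pieces = []
    pos = 0
    while pos < len(text):
        m = NO_LOOKAHEAD.match(text, pos)
        if m is None:  # cannot happen: every character is \S or \s
            break
        end = m.end()
        # A greedy whitespace run that stops before the end of the text is
        # necessarily followed by a non-whitespace character. If the run is
        # longer than one character, give its last character back so the
        # next match can start there, as `\s+(?!\S)` would have done.
        if m.group("ws") is not None and end - pos > 1 and end < len(text):
            end -= 1
        pieces.append(text[pos:end])
        pos = end
    return pieces


print(split_with_lookahead("a  b"))     # ['a', ' ', ' b']
print(split_without_lookahead("a  b"))  # ['a', ' ', ' b']
```

Both splitters agree on this input; the second one needs no lookahead and could therefore run on a linear-time engine.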

Although fancy_regex delegates to Regex when the pattern has no special features, it is still some 10% slower in tests, so we use Regex directly. This improvement applies to pattern matching of ordinary text. Special tokens are still caught with fancy_regex.

Tests

For encoding o200k_base (used by the GPT-4o model):

Text	Number of tokens	Current Runtime	PR Runtime
wikitext-103 (100 MB)	22138325	18.94s	4.94s
Linux code (100 MB)	36119543	30.28s	4.59s
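As a quick sanity check on these numbers, the per-dataset speedup can be computed from the table and compared with an Amdahl's-law estimate built from the 90% and 14x figures in the description. The sketch below uses only numbers stated in this PR; the variable and function names are illustrative.

```python
# Runtimes in seconds, copied from the table: (current, PR).
runs = {
    "wikitext-103": (18.94, 4.94),
    "Linux code": (30.28, 4.59),
}

# End-to-end speedup per dataset.
speedups = {name: before / after for name, (before, after) in runs.items()}


def amdahl(fraction, factor):
    """Overall speedup when `fraction` of the runtime gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# 90% of the runtime is pattern matching, which becomes 14x faster.
estimate = amdahl(0.90, 14.0)

for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")
print(f"Amdahl estimate: {estimate:.1f}x")
```

On these figures, the Linux code sample (about 6.6x) is close to the estimate of about 6.1x, while wikitext-103 comes out nearer 3.8x.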

Aug 05 '24 09:08 Majdoddin
