tiktoken icon indicating copy to clipboard operation
tiktoken copied to clipboard

Add possessive quantifiers to avoid catastrophic backtracking

Open l0rinc opened this issue 1 year ago • 0 comments
trafficstars

Fixes the crash in https://github.com/openai/tiktoken/issues/245 by prohibiting the regex engine from backtracking catastrophically via possessive quantifiers.

image

Interestingly these possesives make the encoding a lot faster again in fancy-regex.

Before this change (but with large byte pair merge PR cherry-picked):

num_threads: 1, num_bytes: 98379553
tiktoken 	11,946,036 bytes / s
tiktoken 	11,961,343 bytes / s
tiktoken 	11,995,846 bytes / s
tiktoken 	11,951,263 bytes / s
tiktoken 	11,983,405 bytes / s

Same, with these changes applied:

num_threads: 1, num_bytes: 98379553
tiktoken 	14,511,827 bytes / s
tiktoken 	14,638,134 bytes / s
tiktoken 	14,644,029 bytes / s
tiktoken 	14,729,030 bytes / s
tiktoken 	14,666,903 bytes / s

Updating the regex libs makes it a tiny bit faster still:

num_threads: 1, num_bytes: 98379553
tiktoken 	14,485,590 bytes / s
tiktoken 	14,854,049 bytes / s
tiktoken 	14,891,086 bytes / s
tiktoken 	14,843,007 bytes / s
tiktoken 	14,874,520 bytes / s

This is almost 2x faster than before any of the optimizations.


Opened an issue for increasing the default backtrack limit, see: https://github.com/fancy-regex/fancy-regex/issues/134, but it shouldn't be necessary here anymore.

l0rinc avatar Feb 11 '24 22:02 l0rinc