tiktoken
Extremely long text results in a PanicException, which is hard to catch in Python code
For some extremely long sequences, the tokenizer can raise a PanicException. Example:
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")
text = "^" * 1000000
tokenizer.encode(text) # this throws a PanicException
The issue is that PanicException is not caught by `except Exception`; it can only be caught by catching BaseException, which is too broad.
Would it be possible to raise a better exception for such a scenario (maybe something similar to what was done here?)
The workaround that I currently have is catching BaseException and checking for "PanicException" in the exception message. I'm not sure whether this is the best way to do it. Would be grateful for any guidance :)
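For reference, a minimal sketch of that workaround. The names `TokenizerPanic` and `encode_or_raise` are hypothetical and not part of tiktoken. It assumes the panic surfaces as a BaseException subclass whose class name is "PanicException", so it checks the class name rather than the message text, and it re-raises everything else (KeyboardInterrupt, SystemExit, ordinary errors) untouched.

```python
class TokenizerPanic(RuntimeError):
    """Hypothetical wrapper for a panic coming from the Rust tokenizer core."""


def encode_or_raise(encode_fn, text):
    """Call encode_fn(text), converting a Rust panic into TokenizerPanic.

    encode_fn is any callable taking a string, e.g. tokenizer.encode.
    """
    try:
        return encode_fn(text)
    except BaseException as exc:
        # Assumption: the panic is a BaseException subclass named
        # "PanicException". Anything else is re-raised unchanged so that
        # KeyboardInterrupt, SystemExit and normal errors keep their meaning.
        if type(exc).__name__ != "PanicException":
            raise
        raise TokenizerPanic(str(exc)) from exc


# Intended usage (requires tiktoken, so it is left as a comment):
#   tokenizer = tiktoken.get_encoding("cl100k_base")
#   try:
#       tokens = encode_or_raise(tokenizer.encode, text)
#   except TokenizerPanic:
#       tokens = None  # skip or truncate the offending document
```

Callers can then write `except TokenizerPanic` instead of a bare `except BaseException`, which keeps the broad catch confined to one place.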
(Realised I hadn't replied)
Thanks for opening this! Yeah, it definitely makes sense to expose the exception; thanks for linking the polars change. (And preventing an exception here would be hard; I don't want to get into the game of predicting regex stack overflows.)
I've got a related example, for what it's worth, and it really does occur in the RedPajama v1 dataset.
In 0.5.1 with enc = tiktoken.encoding_for_model("gpt-4"), calling enc.encode_ordinary results in something like:
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)', src/lib.rs:213:29
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
PanicException: called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
File <command-3972960916005538>, line 1
----> 1 enc.encode_ordinary(foo[0].text)
File /databricks/python/lib/python3.10/site-packages/tiktoken/core.py:69, in Encoding.encode_ordinary(self, text)
60 """Encodes a string into tokens, ignoring special tokens.
61
62 This is equivalent to `encode(text, disallowed_special=())` (but slightly faster).
(...)
66 [31373, 995]
67 """
68 try:
---> 69 return self._core_bpe.encode_ordinary(text)
70 except UnicodeEncodeError:
71 # See comment in encode
72 text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
PanicException: called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)
The input looks like this - I've truncated for brevity:
Les Mots d’Hiver 2023 : le programme complet !
Notre rendez-vous du mois de février se déroulera du 30 janvier au 18 février 2023. Comme chaque année, vous
...
esopace culturel de l\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\...
And it just ends in something like a million backslashes. Who knows. I don't think it's the length per se, as longer inputs work fine. I suspect it's the repeated characters, possibly treated as special characters by a regex (?), that overwhelm the regex.
I'm not sure how you want to deal with bad input, other than perhaps exposing a better exception. I can work around this manually in my parsing.
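As a rough illustration of that kind of manual screening, here is a sketch that measures the longest run of a single repeated character so that suspicious documents can be skipped or truncated before encoding. The helper names and the `max_run` threshold are hypothetical; the run length that actually triggers the stack overflow is not known from this thread, and repeated characters are only the suspected cause.

```python
from itertools import groupby


def longest_run(text):
    """Return the length of the longest run of one repeated character."""
    # groupby yields consecutive groups of equal characters; count each group.
    return max((sum(1 for _ in group) for _, group in groupby(text)), default=0)


def has_long_run(text, max_run=10_000):
    """Flag text whose longest repeated-character run exceeds max_run.

    max_run is an arbitrary placeholder, not a limit documented by tiktoken.
    """
    return longest_run(text) > max_run
```

A document for which `has_long_run` returns True could be dropped, or have the run collapsed, before it reaches the tokenizer.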