Extremely long text results in a PanicException, which is hard to catch in Python code

Open codedecde opened this issue 2 years ago • 3 comments
For some extremely long sequences, the tokenizer can raise a PanicException. Example:

import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")
text = "^" * 1000000

tokenizer.encode(text)  # this throws a PanicException

The issue is that PanicException is not caught even when catching Exception. It can only be caught by catching BaseException, which is too broad.

Would it be possible to raise a better exception for such a scenario (maybe something similar to what was done in the linked polars change)?

The workaround I currently have is catching BaseException and checking for "PanicException" in the exception message. I'm not sure this is the best way to do it, and would be grateful for any guidance :)
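For reference, a minimal sketch of that kind of workaround. It checks the exception's type name rather than its message, which is a small variation on what is described above. The names call_guarding_panics and TokenizerPanicError are hypothetical, and the PanicException class defined here is only a stand-in so the snippet runs without tiktoken installed. The real exception is raised by the Rust extension through PyO3 and also derives from BaseException.

```python
class PanicException(BaseException):
    """Stand-in for the PyO3 panic exception (an assumption for this demo)."""


class TokenizerPanicError(RuntimeError):
    """Hypothetical ordinary exception that callers can catch normally."""


def call_guarding_panics(fn, *args, **kwargs):
    """Call fn and convert a PanicException into TokenizerPanicError.

    Every other BaseException (KeyboardInterrupt, SystemExit, ...) is
    re-raised untouched, so the broad except clause does not swallow them.
    """
    try:
        return fn(*args, **kwargs)
    except BaseException as exc:
        # Match on the type name so the Rust-side class need not be imported.
        if type(exc).__name__ == "PanicException":
            raise TokenizerPanicError(str(exc)) from exc
        raise


def fake_encode(text):
    """Pretend tokenizer that panics on very long input (demo only)."""
    if len(text) >= 1_000_000:
        raise PanicException("RuntimeError(StackOverflow)")
    return [len(text)]


try:
    call_guarding_panics(fake_encode, "^" * 1_000_000)
except TokenizerPanicError as err:
    print(f"tokenizer panicked: {err}")
```

With tiktoken itself, the call would presumably be call_guarding_panics(tokenizer.encode, text).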

codedecde avatar Dec 21 '22 22:12 codedecde

(Realised I hadn't replied)

Thanks for opening this! Yeah, it definitely makes sense to expose the exception; thanks for linking the polars change. (Preventing the exception here would be hard; I don't want to get into the game of predicting regex stack overflows.)

hauntsaninja avatar Feb 13 '23 20:02 hauntsaninja

I've got a related example, for what it's worth, and it really does occur in the RedPajama v1 dataset.

In tiktoken 0.5.1, with enc = tiktoken.encoding_for_model("gpt-4"), calling enc.encode_ordinary results in something like:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)', src/lib.rs:213:29
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
PanicException: called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
File <command-3972960916005538>, line 1
----> 1 enc.encode_ordinary(foo[0].text)

File /databricks/python/lib/python3.10/site-packages/tiktoken/core.py:69, in Encoding.encode_ordinary(self, text)
     60 """Encodes a string into tokens, ignoring special tokens.
     61 
     62 This is equivalent to `encode(text, disallowed_special=())` (but slightly faster).
   (...)
     66 [31373, 995]
     67 """
     68 try:
---> 69     return self._core_bpe.encode_ordinary(text)
     70 except UnicodeEncodeError:
     71     # See comment in encode
     72     text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")

PanicException: called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)

The input looks like this - I've truncated for brevity:

Les Mots d’Hiver 2023 : le programme complet !
Notre rendez-vous du mois de février se déroulera du 30 janvier au 18 février 2023. Comme chaque année, vous 
...
esopace culturel de l\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\...

It just ends in something like a million backslashes. Who knows why. I don't think it's the length per se, as longer inputs work fine. I suspect the run of repeated characters, possibly treated as special characters by a regex (?), just overwhelms the regex.

I'm not sure how you want to deal with bad input, other than perhaps exposing a better exception. I can work around this manually in my parsing.
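One way such a manual workaround might look, as a sketch only: cap any run of a single repeated character before handing the text to the tokenizer. The function name cap_repeated_runs and the default limit of 1000 are hypothetical, and whether a particular limit actually avoids the stack overflow is untested here. This also changes the text, so the resulting tokens will differ from those of the original input.

```python
from itertools import groupby


def cap_repeated_runs(text: str, max_run: int = 1000) -> str:
    """Shorten every run of one repeated character to at most max_run."""
    pieces = []
    # groupby yields each maximal run of identical consecutive characters.
    for char, group in groupby(text):
        run_length = sum(1 for _ in group)
        pieces.append(char * min(run_length, max_run))
    return "".join(pieces)


# A document that ends in a very long run of backslashes.
document = "espace culturel de l" + "\\" * 1_000_000
cleaned = cap_repeated_runs(document)
print(len(document), len(cleaned))
```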

srowen avatar Dec 16 '23 21:12 srowen