tiktoken-php

Performance degrades to quadratic for some input strings

Open · mikethea1 opened this issue 8 months ago · 3 comments

For example, a long input string consisting entirely of letters takes a very long time to encode, because it ends up as one giant mergeBytePairs() call:

$input = implode('', array_map(fn() => 'abcdefghijklmnopqrstuvwxyz'[mt_rand(0, 25)], range(1, 100_000)));
(new \Yethee\Tiktoken\EncoderProvider())->getForModel('gpt-4o-mini')->encode($input); // takes > 5 minutes

I'm curious what other BPE implementations do here. Perhaps there is a way to handle long inputs to mergeBytePairs with a different algorithm that scales better?
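
For reference, one technique other implementations use to avoid the quadratic rescan is a doubly linked list over the symbols plus a min-heap of candidate merges, which brings the merge loop down to O(n log n). Below is a rough PHP sketch of that idea — not the library's actual code — assuming a $ranks array that maps a concatenated byte pair to its BPE merge rank:

final class MergeHeap extends \SplHeap
{
    protected function compare($a, $b): int
    {
        // SplHeap pops the "largest" element first; invert the comparison
        // so extract() yields the lowest (rank, position) first.
        return $b <=> $a;
    }
}

/**
 * @param array<string, int> $ranks merge ranks keyed by the concatenated pair
 * @return string[] merged symbols, in order
 */
function mergeBytePairsFast(string $piece, array $ranks): array
{
    $n = strlen($piece);
    if ($n < 2) {
        return $n === 0 ? [] : [$piece];
    }

    // Doubly linked list over live symbols; a merge absorbs the right node.
    $sym   = str_split($piece);
    $prev  = range(-1, $n - 2);
    $next  = range(1, $n);
    $alive = array_fill(0, $n, true);

    $heap = new MergeHeap();
    $push = function (int $i) use (&$sym, &$next, $ranks, $heap, $n): void {
        $j = $next[$i];
        if ($j < $n && isset($ranks[$sym[$i] . $sym[$j]])) {
            // Remember the left symbol so stale entries can be skipped later.
            $heap->insert([$ranks[$sym[$i] . $sym[$j]], $i, $sym[$i]]);
        }
    };

    for ($i = 0; $i < $n - 1; $i++) {
        $push($i);
    }

    while (!$heap->isEmpty()) {
        [$rank, $i, $leftAtInsert] = $heap->extract();
        $j = $next[$i];
        // Skip entries invalidated by earlier merges (lazy deletion).
        if (!$alive[$i] || $j >= $n || $sym[$i] !== $leftAtInsert
            || ($ranks[$sym[$i] . $sym[$j]] ?? -1) !== $rank) {
            continue;
        }
        // Merge node $j into node $i, then re-examine the adjacent pairs.
        $sym[$i] .= $sym[$j];
        $alive[$j] = false;
        $k = $next[$j];
        $next[$i] = $k;
        if ($k < $n) {
            $prev[$k] = $i;
        }
        if ($prev[$i] >= 0) {
            $push($prev[$i]);
        }
        $push($i);
    }

    $out = [];
    for ($i = 0; $i < $n; $i = $next[$i]) {
        $out[] = $sym[$i];
    }

    return $out;
}

Each merge does O(log n) heap work, and stale heap entries are discarded when popped, so a 100k-character piece no longer triggers repeated full scans.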

One easy (but perhaps undesirable) solution is to limit the length of inputs to mergeBytePairs: if $piece is longer than n characters, break it into smaller chunks and merge each one separately. This will be fast, but it might not find the optimal encoding (it might report more tokens than strictly required). For large enough chunk sizes, though, it should be close to optimal. Maybe this could be an optional argument to the encode method: ->encode($text, exact: false)?
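
For illustration, the idea can be approximated in user land today (encodeChunked() and the chunk size are hypothetical, not part of tiktoken-php's API):

use Yethee\Tiktoken\EncoderProvider;

/**
 * Encodes $text in fixed-size chunks so no single piece grows unbounded.
 * May report slightly more tokens than an exact encode would.
 */
function encodeChunked(string $text, int $chunkSize = 4096): array
{
    $encoder = (new EncoderProvider())->getForModel('gpt-4o-mini');
    $tokens = [];
    // NB: str_split() splits on byte boundaries; real code would have to
    // avoid cutting a multibyte UTF-8 character in half.
    foreach (str_split($text, $chunkSize) as $chunk) {
        $tokens = [...$tokens, ...$encoder->encode($chunk)];
    }

    return $tokens;
}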

mikethea1 avatar Mar 20 '25 14:03 mikethea1

First, we split an input string into pieces using a regular expression (you can see the patterns here). Then the mergeBytePairs() method is called per piece. As I understand it, the performance issue occurs when a single piece is very long.
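
For illustration, here is a simplified version of that pre-tokenization step (this is not the library's exact pattern, which is considerably more involved):

$pattern = "/'(?:s|t|re|ve|m|ll|d)|\s?\w+|\s+|[^\s\w]+/u";
preg_match_all($pattern, "Hello, world! It's fine.", $matches);
// $matches[0] => ["Hello", ",", " world", "!", " It", "'s", " fine", "."]
// Ordinary text breaks into short pieces at word boundaries, whereas a
// 100k-letter string with no whitespace or punctuation stays one piece.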

In what cases does it make sense to tokenize a long string that cannot be split into pieces?

yethee avatar Mar 20 '25 17:03 yethee

> As I understand it, the performance issue occurs when a single piece is very long.

Exactly.

> In what cases does it make sense to tokenize a long string that cannot be split into pieces?

A couple of examples:

  • Someone could easily craft malicious input to attack a site that uses tiktoken to manage context length, e.g. a file summarizer that checks whether a file's contents exceed the token limit before submitting them to an LLM (a minimal version of such a guard is sketched after this list). You can craft a file of very reasonable size in bytes that can't be handled efficiently.
  • When extracting text from a PDF, whitespace is sometimes lost, which can leave huge runs of letters clumped together. LLMs can read such content without difficulty, but tiktoken would choke on it.
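
For concreteness, a minimal sketch of the kind of guard described in the first bullet (fitsTokenBudget() and the limit are illustrative, not from any real codebase):

use Yethee\Tiktoken\EncoderProvider;

function fitsTokenBudget(string $fileContents, int $maxTokens = 128_000): bool
{
    $encoder = (new EncoderProvider())->getForModel('gpt-4o-mini');

    // With the current quadratic merge, a crafted file of ordinary byte
    // size can make this check itself run for minutes.
    return count($encoder->encode($fileContents)) <= $maxTokens;
}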

mikethea1 avatar Mar 20 '25 22:03 mikethea1

The 0.9.0 release adds support for the Rust library via FFI bindings. Could you try the lib encoder (see #27 for details) and check whether it solves the performance issue in your case?

yethee avatar Mar 31 '25 11:03 yethee