tiktoken-php
Performance degrades to quadratic for some input strings
For example, a long input string that is all letters takes a very long time to encode because it ends up as one giant mergeBytePairs call; the merge loop rescans the piece for every merge it performs, so an n-byte piece costs on the order of n² work:
```php
$input = implode('', array_map(fn() => 'abcdefghijklmnopqrstuvwxyz'[mt_rand(0, 25)], range(1, 100_000)));
(new \Yethee\Tiktoken\EncoderProvider())->getForModel('gpt-4o-mini')->encode($input); // takes > 5 minutes
```
I'm curious what other BPE implementations do here. Perhaps there is a way to handle long inputs to mergeBytePairs with a different algorithm that scales better?
One easy (but perhaps undesirable) solution is to limit the length of inputs to mergeBytePairs. For example, if $piece is longer than n characters, we break it into smaller chunks and merge each one. This will be fast but might not find the optimal encoding (it might report more tokens than strictly required). However, for large enough chunk sizes it should be close to optimal. Maybe this could be an optional argument to the encode method, e.g. ->encode($text, exact: false)? A rough sketch of the idea follows.
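Here is what that could look like from the outside. encodeChunked() and the default chunk size are made up for illustration; a real implementation would presumably chunk $piece inside mergeBytePairs rather than going back through the public encode():

```php
<?php

use Yethee\Tiktoken\EncoderProvider;

// Hypothetical helper, not part of the library: encode a long string in
// fixed-size chunks so that no single merge pass sees a huge piece.
// Merges cannot cross chunk boundaries, so the result may contain a few
// more tokens than the optimal encoding would.
function encodeChunked(string $text, int $chunkSize = 1_000): array
{
    $encoder = (new EncoderProvider())->getForModel('gpt-4o-mini');

    $tokens = [];
    // str_split() chunks by bytes, which is fine for the all-ASCII worst
    // case above; a real implementation would need to avoid splitting
    // multi-byte UTF-8 sequences.
    foreach (str_split($text, $chunkSize) as $chunk) {
        $tokens = array_merge($tokens, $encoder->encode($chunk));
    }

    return $tokens;
}
```

With a chunk size of 1,000, the merge work per chunk is bounded, so the quadratic cost applies only within each small chunk rather than across the whole 100,000-character input.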
First, we split an input string into pieces using a regular expression (you can see the patterns here). Then the mergeBytePairs() method is called per piece.
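To illustrate, here is a heavily simplified stand-in for the real pattern (the actual expressions, linked above, are considerably more involved, but the shape is the same):

```php
<?php

// Simplified pre-tokenization pattern, for illustration only.
$pattern = "/'(?:s|t|re|ve|m|ll|d)| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alnum:]]+|\s+/u";

preg_match_all($pattern, 'Hello world!', $matches);
// $matches[0] is ['Hello', ' world', '!'] -- three short pieces, each
// handled by its own mergeBytePairs() call.

preg_match_all($pattern, str_repeat('a', 100_000), $matches);
// $matches[0] is a single 100,000-character piece, so the entire string
// reaches mergeBytePairs() in one call.
```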
As I understand it, the performance issue occurs when a single piece is very long.
In what cases does it make sense to tokenize a long string that cannot be split into pieces?
> As I understand it, the performance issue occurs when a single piece is very long.
Exactly
> In what cases does it make sense to tokenize a long string that cannot be split into pieces?
A couple examples:
- Someone could easily craft malicious input to attack a site that uses tiktoken to manage context length (e.g. imagine a file summarizer that checks whether a file's contents exceed the token limit before submitting to an LLM). A file of very reasonable byte size can be constructed that cannot be tokenized efficiently.
- When extracting text from a PDF, whitespace is sometimes lost, which can leave long runs of letters clumped together. LLMs can read such content without difficulty, but tiktoken would choke on it.
The 0.9.0 release adds support for the Rust library via FFI bindings. Can you try the lib encoder (see #27 for details) and check whether it solves the performance issue in your case?