tiktoken
tiktoken copied to clipboard
Replace js-tiktoken BPE merge algorithm with faster heap based algorithm
There are several open issues noting that in the worst case BPE merge algorithm in js-tiktoken takes quadratic time in the number of input characters for certain pathological inputs.
This PR fixes this problem using a heap to avoid recalculating the ranks of all tokens at each character. This technique should also work for the rust/wasm tokenizer but it seems less important in those cases since the native parsers are already pretty fast.
I also added a new test fixture and an example string which causes pathological behavior.
Related issues:
- https://github.com/dqbd/tiktoken/issues/99
- https://github.com/openai/tiktoken/issues/195
Note: This should be a mild CVE since an attacker may use this behavior to cause a denial of service against services that check user input with js-tiktoken.