tiktoken icon indicating copy to clipboard operation
tiktoken copied to clipboard

Replace js-tiktoken BPE merge algorithm with faster heap based algorithm

Open mikolalysenko opened this issue 10 months ago • 3 comments

There are several open issues noting that in the worst case BPE merge algorithm in js-tiktoken takes quadratic time in the number of input characters for certain pathological inputs.

This PR fixes this problem using a heap to avoid recalculating the ranks of all tokens at each character. This technique should also work for the rust/wasm tokenizer but it seems less important in those cases since the native parsers are already pretty fast.

I also added a new test fixture and an example string which causes pathological behavior.

Related issues:

  • https://github.com/dqbd/tiktoken/issues/99
  • https://github.com/openai/tiktoken/issues/195

Note: This should be a mild CVE since an attacker may use this behavior to cause a denial of service against services that check user input with js-tiktoken.

mikolalysenko avatar Apr 19 '24 21:04 mikolalysenko