
Cache for Encoding - Runtime Boosted by 12%

Majdoddin opened this pull request 1 year ago · 0 comments

This PR introduces a caching mechanism in _encode_ordinary_native(), which stores the tokens for each "piece" of text. When a piece of text is repeated, its tokens are retrieved from the cache instead of being tokenized again.
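The idea can be sketched as follows (a minimal illustration with assumed names, not tiktoken's actual implementation: `byte_pair_encode` here is a placeholder for the real BPE merge loop, and `encode_with_cache` stands in for the loop inside `_encode_ordinary_native()`). Each regex "piece" is looked up in a `HashMap`; only on a miss is the BPE step run:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for tiktoken's BPE merge step; the real code
// repeatedly merges the highest-ranked byte pair in the piece.
fn byte_pair_encode(piece: &[u8]) -> Vec<u32> {
    // Placeholder: emit one token per byte.
    piece.iter().map(|&b| b as u32).collect()
}

// Encode a text already split into pieces, memoizing tokens per distinct piece.
// Returns the token stream and the number of distinct pieces cached.
fn encode_with_cache(pieces: &[&[u8]]) -> (Vec<u32>, usize) {
    let mut cache: HashMap<Vec<u8>, Vec<u32>> = HashMap::new();
    let mut out = Vec::new();
    for &piece in pieces {
        // Cache hit: reuse the stored tokens instead of re-running BPE.
        let toks = cache
            .entry(piece.to_vec())
            .or_insert_with(|| byte_pair_encode(piece));
        out.extend_from_slice(toks);
    }
    (out, cache.len())
}

fn main() {
    // Repeated pieces are tokenized once and then served from the cache.
    let pieces: Vec<&[u8]> = vec![&b"the"[..], &b" cat"[..], &b"the"[..], &b" cat"[..]];
    let (tokens, cached) = encode_with_cache(&pieces);
    println!("{} tokens from {} distinct cached pieces", tokens.len(), cached);
}
```

Because source code repeats the same identifiers, keywords, and whitespace runs constantly, the distinct-piece set stays tiny relative to the total piece count, which is why the hit ratio below is so high.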

This yields a speedup of over 12% (runtime dropped from 20.21 s to 17.96 s on a single CPU core) when encoding 100 MB of Linux source code as a single text.

The cache hit ratio is very high, approximately 95%, and the final cache size is only about 0.5% of the total number of pieces (218,450 cached entries vs. 39,769,721 pieces).

TODO:

  • Despite the 95% cache hit ratio, the expected runtime gain was not fully realized: about 80% of the loop runtime in the current code is spent splitting the text with the regex. While this PR makes the tokenization step itself 65% faster, the big remaining gain lies in optimizing the text splitting, possibly through multithreading.
  • Investigate declaring the cache in the struct CoreBPE so that it can be utilized across subsequent calls.
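The second TODO item could look roughly like this (a sketch with assumed type and field names, not tiktoken's actual `CoreBPE` API). Since the encoder is typically called through shared references, the cache needs interior mutability, e.g. a `Mutex` (a concurrent map would avoid the lock contention in real multithreaded use):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-in for the BPE merge step.
fn byte_pair_encode(piece: &[u8]) -> Vec<u32> {
    piece.iter().map(|&b| b as u32).collect()
}

// Simplified stand-in for CoreBPE carrying a persistent per-piece cache,
// so cached tokens survive across subsequent encode calls.
struct CoreBpeWithCache {
    // Mutex provides interior mutability: &self methods can still update it.
    piece_cache: Mutex<HashMap<Vec<u8>, Vec<u32>>>,
}

impl CoreBpeWithCache {
    fn new() -> Self {
        Self { piece_cache: Mutex::new(HashMap::new()) }
    }

    // Tokens for one piece, reusing results cached by earlier calls.
    fn encode_piece(&self, piece: &[u8]) -> Vec<u32> {
        let mut cache = self.piece_cache.lock().unwrap();
        cache
            .entry(piece.to_vec())
            .or_insert_with(|| byte_pair_encode(piece))
            .clone()
    }
}

fn main() {
    let bpe = CoreBpeWithCache::new();
    let first = bpe.encode_piece(b"hello");  // miss: computed and stored
    let second = bpe.encode_piece(b"hello"); // hit: served from the cache
    assert_eq!(first, second);
    println!("cache size: {}", bpe.piece_cache.lock().unwrap().len());
}
```

A persistent cache would also need an eviction or size-cap policy, since the per-call cache in this PR is bounded by a single text's distinct pieces while a struct-level one grows for the lifetime of the encoder.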

Majdoddin · Jul 10 '24 10:07