
Add support for fine-tuned models in encoding_for_model

Open thespino opened this issue 1 year ago • 2 comments

Issue

When trying to call encoding_for_model providing a fine-tuned model as input, the following error occurs:

KeyError: 'Could not automatically map davinci:ft-personal:finetunedmodel-2023-05-23-20-00-00 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

Analysis

See https://platform.openai.com/docs/models/model-endpoint-compatibility
See https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

The following models are allowed for fine-tuning:

  • davinci
  • curie
  • babbage
  • ada

All of them use the encoding r50k_base.

Fine-tuned model names always follow the format model:ft-personal:name:date, where

  • model is the base model from which the fine-tuned one was created
  • ft-personal is a fixed string indicating that the model is fine-tuned
  • name is a custom name the user can give to the new model
  • date is the fine-tuning date in the format yyyy-MM-dd-hh-mm-ss
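Given that format, the base model can be recovered by splitting on the first `:`. A minimal sketch (the helper name `base_model_of` and the example model name are hypothetical, not part of tiktoken):

```python
def base_model_of(model_name: str) -> str:
    """Return the base model for a (possibly fine-tuned) model name.

    Fine-tuned names look like 'model:ft-personal:name:date', so the
    base model is everything before the first ':'. Plain base-model
    names contain no ':' and are returned unchanged.
    """
    return model_name.split(":", 1)[0]

print(base_model_of("davinci:ft-personal:mymodel:2023-05-23-20-00-00"))  # davinci
print(base_model_of("curie"))  # curie
```

The result could then be passed to `tiktoken.encoding_for_model`, which already knows how to map the base models.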

Solutions

Map the model prefixes in MODEL_PREFIX_TO_ENCODING so that, when encoding_for_model calls model_name.startswith, it also matches models starting with "davinci", "ada", etc., and therefore recognizes fine-tuned models.
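A self-contained sketch of the proposed prefix lookup (the dict below only mirrors the shape of tiktoken's internal MODEL_PREFIX_TO_ENCODING table; it is illustrative, not tiktoken's actual code, and real tiktoken maps many more prefixes):

```python
# Illustrative prefix -> encoding table for the four fine-tunable base models.
MODEL_PREFIX_TO_ENCODING = {
    "davinci": "r50k_base",
    "curie": "r50k_base",
    "babbage": "r50k_base",
    "ada": "r50k_base",
}

def encoding_name_for_model(model_name: str) -> str:
    """Resolve an encoding name via prefix matching, so that
    'davinci:ft-personal:...' matches the 'davinci' entry."""
    for prefix, encoding_name in MODEL_PREFIX_TO_ENCODING.items():
        if model_name.startswith(prefix):
            return encoding_name
    raise KeyError(
        f"Could not automatically map {model_name} to a tokeniser."
    )

print(encoding_name_for_model("davinci:ft-personal:mymodel:2023-05-23-20-00-00"))  # r50k_base
```

With a table like this in place, encoding_for_model would fall through to the prefix scan for any fine-tuned name instead of raising a KeyError.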

thespino avatar May 23 '23 18:05 thespino

Thanks for opening this PR @thespino – I've also been running into this issue and am eager to have this released

cc @hauntsaninja

byrnehollander avatar May 25 '23 17:05 byrnehollander

Rebased & synced with main branch

thespino avatar Jun 01 '23 21:06 thespino