tiktoken icon indicating copy to clipboard operation
tiktoken copied to clipboard

Add im_start / im_end in cl100k_base

Open spolu opened this issue 1 year ago • 4 comments

Not really directly useful given the Chat API...

But triangulating:

  • https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb
  • https://github.com/openai/tiktoken/commit/ec7c121e385bf1675312c6c33734de6b392890c4#diff-0d973848bd229418209db2c46c86167000845592ca6b98fad215c21c317bc494R9

We know they exist.

spolu avatar Mar 05 '23 17:03 spolu

what does im mean in im_start/im_end?

zyxue avatar Jun 01 '23 22:06 zyxue

They are the special tokens used in the OpenAI Chat format as it gets translated and presented to the model.

spolu avatar Jun 02 '23 07:06 spolu

@zyxue It seems to be "input message". Check here.

youkaichao avatar Jun 02 '23 13:06 youkaichao

@spolu

It's possible that the APi side uses and extended tokenizer like: https://github.com/openai/tiktoken/tree/main#extending-tiktoken

microsoftbuild avatar Aug 12 '23 12:08 microsoftbuild