tiktoken
tiktoken copied to clipboard
Add im_start / im_end in cl100k_base
Not really directly useful given the Chat API...
But triangulating:
- https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb
- https://github.com/openai/tiktoken/commit/ec7c121e385bf1675312c6c33734de6b392890c4#diff-0d973848bd229418209db2c46c86167000845592ca6b98fad215c21c317bc494R9
We know they exist.
what does im
mean in im_start
/im_end
?
They are the special tokens used in the OpenAI Chat format as it gets translated and presented to the model.
@zyxue It seems to be "input message". Check here.
@spolu
It's possible that the APi side uses and extended tokenizer like: https://github.com/openai/tiktoken/tree/main#extending-tiktoken