openai-cookbook: Not clear which encoding to use with gpt-3.5-turbo

I don't see where it says which encoding to use with gpt-3.5-turbo. Can you add that explicitly to both the tiktoken and the turbo pages?

Mar 01 '23 23:03 fredzannarbor

Yes, will do. Use cl100k_base as the encoding.

And if you use tiktoken to count tokens for ChatGPT API calls, for now you can add 4 tokens per message to the token counts of the content and name fields.
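
A rough sketch of that interim rule, for illustration only. The names `estimate_chat_tokens` and `cl100k_encode` are hypothetical and not part of tiktoken or the cookbook. The encoder is passed in as a function so the arithmetic can be checked without downloading the encoding. It counts only the content and name fields plus 4 per message, so it may differ from the fuller function posted later in this thread.

```python
from typing import Callable, Dict, List, Sequence


def estimate_chat_tokens(
    messages: List[Dict[str, str]],
    encode: Callable[[str], Sequence],
    per_message_overhead: int = 4,
) -> int:
    """Estimate prompt tokens: 4 per message plus the tokens in content and name."""
    total = 0
    for message in messages:
        total += per_message_overhead  # interim rule: add 4 per message
        for field in ("content", "name"):
            if field in message:  # name is optional
                total += len(encode(message[field]))
    return total


def cl100k_encode() -> Callable[[str], Sequence]:
    """Return the cl100k_base encode function (needs `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the estimator itself has no dependency

    return tiktoken.get_encoding("cl100k_base").encode
```

With tiktoken installed, the call would be `estimate_chat_tokens(messages, cl100k_encode())`. Any function that maps a string to a sequence of tokens can stand in for the encoder when testing the per-message arithmetic.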

Mar 02 '23 00:03 ted-at-openai

Is there an example? What I am using here is wrong:

import tiktoken
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

print(num_tokens_from_string("""{"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}""", "cl100k_base"))

Mar 02 '23 05:03 xujimu

We will have an update in the docs soon to make the counting more accurate.

Mar 02 '23 14:03 logankilpatrick

In the meantime, you can use:

import tiktoken


def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo-0301":
        num_tokens = 0
        for message in messages:
            num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
            for key, value in message.items():
                num_tokens += len(encoding.encode(value))
                if key == "name":  # if there's a name, the role is omitted
                    num_tokens += -1  # role is always required and always 1 token
        num_tokens += 2  # every reply is primed with <im_start>assistant
        return num_tokens
    else:
        raise NotImplementedError("""num_tokens_from_messages() is not implemented for this model.
See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")

Mar 02 '23 18:03 ted-at-openai

Updated here: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

Mar 03 '23 06:03 ted-at-openai
