Tiktoken says ChatGPT's API model, `gpt-3.5-turbo`, uses the cl100k_base encoder, but the openai package appears to use p50k_base
Describe the bug
Tiktoken (https://github.com/openai/tiktoken/blob/3e8620030c68d2fd6d4ec6d38426e7a1983661f5/tiktoken/model.py#L14) lists cl100k_base as the encoder for ChatGPT's API model, gpt-3.5-turbo. However, when using the openai package, if I truncate my prompt with the cl100k_base encoder I get the error below, whereas with p50k_base I don't. So it appears that either the correct tokenizer is p50k_base, or the wrong tokenizer is set in openai.
openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 4104 tokens. Please reduce the length of the messages.
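For reference, tiktoken's own model-to-encoding mapping can be queried directly; a small check (the sample text is only illustrative):
import tiktoken

# Ask tiktoken which encoding it maps this model name to.
print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)  # prints "cl100k_base"

# The two encodings generally produce different token IDs and counts for the same text.
sample = "testing123123"
print(len(tiktoken.get_encoding("cl100k_base").encode(sample)))
print(len(tiktoken.get_encoding("p50k_base").encode(sample)))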
To Reproduce
It will fail if you run the code as is with full_text longer than 4061 tokens. However, if you change
tokenizer = tiktoken.get_encoding("cl100k_base" if model_name == "gpt-3.5-turbo" else "p50k_base")
to
tokenizer = tiktoken.get_encoding("p50k_base")
everything works as expected.
Code snippets
import tiktoken
from langchain import OpenAI, PromptTemplate
full_text = "The content of this article, https://nymag.com/news/features/mark-zuckerberg-2012-5/?mid=nymag_press"
model_name = "gpt-3.5-turbo"
num_keyphrases = 5
# Define the prompt template
template = """Suggest the top {num_keyphrases} keywords that best describe the most important topics or themes in following text:
###
TEXT: {full_text}
###
Top {num_keyphrases} Keywords:"""
prompt_template = PromptTemplate(
    input_variables=["num_keyphrases", "full_text"], template=template
)
# Get the top keyphrases from the article
# Load the model
llm = OpenAI(model_name=model_name, temperature=0)
# Get the maximum length of the text
tokenizer = tiktoken.get_encoding("cl100k_base" if model_name == "gpt-3.5-turbo" else "p50k_base")
model_context_size = (
    4097 if model_name == "gpt-3.5-turbo" else llm.modelname_to_contextsize(model_name)
)
text_max_length = model_context_size - len(
    tokenizer.encode(
        prompt_template.format(num_keyphrases=num_keyphrases, full_text="")
    )
)
# Truncate the text if it is too long
full_text = tokenizer.decode(tokenizer.encode(full_text)[:text_max_length])
# Get the top keyphrases from the article
response = llm(
    prompt_template.format(num_keyphrases=num_keyphrases, full_text=full_text)
)
print(response)
OS
macOS
Python version
Python v3.10.9
Library version
openai v0.27.2
@CMobley7 that's not a bug; it's just specific to how tokens are actually counted for ChatML models. See https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb (section 6) for how to count tokens properly for models that take messages instead of the traditional prompt.
Pasted from there for quick reference:
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo-0301":  # note: future models may deviate from this
        num_tokens = 0
        for message in messages:
            num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
            for key, value in message.items():
                num_tokens += len(encoding.encode(value))
                if key == "name":  # if there's a name, the role is omitted
                    num_tokens += -1  # role is always required and always 1 token
        num_tokens += 2  # every reply is primed with <im_start>assistant
        return num_tokens
    else:
        raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.
See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
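Applied to the snippet above, a quick sanity check shows where the extra tokens come from (the prompt string below is just a stand-in for the formatted prompt):
import tiktoken

# Compare a plain encode() count with the ChatML-aware count for the same
# prompt wrapped in a single user message.
encoding = tiktoken.get_encoding("cl100k_base")
prompt = "Suggest the top 5 keywords ..."  # stand-in for the formatted prompt

plain_count = len(encoding.encode(prompt))
chat_count = num_tokens_from_messages(
    [{"role": "user", "content": prompt}], model="gpt-3.5-turbo-0301"
)
print(plain_count, chat_count)  # chat_count == plain_count + 7 for one user message
Those 7 extra tokens (4 for the message framing, 1 for the role string, 2 for the reply priming) line up with the 4104 vs 4097 gap in the error above.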
CMobley7 is right tho, this is still a problem. It has nothing to do with the token count and everything to do with the wrong encoding.
tiktoken.get_encoding("r50k_base").encode("testing123123")
33407, 10163, 10163
tiktoken.get_encoding("cl100k_base").encode("testing123123")
9016, 4513, 4513
gpt tokenizer : (https://platform.openai.com/tokenizer) testing123123 Tokens 3 Characters 13 [33407, 10163, 10163]
@Ziggyware https://platform.openai.com/tokenizer only has the tokenizers for the Codex models and GPT-3. gpt-3.5-turbo and gpt-4 have a different tokenizer from base GPT-3 models like davinci or text-davinci-003, so there's nothing wrong with the encoding.
Again:
- davinci, text-davinci-003, and other models with a similar architecture - p50k_base (or r50k_base, I'm not sure exactly, but they're really close anyway)
- gpt-3.5-turbo, gpt-4 - cl100k_base (very different from other tokenizers)
OpenAI's tokenizer website does not have cl100k_base on it yet; you can use alternative websites, or write your own small script with tiktoken (a rough sketch follows below).
Random one I found:
https://www.typeblock.co/resources/tokenizer
The result there corresponds to your result with cl100k_base.
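A rough sketch of such a script with tiktoken (the sample text and encoding names are just illustrative):
import tiktoken

def show_tokens(text: str, encoding_name: str = "cl100k_base") -> None:
    # Print the token count and token IDs the same way the web tokenizers do.
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    print(f"{encoding_name}: {len(tokens)} tokens, {len(text)} characters")
    print(tokens)

show_tokens("testing123123", "r50k_base")    # matches https://platform.openai.com/tokenizer
show_tokens("testing123123", "cl100k_base")  # matches what gpt-3.5-turbo / gpt-4 actually use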
Thanks, I guess it's just really confusing, I'm sure I'm not the only one scratching my head :)
@Ziggyware I think it's mainly confusing because OpenAI still hasn't updated their website to include the GPT 3.5/4 tokenizer, which is quite baffling, I agree.
Still trying to figure out this part. My code was working well from gpt-3.5 through gpt-4-turbo, but I started getting the error "get_num_tokens_from_messages() is not presently implemented for model cl100k_base" with gpt-4o.
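That error usually means the token-counting helper only recognizes a hardcoded list of model names. One workaround is to ask tiktoken for the encoding directly and fall back when the model is unknown; a minimal sketch, assuming a tiktoken version recent enough to know about gpt-4o (which maps to o200k_base):
import tiktoken

def encoding_for(model: str) -> tiktoken.Encoding:
    # Prefer tiktoken's own model-to-encoding mapping; fall back for unknown names.
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # cl100k_base covers gpt-3.5-turbo / gpt-4; newer models such as gpt-4o use o200k_base.
        return tiktoken.get_encoding("o200k_base")

print(encoding_for("gpt-4o").name)         # "o200k_base" on recent tiktoken versions
print(encoding_for("gpt-3.5-turbo").name)  # "cl100k_base"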