Tiktoken says ChatGPT's API model, `gpt-3.5-turbo`, uses the cl100k_base encoder, but the openai package appears to use p50k_base
Describe the bug
Tiktoken (https://github.com/openai/tiktoken/blob/3e8620030c68d2fd6d4ec6d38426e7a1983661f5/tiktoken/model.py#L14) lists cl100k_base as the encoder for ChatGPT's API model, gpt-3.5-turbo. However, when using the openai package, if I truncate my prompt with the cl100k_base encoder I get the error below, whereas with p50k_base I don't. So it appears that either the correct tokenizer is p50k_base, or the wrong tokenizer is set in openai.
openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 4104 tokens. Please reduce the length of the messages.
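For reference, tiktoken's own model-to-encoding mapping can be queried directly; a small check (the sample text is only illustrative):
import tiktoken

# Ask tiktoken which encoding it maps this model name to.
print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)  # prints "cl100k_base"

# The two encodings generally produce different token IDs and counts for the same text.
sample = "testing123123"
print(len(tiktoken.get_encoding("cl100k_base").encode(sample)))
print(len(tiktoken.get_encoding("p50k_base").encode(sample)))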
To Reproduce
It will fail if you run the code as is with full_text longer than 4061 tokens. However, if you change
tokenizer = tiktoken.get_encoding("cl100k_base" if model_name == "gpt-3.5-turbo" else "p50k_base")
to
tokenizer = tiktoken.get_encoding("p50k_base")
everything works as expected.
Code snippets
import tiktoken
from langchain import OpenAI, PromptTemplate
full_text = "The content of this article, https://nymag.com/news/features/mark-zuckerberg-2012-5/?mid=nymag_press"
model_name = "gpt-3.5-turbo"
num_keyphrases = 5
# Define the prompt template
template = """Suggest the top {num_keyphrases} keywords that best describe the most important topics or themes in following text:
###
TEXT: {full_text}
###
Top {num_keyphrases} Keywords:"""
prompt_template = PromptTemplate(
    input_variables=["num_keyphrases", "full_text"], template=template
)
# Get the top keyphrases from the article
# Load the model
llm = OpenAI(model_name=model_name, temperature=0)
# Get the maximum length of the text
tokenizer = tiktoken.get_encoding("cl100k_base" if model_name == "gpt-3.5-turbo" else "p50k_base")
model_context_size = (
    4097 if model_name == "gpt-3.5-turbo" else llm.modelname_to_contextsize(model_name)
)
text_max_length = model_context_size - len(
    tokenizer.encode(
        prompt_template.format(num_keyphrases=num_keyphrases, full_text="")
    )
)
# Truncate the text if it is too long
full_text = tokenizer.decode(tokenizer.encode(full_text)[:text_max_length])
# Get the top keyphrases from the article
response = llm(
    prompt_template.format(num_keyphrases=num_keyphrases, full_text=full_text)
)
print(response)
OS
macOS
Python version
Python v3.10.9
Library version
openai v0.27.2
@CMobley7 that's not a bug; it's just specific to how tokens are actually counted for ChatML models. See https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb (section 6) for how to count tokens properly for models that take messages instead of the traditional prompt.
Pasted from there for quick reference:
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo-0301":  # note: future models may deviate from this
        num_tokens = 0
        for message in messages:
            num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
            for key, value in message.items():
                num_tokens += len(encoding.encode(value))
                if key == "name":  # if there's a name, the role is omitted
                    num_tokens += -1  # role is always required and always 1 token
        num_tokens += 2  # every reply is primed with <im_start>assistant
        return num_tokens
    else:
        raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.
See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
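Applied to the snippet above, a quick sanity check shows where the extra tokens come from (the prompt string below is just a stand-in for the formatted prompt):
import tiktoken

# Compare a plain encode() count with the ChatML-aware count for the same
# prompt wrapped in a single user message.
encoding = tiktoken.get_encoding("cl100k_base")
prompt = "Suggest the top 5 keywords ..."  # stand-in for the formatted prompt

plain_count = len(encoding.encode(prompt))
chat_count = num_tokens_from_messages(
    [{"role": "user", "content": prompt}], model="gpt-3.5-turbo-0301"
)
print(plain_count, chat_count)  # chat_count == plain_count + 7 for one user message
Those 7 extra tokens (4 for the message framing, 1 for the role string, 2 for the reply priming) line up with the 4104 vs 4097 gap in the error above.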
CMobley7 is right tho, this is still a problem. It has nothing to do with the token count and everything to do with the wrong encoding.
tiktoken.get_encoding("r50k_base").encode("testing123123")
33407, 10163, 10163
tiktoken.get_encoding("cl100k_base").encode("testing123123")
9016, 4513, 4513
gpt tokenizer : (https://platform.openai.com/tokenizer) testing123123 Tokens 3 Characters 13 [33407, 10163, 10163]
@Ziggyware https://platform.openai.com/tokenizer only has the tokenizers for the Codex models and GPT-3. gpt-3.5-turbo and gpt-4 have a different tokenizer from base GPT-3 models like davinci or text-davinci-003, so there's nothing wrong with the encoding.
Again:
- davinci, text-davinci-003, and other models with a similar architecture - p50k_base (or r50k_base, I'm not sure exactly, but they're really close anyway)
- gpt-3.5-turbo, gpt-4 - cl100k_base (very different from other tokenizers)
OpenAI's tokenizer website does not have cl100k_base on it yet; you can use alternative websites, or write your own small script with tiktoken (a rough sketch follows below).
Random one I found:
https://www.typeblock.co/resources/tokenizer
The result there corresponds to your result with cl100k_base.
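A rough sketch of such a script with tiktoken (the sample text and encoding names are just illustrative):
import tiktoken

def show_tokens(text: str, encoding_name: str = "cl100k_base") -> None:
    # Print the token count and token IDs the same way the web tokenizers do.
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    print(f"{encoding_name}: {len(tokens)} tokens, {len(text)} characters")
    print(tokens)

show_tokens("testing123123", "r50k_base")    # matches https://platform.openai.com/tokenizer
show_tokens("testing123123", "cl100k_base")  # matches what gpt-3.5-turbo / gpt-4 actually use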
Thanks, I guess it's just really confusing, I'm sure I'm not the only one scratching my head :)
@Ziggyware I think it's mainly confusing because OpenAI still hasn't updated their website to include the GPT 3.5/4 tokenizer, which is quite baffling, I agree.
Still trying to figure out this part. My code was working well from gpt-3.5 through gpt-4-turbo, but I started getting the error "get_num_tokens_from_messages() is not presently implemented for model cl100k_base" with gpt-4o.
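That error usually means the token-counting helper only recognizes a hardcoded list of model names. One workaround is to ask tiktoken for the encoding directly and fall back when the model is unknown; a minimal sketch, assuming a tiktoken version recent enough to know about gpt-4o (which maps to o200k_base):
import tiktoken

def encoding_for(model: str) -> tiktoken.Encoding:
    # Prefer tiktoken's own model-to-encoding mapping; fall back for unknown names.
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # cl100k_base covers gpt-3.5-turbo / gpt-4; newer models such as gpt-4o use o200k_base.
        return tiktoken.get_encoding("o200k_base")

print(encoding_for("gpt-4o").name)         # "o200k_base" on recent tiktoken versions
print(encoding_for("gpt-3.5-turbo").name)  # "cl100k_base"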