
Prompt token count discrepancy

Open · teoh opened this issue 2 years ago · 3 comments

Hello,

I noticed a discrepancy in prompt (not completion) token counting. Here's a minimum working example:

import os

import openai
from transformers import GPT2TokenizerFast

os.environ["OPENAI_API_KEY"] = "for you"

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

model_name = "text-davinci-002"
prompt = 'Some choices are given below. It is provided in a numbered list (1 to 1),where each item in the list corresponds to a summary.\n---------------------\n(1) A serial killer is typically a person who kills three or more people, with the murders taking place over more than a month and including a significant period of time between them. The Federal Bureau of Investigation (FBI) defines serial murder as "a series of two or more murders, committed as separate events, usually, but not always, by one offender acting alone".   == Identified serial killers ==   == Unidentified serial killers == This is a list of unidentified serial killers who committed crimes within the United States.   == See also == List of rampage killers in the United States List of mass shootings in the United StatesInternational:  List of serial killers by country List of serial killers by number of victims   == References ==   == Bibliography ==\n\n\n---------------------\nUsing only the choices above and not prior knowledge, return the choice that is most relevant to the question: \'How many serial killers in the US are there?\'\nProvide choice in the following format: \'ANSWER: <number>\' and explain why this summary was selected in relation to the question.\n'
params = {'temperature': 0.0, 'max_tokens': 256, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'best_of': 1}

completion = openai.Completion.create(model=model_name, prompt=prompt, **params)
print(completion)

prompt_token_count = len(tokenizer(prompt)["input_ids"])
print(f"prompt token count is {prompt_token_count}, which is 5 more tokens than the output above")

This will print:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": ...
    }
  ],
  ...
  "model": "text-davinci-002",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 51,
    "prompt_tokens": 252,              <---------------- what openai counts
    "total_tokens": 303
  }
}
prompt token count is 257, which is 5 more tokens than the output above

That is, the OpenAI API counts 252 tokens in the prompt, but I'm counting 257. According to https://beta.openai.com/tokenizer, the tokenizer corresponds to transformers.GPT2TokenizerFast, which is what I'm using above. I have also pasted the prompt text (after running it through Python's print()) into that page, and it reports 257 as well. [screenshot: web tokenizer showing 257 tokens]

Below is my requirements.txt:

openai==0.25.0
tokenizers==0.13.2

Is there something that I am missing here? Thanks a lot!

teoh · Dec 12 '22 03:12

Above, if you set

prompt = '".   == Identified serial killers ==   == Unidentified serial killers == This is a list of unidentified serial killers who committed crimes within the United States.   == See also == List of rampage killers in the United States List of mass shootings in the United StatesInternational:  List of serial killers by country List of serial killers by number of victims   == References ==   == '

then the API response gives "prompt_tokens": 75, while the script prints: prompt token count is 80.


teoh · Dec 12 '22 03:12

Does the discrepancy disappear if you use the "Codex" tab in the tokenizer web app? In my own tests, both text-davinci-002 and text-davinci-003 appear to be using the Codex tokenizer.

It's even easier to see the discrepancy when tokenizing code, because one of the primary differences seems to be that the Codex tokenizer collapses whitespace.
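If it helps, here is a minimal sketch for checking this locally. It assumes the tiktoken package (not part of the original report, which pins openai==0.25.0 and tokenizers==0.13.2) and compares the GPT-2 encoding that GPT2TokenizerFast uses against the encoding tiktoken maps text-davinci-002 to, which, as far as I can tell, includes the multi-space tokens:

import tiktoken

# The same whitespace-heavy fragment from the shortened prompt above.
prompt = '".   == Identified serial killers ==   == Unidentified serial killers =='

gpt2_enc = tiktoken.get_encoding("gpt2")                       # matches GPT2TokenizerFast
davinci_enc = tiktoken.encoding_for_model("text-davinci-002")  # p50k_base, with multi-space tokens

print(len(gpt2_enc.encode(prompt)))     # each space in a run typically counts separately
print(len(davinci_enc.encode(prompt)))  # fewer tokens: runs of spaces collapse

If the second count lines up with the prompt_tokens the API reports for that prompt, that would point to the whitespace handling as the source of the gap.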

veered · Dec 16 '22 19:12

Is this still an issue? I suspect the tokenizer difference mentioned above was the root cause.

hallacy · Mar 07 '23 03:03