Inconsistent `prompt_tokens` definition between `text-davinci-002` and `code-davinci-002`
Environment: Ubuntu 18.04.6, openai-python 0.23.1, Python 3.8.13
I'm getting unexpected results when using the `prompt_tokens` value returned by the completion API with the davinci Codex model (`code-davinci-002`).
Consider the following function to reconstruct the prompt using the API's response:
import openai
from openai.openai_object import OpenAIObject


def test_prompt_idx(prompt: str, engine: str):
    # echo=True returns the prompt tokens followed by the completion tokens
    response: OpenAIObject = openai.Completion.create(
        prompt=prompt,
        stop=["\n"],
        temperature=0.0,
        engine=engine,
        max_tokens=32,
        logprobs=5,
        echo=True,
    )
    # Number of prompt tokens as reported by the API
    n_prompt_tokens: int = response["usage"]["prompt_tokens"]
    prompt_tokens = [
        {
            "val": response["choices"][0]["logprobs"]["tokens"][i],
            "options": response["choices"][0]["logprobs"]["top_logprobs"][i],
            "logprob": response["choices"][0]["logprobs"]["token_logprobs"][i]
            if i != 0
            else 0.0,  # first token has logprob None
        }
        for i in range(n_prompt_tokens)
    ]
    # The first n_prompt_tokens tokens should concatenate back to the prompt
    reconstructed_prompt = "".join(token["val"] for token in prompt_tokens)
    assert reconstructed_prompt == prompt
When I use `text-davinci-002`, the snippet runs fine:
test_prompt_idx(
    prompt="""import numpy as np
a = np.array(object=[0,1,2])
b = np.array(""",
    engine="text-davinci-002",
)

test_prompt_idx(
    prompt="""import numpy as np
a = np.array(object=[0,1,2])
b = np.array(object=""",
    engine="text-davinci-002",
)
However, when I use `code-davinci-002`, the same snippet fails:
test_prompt_idx(
    prompt="""import numpy as np
a = np.array(object=[0,1,2])
b = np.array(""",
    engine="code-davinci-002",
)

test_prompt_idx(
    prompt="""import numpy as np
a = np.array(object=[0,1,2])
b = np.array(object=""",
    engine="code-davinci-002",
)
Comparing `reconstructed_prompt` with `prompt` shows that, for `code-davinci-002`, the value of `n_prompt_tokens` appears to be one less than expected.
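To check the off-by-one without relying on `usage.prompt_tokens`, one option is to count how many echoed tokens are needed to rebuild the prompt and compare that count with the reported value. The helper below is only a sketch: `tokens_needed_for_prompt` is a hypothetical name, and the token lists in the usage note are made up for illustration, not actual tokenizer output.

```python
from typing import List, Optional


def tokens_needed_for_prompt(tokens: List[str], prompt: str) -> Optional[int]:
    """Return the smallest k such that "".join(tokens[:k]) == prompt.

    Returns None when no token boundary lines up with the end of the prompt.
    `tokens` is assumed to be the echoed token list, i.e.
    response["choices"][0]["logprobs"]["tokens"] with echo=True.
    """
    if prompt == "":
        return 0
    text = ""
    for k, token in enumerate(tokens, start=1):
        text += token
        if text == prompt:
            return k
        if len(text) >= len(prompt):
            # We reached or passed the end of the prompt without an exact match.
            return None
    return None


# Illustrative (made-up) tokenizations, not real API output.
aligned = ["b", " =", " np", ".", "array", "(", "object"]
merged = ["b", " =", " np", ".", "array", "(", "object", "=[", "0"]

k_aligned = tokens_needed_for_prompt(aligned, "b = np.array(")
k_merged = tokens_needed_for_prompt(merged, "b = np.array(object=")
```

If the returned count is `usage.prompt_tokens + 1`, the reported value is one short. If the result is `None`, the prompt does not end on a token boundary in the echoed tokens at all.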
I've also observed that this can lead to the Codex model changing the last token of the prompt.
For instance, if I update `test_prompt_idx()` to use `n_prompt_tokens + 1` instead, the following snippet passes:
test_prompt_idx(
    prompt="""import numpy as np
a = np.array(object=[0,1,2])
b = np.array(""",
    engine="code-davinci-002",
)
because the first completion token returned by Codex is `object` (token id 15252 according to the tokenizer).
However, the following fails:
test_prompt_idx(
    prompt="""import numpy as np
a = np.array(object=[0,1,2])
b = np.array(object=""",
    engine="code-davinci-002",
)
because the first completion token returned is `=[` (token id 41888) instead of the last token of the given prompt, which is `=` (token id 28).
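The `=[` case can be detected by looking for an echoed token that starts inside the prompt and ends beyond it. The following is a minimal sketch under the assumption that the concatenated tokens begin with the prompt text. `find_straddling_token` is a hypothetical helper, and the token list in the example is illustrative rather than taken from a real response.

```python
from typing import List, Optional, Tuple


def find_straddling_token(
    tokens: List[str], prompt: str
) -> Optional[Tuple[int, str, str]]:
    """Find a token that starts inside the prompt and ends after it.

    Returns (index, part_inside_prompt, part_after_prompt), or None when the
    prompt ends exactly on a token boundary.
    """
    start = 0
    for index, token in enumerate(tokens):
        end = start + len(token)
        if start < len(prompt) < end:
            # The prompt ends in the middle of this token.
            cut = len(prompt) - start
            return index, token[:cut], token[cut:]
        start = end
    return None


# Illustrative (made-up) token list where "=" and "[" were merged into "=[".
example_tokens = ["object", "=[", "0"]
straddle = find_straddling_token(example_tokens, "object=")
no_straddle = find_straddling_token(example_tokens, "object")
```

A non-`None` result means no prefix of the echoed tokens can reproduce the prompt exactly, which matches the failure described above.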