
Differences between the GPT-4 and Llama tokenizers lead to a mismatch in token counts during prompt pruning.

Open · rastna12 opened this issue 1 year ago · 1 comment


Relevant environment info

- OS: Windows 11 Pro
- Continue: v0.8.12
- IDE: VS Code
- Model: Codellama 70b (Free Trial), or any Llama-based model that relies on the Llama tokenizer

Description

Current methods to estimate prompt token count use the GPT-4 tokenizer. However, Llama-based models use a different tokenizer, which leads to a mismatch between the token count Continue estimates and what the model actually receives. The Llama tokenizer consistently produced ~30% more tokens than the GPT-4 tokenizer. Depending on the configuration of the LLM server, this can cause inference errors when the prompt exceeds the maximum number of allowable tokens. See this Discord discussion on the topic.

Differences between the GPT-4 tokenizer and the Llama tokenizer can be explored using these links:

- https://platform.openai.com/tokenizer
- https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/

Current token count estimation is done around here: https://github.com/continuedev/continue/blob/c8d793ec4599b954c4ec41fe4187d8e676e0b048/core/llm/countTokens.ts#L12

@sestinj has identified a JavaScript Llama tokenizer that may be worth exploring: https://github.com/belladoreai/llama-tokenizer-js.
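To see the gap concretely, the two counts can be compared directly in a few lines. The sketch below assumes the js-tiktoken and llama-tokenizer-js packages; it is illustrative only and is not Continue's actual counting code:

```typescript
// Minimal comparison of GPT-4 vs. Llama token counts for the same text.
// Assumes `npm install js-tiktoken llama-tokenizer-js`; illustrative only.
import { encodingForModel } from "js-tiktoken";
import llamaTokenizer from "llama-tokenizer-js";

const text = "int main() { return 0; }"; // in practice, a whole source file

// What a GPT-4-based estimate reports vs. what a Llama model actually receives
const gpt4Count = encodingForModel("gpt-4").encode(text).length;
const llamaCount = llamaTokenizer.encode(text).length;

console.log({ gpt4Count, llamaCount, ratio: llamaCount / gpt4Count });
// For typical source code, the Llama count tends to run noticeably higher.
```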

To reproduce

  1. Select the Codellama (Free Trial) option from the list of default models (or any Llama-based model)
  2. Determine max prompt token length for that model
  3. Create a prompt of that size, or one that knowingly exceeds it. To produce this I usually pass a whole C++ source file in as reference via the "@" prompt operator.
  4. Submit the prompt to the LLM and observe the resulting error, caused by the token count difference despite prompt pruning being performed by Continue

Log output

Continue error: HTTP 500 Internal Server Error from https://node-proxy-server-blue-l6vsfbzhba-uw.a.run.app/stream_complete

Error in Continue free trial server: 403 Input validation error: `inputs` tokens + `max_new_tokens` must be <= 4097. Given: 5437 `inputs` tokens and 1024 `max_new_tokens`
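In other words, the server enforces a 4097-token budget shared between the prompt and the completion: 5437 input tokens + 1024 `max_new_tokens` = 6461 tokens, well over the 4097 allowed, even though Continue's GPT-4-based estimate had pruned the prompt to fit.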

rastna12 commented on Feb 19, 2024

@rastna12 this is now available in pre-release. Here's the commit that did it: https://github.com/continuedev/continue/commit/e8bbdc06a192a9d6576b7019a164393c16019306
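For anyone curious about the general shape of such a fix, the idea is to pick the tokenizer by model family. Below is a minimal sketch under that assumption, again using js-tiktoken and llama-tokenizer-js; see the linked commit for what Continue actually does:

```typescript
// Sketch: route token counting by model family. Illustrative only;
// not the code from the commit above.
import { encodingForModel } from "js-tiktoken";
import llamaTokenizer from "llama-tokenizer-js";

function countTokens(text: string, modelName: string): number {
  // "codellama-70b", "llama2-13b", etc. all match /llama/i
  if (/llama/i.test(modelName)) {
    return llamaTokenizer.encode(text).length;
  }
  // Fall back to the GPT-4 encoding for other models
  return encodingForModel("gpt-4").encode(text).length;
}
```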

Let me know how it looks, and I'll wait to close the issue until you've verified.

sestinj commented on Feb 27, 2024

I think this has been resolved given conversations in Discord. If I'm mistaken please re-open!

sestinj commented on Mar 20, 2024