
[Issue] Calculate Tokens size?

Open rk-teche opened this issue 2 years ago • 6 comments

The token count is not accurate if we compare it with the GPT-3 tokenizer.

Any help would be appreciated. Thanks

rk-teche avatar Feb 11 '23 19:02 rk-teche

Do you have an example that (still) does not work? The token count is identical for any text that I have checked.

evilDave avatar Mar 03 '23 07:03 evilDave

@rk-teche thank you for your feedback! There could be a discrepancy with the current OpenAI models, especially when compared with the token counts reported by the API. I am going to spend some time moving token calculation over to OpenAI's own tiktoken inside my package as part of the v2 work.
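In the meantime, a rough sketch of what such a cross-check could look like, assuming the `getEncoding`/`encode` API of the js-tiktoken port (the encoding name and the comparison itself are illustrative, not something this package ships today):

```ts
import GPT3Tokenizer from "gpt3-tokenizer";
import { getEncoding } from "js-tiktoken";

const gpt3Tokenizer = new GPT3Tokenizer({ type: "gpt3" });

// r50k_base is the encoding used by the original GPT-3 (davinci) models;
// newer chat models use cl100k_base instead, which is one source of drift.
const tiktoken = getEncoding("r50k_base");

const text = "Hello\n\n";
console.log(gpt3Tokenizer.encode(text).bpe.length); // count from gpt3-tokenizer
console.log(tiktoken.encode(text).length);          // count from tiktoken
```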

lhr0909 avatar Mar 13 '23 14:03 lhr0909

Hi, I found one issue where this package doesn't count newlines properly, while the GPT tokenizer adds 2 tokens per newline.

E.g. for "Hello\n\n" this package returns 2 tokens but the online GPT tokenizer returns 5. Does the package trim the text or something?

[Screenshot: the online GPT tokenizer showing 5 tokens for the entered text "Hello\n\n"]

Aldo111 avatar Apr 11 '23 13:04 Aldo111

I think what you will find is that the online tokenizer does not recognise \n as a newline (it sees the two characters \ and n). Just put in two hard newlines and you will get 2 tokens. Also, look at the token ids for your entered string: [15496, 59, 77, 59, 77], where 59 is \ and 77 is n. Alternatively, test gpt3-tokenizer with the string 'Hello\\n\\n' and it will come out as 5 tokens.
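To illustrate, a minimal sketch (assuming the constructor and `encode` signature shown in the gpt3-tokenizer README):

```ts
import GPT3Tokenizer from "gpt3-tokenizer";

const tokenizer = new GPT3Tokenizer({ type: "gpt3" });

// Real newlines: the JS escape \n produces an actual line break,
// so "Hello\n\n" encodes to 2 tokens.
const realNewlines = tokenizer.encode("Hello\n\n");
console.log(realNewlines.bpe.length); // 2

// Escaped backslashes: "Hello\\n\\n" contains the literal characters
// \ n \ n, which is what the online tokenizer sees when you type \n.
// It encodes to [15496, 59, 77, 59, 77] -- 5 tokens.
const literalBackslashes = tokenizer.encode("Hello\\n\\n");
console.log(literalBackslashes.bpe.length); // 5
```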

kitfit-dave avatar Apr 11 '23 16:04 kitfit-dave

Yep, I'm aware of \ + n being counted separately, since it shows clearly in the tokenizer screenshot above. Given your last example, would the most appropriate approach be to escape the string (or special chars) before passing it to the tokenizer?

Alternatively, what I've ended up doing is treating the tokenizer output as an estimate rather than an exact count (which also generally makes sense given the documentation and long-term model differences) and following the Deep Dive Counting Tokens guide (for GPT-3.5+) in the OpenAI docs. Combining gpt3-tokenizer with the estimates they've provided in that guide is super helpful and brings the results a bit closer to accuracy.
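For reference, a rough sketch of that kind of combined estimate. The 4-tokens-per-message and 3-token reply-priming overhead are the figures quoted in OpenAI's token-counting guide for gpt-3.5-turbo-0301 and are an assumption here; gpt3-tokenizer also uses the GPT-3 BPE rather than cl100k_base, so treat the result as approximate:

```ts
import GPT3Tokenizer from "gpt3-tokenizer";

const tokenizer = new GPT3Tokenizer({ type: "gpt3" });

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Rough estimate of prompt tokens for a chat completion request.
function estimateChatTokens(messages: ChatMessage[]): number {
  let total = 0;
  for (const message of messages) {
    total += 4; // per-message overhead: <|start|>{role}\n{content}<|end|>\n
    total += tokenizer.encode(message.role).bpe.length;
    total += tokenizer.encode(message.content).bpe.length;
  }
  total += 3; // every reply is primed with the assistant prefix
  return total;
}

console.log(
  estimateChatTokens([
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Hello\n\n" },
  ])
);
```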

Aldo111 avatar Apr 12 '23 05:04 Aldo111

For passing to the tokeniser, you should escape in the regular JavaScript way, so Hello followed by two newlines is "Hello\n\n" - is that not giving you 2 tokens? Or are you saying that you think the answer should be 5? The online tokeniser your screenshot is from does not accept escaped characters, only literal characters - if you want a newline there you should type a newline; only then are you comparing apples to apples.

I've been commenting on these issues where folks say "it's an estimate" or "it's not correct" because I switched to this library precisely because it seems to be exactly correct. I feel a lot of work has been done in this project to make it so, and I'd like everyone to benefit from that, knowing the results are accurate.

kitfit-dave avatar Apr 12 '23 06:04 kitfit-dave