chatgpttokentool - repeated characters?
Take a look at your constants:
const C0 = 'NORabcdefghilnopqrstuvy'; // plus space that is not following a space
const C6 = 'CHLMPQSTUVfkmspwx ';
The characters 'f', 's' and 'p' appear twice. This can't be correct?
I am assuming that C0 is used for these letters?
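A quick way to spot this kind of overlap is to check every character against all classes. The following is only a sketch: `findDuplicates` is a hypothetical helper, not part of the tool, and only the two classes quoted above are included (the last character of C6 is assumed to be a non-breaking space).

```javascript
// Character classes as quoted in this thread (only C0 and C6 are shown here).
const classes = {
  C0: 'NORabcdefghilnopqrstuvy',
  C6: 'CHLMPQSTUVfkmspwx\u00A0', // assumption: trailing non-breaking space
};

// Hypothetical helper: maps each character that occurs in more than one
// class to the names of the classes that contain it.
function findDuplicates(classMap) {
  const seen = new Map();
  for (const [name, chars] of Object.entries(classMap)) {
    // new Set(...) ignores repeats inside a single class
    for (const ch of new Set(chars)) {
      if (!seen.has(ch)) seen.set(ch, []);
      seen.get(ch).push(name);
    }
  }
  return new Map([...seen].filter(([, names]) => names.length > 1));
}

console.log([...findDuplicates(classes).keys()].sort().join('')); // prints "fps"
```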
Hi! Thank you for noticing. You are right, there was a mistake there. I wrote that tool out of interest but ended up never using it. Would you actually use it? I was wondering whether to delete it: as I found out later, the exact process is not quite right, and it targets a token encoding that is probably not used anymore anyway. The estimation part (where these constants come from) is, well, an estimation: it depends on language and text type, and it is also tuned to that obsolete token encoding. :-) I mostly left the tool in for its educational value and because I described it in my blog https://www.stoerr.net/blog/2023-06-23-chatgpt-token-counter.html - the idea of how to make such an estimation is somewhat interesting.
Hey,
Thank you for getting back to me, and thank you for your blog link, which I've just read.
> Would you actually use that?
Perhaps I'm being naive, as I don't know a great deal about tokenizers yet, but yes, I would use something like that as a first approach. What I was looking for was a model- and language-agnostic estimator that is fast and, if anything, slightly over-estimates the token count.
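One simple way to lean towards over-estimation is to pad a raw estimate by a safety margin and round up. This is only a sketch: `padEstimate` and the 10% default margin are assumptions made for illustration, not values from the tool or the blog.

```javascript
// Hypothetical helper: pads a raw token estimate so that it errs on the high side.
// The default margin of 10% is an arbitrary assumption; tune it against real data.
function padEstimate(rawEstimate, margin = 0.1) {
  return Math.ceil(rawEstimate * (1 + margin));
}

console.log(padEstimate(100, 0.25)); // prints 125
```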
I took the description and turned it into a C# method, which is below. If I use this, it will be for an open source project running local models.
I made a change in detecting surrogate pairs, as a C# char is a UTF-16 code unit. I am open to improvements and/or changing it.
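JavaScript strings are sequences of UTF-16 code units as well, so the same issue can be illustrated there. In this minimal sketch (`countCharacters` is a hypothetical name), a high surrogate followed by a low surrogate is counted as one character, and a lone surrogate is still counted once.

```javascript
// Hypothetical helper: counts characters so that a valid surrogate pair counts once.
function countCharacters(text) {
  let count = 0;
  for (let n = 0; n < text.length; n++) {
    const unit = text.charCodeAt(n);
    const isHighSurrogate = unit >= 0xd800 && unit <= 0xdbff;
    if (isHighSurrogate && n + 1 < text.length) {
      const next = text.charCodeAt(n + 1);
      // Skip the low surrogate that completes the pair.
      if (next >= 0xdc00 && next <= 0xdfff) n++;
    }
    count++;
  }
  return count;
}

console.log('a\u{1F600}b'.length);           // prints 4 (UTF-16 code units)
console.log(countCharacters('a\u{1F600}b')); // prints 3 (characters)
```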
Thank you for your code and blog.
Andy
/// <summary>
/// Estimates the number of tokens using a language and model agnostic algorithm.
/// </summary>
/// <remarks>
/// Serves as a string extension.
/// </remarks>
public static double EstimateTokens(this string src)
{
    // Acknowledgement: Implementation of algorithm described by "stoerr" with changes. Thank you!
    // https://community.openai.com/t/how-to-do-a-quick-estimation-of-token-count-of-a-text/277764

    // C0 = 'NORabcdefghilnopqrstuvy' // plus space that is not following a space
    const double C0 = 0.2020182639633662;
    // C1 = '"#%)*+56789<>?@Z[\\]^|§«äç\''
    const double C1 = 0.4790556468110302;
    // C2 = '-.ABDEFGIKWY_\r\tz{ü'
    const double C2 = 0.3042805747355606;
    // C3 = ',01234:~Üß' // incl. unicode characters > 255
    const double C3 = 0.6581971122770317;
    // C4 = space that is following a space
    const double C4 = 0.08086208692099685;
    // C5 = '!$&(/;=JX`j\n}ö'
    const double C5 = 0.4157646363858563;
    // C6 = 'CHLMPQSTUVkmwx' plus the non-breaking space (U+00A0);
    // 'f', 's' and 'p' also appear in the original C6 but are counted in C0 here
    const double C6 = 0.2372744211422125;
    // Others
    const double CX = 0.980083857442348;

    double sum = 0.0;
    int len = src.Length;
    for (int n = 0; n < len; ++n)
    {
        var c = src[n];
        switch (c)
        {
            case 'N':
            case 'O':
            case 'R':
            case 'a':
            case 'b':
            case 'c':
            case 'd':
            case 'e':
            case 'f':
            case 'g':
            case 'h':
            case 'i':
            case 'l':
            case 'n':
            case 'o':
            case 'p':
            case 'q':
            case 'r':
            case 's':
            case 't':
            case 'u':
            case 'v':
            case 'y':
                sum += C0;
                continue;
            case '"':
            case '#':
            case '%':
            case ')':
            case '*':
            case '+':
            case '5':
            case '6':
            case '7':
            case '8':
            case '9':
            case '<':
            case '>':
            case '?':
            case '@':
            case 'Z':
            case '[':
            case '\\':
            case ']':
            case '^':
            case '|':
            case '§':
            // case '«': // removed here; as U+00AB (below U+0100) it falls through to CX in the default branch
            case 'ä':
            case 'ç':
            case '\'':
                sum += C1;
                continue;
            case '-':
            case '.':
            case 'A':
            case 'B':
            case 'D':
            case 'E':
            case 'F':
            case 'G':
            case 'I':
            case 'K':
            case 'W':
            case 'Y':
            case '_':
            case '\r':
            case '\t':
            case 'z':
            case '{':
            case 'ü':
                sum += C2;
                continue;
            case ',':
            case '0':
            case '1':
            case '2':
            case '3':
            case '4':
            case ':':
            case '~':
            case 'Ü':
            case 'ß':
                sum += C3;
                continue;
            // C4 (a space following a space) is handled in the default branch
            case '!':
            case '$':
            case '&':
            case '(':
            case '/':
            case ';':
            case '=':
            case 'J':
            case 'X':
            case '`':
            case 'j':
            case '\n':
            case '}':
            case 'ö':
                sum += C5;
                continue;
            case 'C':
            case 'H':
            case 'L':
            case 'M':
            case 'P':
            case 'Q':
            case 'S':
            case 'T':
            case 'U':
            case 'V':
            case 'k':
            case 'm':
            case 'w':
            case 'x':
            case '\u00A0':
                sum += C6;
                continue;
            default:
                if (c == ' ')
                {
                    // A space after a space is C4; any other space counts as C0.
                    sum += (n > 0 && src[n - 1] == ' ') ? C4 : C0;
                    continue;
                }
                if (c > '\u00FF')
                {
                    if (char.IsSurrogate(c))
                    {
                        // Count a valid surrogate pair once by skipping its low surrogate.
                        if (n + 1 < len && char.IsHighSurrogate(c) && char.IsLowSurrogate(src[n + 1]))
                        {
                            n += 1;
                        }
                        sum += CX;
                        continue;
                    }
                    sum += C3;
                    continue;
                }
                sum += CX;
                continue;
        }
    }
    return sum;
}
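For comparison, the same weighting can be written as a lookup table, which makes it harder for a character to end up in two classes unnoticed. This is a sketch in JavaScript, not the original tool: the weights and classes are copied from the C# method above (with 'f', 's', 'p' counted in C0 and '«' left out), and the names `CLASSES`, `WEIGHTS` and `estimateTokens` are assumptions made for illustration.

```javascript
// [characters, weight] pairs, copied from the C# method above.
const C0 = 0.2020182639633662;
const C3 = 0.6581971122770317;
const C4 = 0.08086208692099685; // space that follows a space
const CX = 0.980083857442348;   // everything else
const CLASSES = [
  ['NORabcdefghilnopqrstuvy', C0],
  ['"#%)*+56789<>?@Z[\\]^|§äç\'', 0.4790556468110302],
  ['-.ABDEFGIKWY_\r\tz{ü', 0.3042805747355606],
  [',01234:~Üß', C3],
  ['!$&(/;=JX`j\n}ö', 0.4157646363858563],
  ['CHLMPQSTUVkmwx\u00A0', 0.2372744211422125],
];

const WEIGHTS = new Map();
for (const [chars, weight] of CLASSES) {
  for (const ch of chars) WEIGHTS.set(ch, weight);
}

function estimateTokens(text) {
  let sum = 0;
  let prev = '';
  // for...of iterates by code point, so a surrogate pair arrives as one item.
  for (const ch of text) {
    const code = ch.codePointAt(0);
    const isSurrogate = code >= 0xd800 && code <= 0xdfff; // lone surrogate
    if (ch === ' ') sum += prev === ' ' ? C4 : C0;
    else if (WEIGHTS.has(ch)) sum += WEIGHTS.get(ch);
    else if (code > 0xff && code <= 0xffff && !isSurrogate) sum += C3;
    else sum += CX;
    prev = ch;
  }
  return sum;
}

console.log(estimateTokens('Hello, world!'));
```

Because each character is inserted into a single map, a duplicate simply overwrites the earlier weight instead of silently depending on the order of `case` labels.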
Well, it'll be some kind of estimation anyway. There are many different tokenizations for the many different models in use, and tokens have quite different lengths, as you can see by scrolling through https://www.stoerr.net/blog/data/cl100k_decoded.html . The data I used for creating those classes is outdated, as I said. You could repeat the process from the blog with a newer token file if you want more precision, but it'll always be a rough guess - there are libraries that do it exactly if needed.
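To judge how rough the guess is for a given model, an estimator can be compared against exact counts on sample texts. The sketch below assumes the caller supplies both functions; `meanRelativeError`, `estimate` and `exactCount` are hypothetical names, and `exactCount` stands in for whatever exact tokenizer library is used.

```javascript
// Hypothetical helper: average of |estimate - exact| / exact over the samples.
// Samples whose exact count is 0 are skipped to avoid dividing by zero.
function meanRelativeError(samples, estimate, exactCount) {
  let total = 0;
  let used = 0;
  for (const text of samples) {
    const exact = exactCount(text);
    if (exact === 0) continue;
    total += Math.abs(estimate(text) - exact) / exact;
    used++;
  }
  return used === 0 ? 0 : total / used;
}

// Stand-in functions for demonstration only; a real run would pass an
// estimator and an exact tokenizer-based counter instead.
const fakeEstimate = (text) => text.length / 2;
const fakeExact = (text) => text.length / 4;
console.log(meanRelativeError(['abcd', 'abcdefgh'], fakeEstimate, fakeExact)); // prints 1
```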